On August 29th and 30th, 2018, we experienced two separate partial outages of our EMS software platform in North America. Today we’re providing an incident report that details the nature of these incidents and our responses. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.
On August 29th and 30th, requests to the Aventri registration platform were interrupted for 23 minutes and 38 minutes, respectively. This impacted Registration, Back-End access, and APIs. The catalyst for both partial outages was an extremely high spike in inbound traffic (over 20x our maximum reserve capacity) that we failed to manage properly.
Customers and registrants in North America who attempted to access our platform received intermittent errors or timeouts for the duration of each incident. Mobile apps remained available but were unable to synchronize data during the incidents. Customers in all other regions were not impacted.
Aug 29 15:00 ET: Traffic spike began
Aug 29 15:01 ET: Platform became slow and intermittently unresponsive; our Ops team began to investigate and work directly with the customer that originated the traffic
Aug 29 15:20 ET: We determined the fastest way to resolve the issue at the time was to work directly with our customer to re-plan their registration launch, and the event was taken offline
Aug 29 15:23 ET: Normal traffic resumed and all affected services recovered. We agreed with the customer that their registration launch would be delayed for 24 hours to allow us to better prepare and isolate their traffic spike
Aug 30 15:00 ET: Customer's event was re-launched and, as expected, traffic spiked again
Aug 30 15:01 ET: Platform became slow and intermittently unresponsive. Since our Ops team was expecting the traffic, they immediately began to investigate why the traffic queuing was not working properly.
Aug 30 15:05 ET: Ops team saw that our primary master database was not responding quickly enough, and database connections began to accumulate beyond our spare capacity. At this time the Ops team began stopping long-running connections to lower the connection count
Aug 30 15:38 ET: Connection count was lowered and the database began to service requests within capacity. At this time normal operations resumed and no additional incidents occurred afterwards.
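The connection triage described in the timeline can be sketched in miniature. This is an illustrative Python snippet, not Aventri's actual tooling: the `Connection` record, the 30-second threshold, and the `select_connections_to_kill` helper are hypothetical stand-ins for whatever process-list query the Ops team actually ran against the database.

```python
from dataclasses import dataclass

@dataclass
class Connection:
    id: int
    query: str
    seconds_running: int

# Hypothetical threshold: treat anything running longer than 30 seconds as a
# candidate for termination, to free up connection slots for new requests.
LONG_RUNNING_THRESHOLD_SECONDS = 30

def select_connections_to_kill(connections, threshold=LONG_RUNNING_THRESHOLD_SECONDS):
    """Return the IDs of connections whose queries have exceeded the threshold."""
    return [c.id for c in connections if c.seconds_running > threshold]

# Example: of three open connections, the two long-running ones are flagged.
pool = [
    Connection(1, "SELECT ... /* reporting */", 120),
    Connection(2, "SELECT 1", 2),
    Connection(3, "UPDATE ... /* stats */", 45),
]
print(select_connections_to_kill(pool))  # → [1, 3]
```

In practice this kind of triage buys headroom rather than fixing the underlying load, which is why the recovery took until 15:38.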
The traffic spike, while it was the trigger, was not the root cause. We had prepared with our customer, understanding there would be a spike in traffic, and had put a traffic queuing system in place to handle the load. On Aug 29th, when the system was impacted by the traffic spike, we diagnosed the issue as a bug in a stats tracking feature that became overloaded, and over the next 24 hours the dev team released a patch to fix the bug. On the 30th, when the launch was re-executed, the patch helped reduce the load, but unfortunately it revealed that the queuing system had been misconfigured by our staff and the bulk of the customer traffic was not routed through the queue.
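To illustrate this failure mode, here is a minimal sketch of how a queue-routing rule can silently fail when misconfigured. Everything here is hypothetical (the event IDs, the rule format, and the `route_request` function are assumptions, not Aventri's implementation); it only shows how a single wrong configuration entry lets traffic bypass the queue entirely.

```python
def route_request(event_id, queued_event_ids):
    """Send a registration request through the waiting-room queue if its
    event is on the queued list; otherwise send it straight to the app tier."""
    return "queue" if event_id in queued_event_ids else "direct"

# Correctly configured: the high-volume event is listed, so its traffic queues.
print(route_request("EVT-1001", {"EVT-1001"}))  # → queue

# Misconfigured (say, a typo in the event ID): the same traffic bypasses
# the queue and hits the database directly, with no error raised anywhere.
print(route_request("EVT-1001", {"EVT-10O1"}))  # → direct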
As mentioned above, we worked directly with our customer before and after the incidents and put protective systems, like the queue, in place. While we believe these measures would have been successful, ultimately the human error in configuration was not discovered in time. We were, however, able to recover the system on the 30th thanks to the efforts of our Ops team to triage and stabilize our master database.
Massive traffic surges are challenging to overcome, but we believe that direct collaboration and preparation with our customers for periods of known high volume, designing a strategy to deliver the service without interruption while maintaining the user experience, is an effective methodology. We continue to practice this methodology and are creating plans to automate the queuing system to protect against unpredictable traffic spikes.