Investigating - Degraded Performance - US instance
Incident Report for Aventri
Postmortem

Incident Report for Partial Outage Occurring on August 29, 2018 and August 30, 2018 in North America

Report Date: September 3, 2018

On August 29th and 30th, 2018, we experienced two separate partial outages of our EMS software platform in North America. The following incident report details the nature of these incidents and our responses. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.

Issue Summary

On August 29th and 30th, requests to the Aventri registration platform were interrupted for 23 minutes and 38 minutes, respectively. This impacted Registration, Back-End access, and APIs. The catalyst for these partial outages was a series of extremely high spikes in inbound traffic (over 20x our maximum reserve capacity) that we failed to manage properly.

Affected Customers

Customers and registrants in North America who attempted to access our platform received intermittent errors or timeouts for the duration of each incident. Mobile apps remained available but were not able to synchronize data during the incidents. Customers in all other regions were not impacted.

Timeline (all times Eastern Time)

Aug 29 15:00 ET: Traffic spike began

Aug 29 15:01 ET: Platform became slow and intermittently unresponsive. Our Ops team began to investigate and to work directly with the customer whose event originated the traffic

Aug 29 15:20 ET: We determined the fastest way to resolve the issue at the time was to work directly with our customer to re-plan their registration launch, and the event was taken offline

Aug 29 15:23 ET: Normal traffic resumed and all affected services recovered. We agreed with the customer that their registration launch would be delayed for 24 hours to allow us to better prepare and isolate their traffic spike

Aug 30 15:00 ET: Customer's event was re-launched and, as expected, traffic spiked again

Aug 30 15:01 ET: Platform became slow and intermittently unresponsive. Since our Ops team was expecting the traffic, they immediately began to investigate why the traffic queuing was not working properly.

Aug 30 15:05 ET: Ops team saw that our primary master database was not responding quickly enough and that database connections were accumulating beyond our spare capacity. At this time the Ops team began terminating long-running connections to lower the connection count (a sketch of this kind of triage follows the timeline)

Aug 30 15:38 ET: Connection count was lowered and the database began to service requests within capacity. At this time normal operations resumed and no additional incidents occurred afterwards.
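To make the triage step between 15:05 and 15:38 concrete, the snippet below is a minimal sketch of how long-running database connections can be identified and terminated to bring the connection count back within capacity. It assumes a MySQL-compatible primary database and the pymysql Python client; the host name, credentials, and 60-second cutoff are hypothetical illustrations, not values from the actual incident.

```python
# Minimal sketch: terminate long-running connections on a MySQL-compatible
# primary database to bring the connection count back within capacity.
# Host, credentials, and the 60-second cutoff are hypothetical examples.
import pymysql

MAX_QUERY_SECONDS = 60  # hypothetical cutoff for a "long-running" connection

conn = pymysql.connect(
    host="primary-db.example.internal",  # hypothetical host
    user="ops_admin",                    # hypothetical account
    password="REDACTED",
    database="information_schema",
)
try:
    with conn.cursor() as cur:
        # Find connections that have been executing a statement longer than the cutoff.
        cur.execute(
            "SELECT ID, USER, TIME FROM PROCESSLIST "
            "WHERE COMMAND = 'Query' AND TIME > %s",
            (MAX_QUERY_SECONDS,),
        )
        for conn_id, user, seconds in cur.fetchall():
            print(f"Killing connection {conn_id} ({user}), running for {seconds}s")
            # KILL takes a literal connection ID, so format it as an integer.
            cur.execute(f"KILL {int(conn_id)}")
finally:
    conn.close()
```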

Root Cause

The traffic spike, while the trigger, was not the root cause. We had prepared with our customer, understanding there would be a spike in traffic, and had put a traffic queuing system in place to handle the load. On Aug 29th, when the system was impacted by the traffic spike, we diagnosed the issue as a bug in a stats-tracking feature that became overloaded, and over the next 24 hours the dev team released a patch to fix the bug. On the 30th, when the launch was re-executed, the patch helped reduce the load, but it also revealed that the queuing system had been misconfigured by our staff and that the bulk of the customer traffic was not routed through the queue.

Resolution and recovery

As mentioned above, we worked directly with our customer before and after the incidents and put protective systems like the queue in place. While we believe these measures would have been successful, the human error in the queue configuration was not discovered in time. We were, however, able to recover the system on the 30th thanks to the efforts of our Ops team to triage and stabilize our master database.

Corrective and Preventative Measures

Massive traffic surges are challenging to overcome. We believe that direct collaboration and preparation with our customers ahead of periods of known high volume, to design the best strategy for delivering the service without interruption while maintaining the user experience, is an effective methodology. We will continue to practice this methodology, and we are also creating plans to automate the queuing system so that it protects against unpredictable traffic spikes.
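As an illustration of the automation we are planning, the sketch below shows one way a queuing gate could engage on its own when traffic exceeds capacity, rather than relying on manual configuration for each launch. It is a simplified, hypothetical example; the capacity figure, retry interval, and in-process counter are illustrative assumptions, not our production design.

```python
# Hypothetical sketch of an automated traffic-queuing gate: when concurrent
# active requests reach reserve capacity, new arrivals get a waiting-room
# response and retry later instead of hitting the registration backend.
# The capacity and retry interval are illustrative numbers only.
import threading

RESERVE_CAPACITY = 500        # illustrative: concurrent requests we can serve
RETRY_AFTER_SECONDS = 5       # illustrative: how long a queued visitor waits

class TrafficGate:
    def __init__(self, capacity=RESERVE_CAPACITY):
        self.capacity = capacity
        self.active = 0
        self.lock = threading.Lock()

    def try_admit(self):
        """Admit the request if a slot is free; otherwise signal 'queue it'."""
        with self.lock:
            if self.active < self.capacity:
                self.active += 1
                return True
            return False

    def release(self):
        """Free a slot once a request finishes."""
        with self.lock:
            self.active = max(0, self.active - 1)

gate = TrafficGate()

def handle_registration_request():
    # The gate engages automatically under load; no per-launch manual config.
    if not gate.try_admit():
        return 503, f"Waiting room: please retry in {RETRY_AFTER_SECONDS} seconds"
    try:
        return 200, "Registration page served"   # stand-in for the real backend
    finally:
        gate.release()
```

In a production deployment the counter would live in shared infrastructure (for example at the load balancer or an edge service) rather than in one process's memory, and queued visitors would be issued a place in line; the sketch only illustrates the automatic, threshold-driven behavior we are aiming for.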

Posted Sep 07, 2018 - 13:08 EDT

Resolved
This incident has been resolved.
Posted Aug 30, 2018 - 16:16 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 30, 2018 - 16:14 EDT
Investigating
We are currently investigating this issue.
Posted Aug 30, 2018 - 15:35 EDT
This incident affected: Web.