Elevated error rate being investigated.
Incident Report for Aventri
Postmortem

Summary

On October 14, 2019, multiple Aventri clients reported receiving an error message when accessing the back end and front end of Aventri in North America (i.e., attendee registration). The Aventri Support Team picked up the ticket immediately and verified the issue, then escalated it to the Aventri DevOps Team at top priority. Within 22 minutes, the Aventri DevOps Team identified and resolved the issue and informed the Aventri Support Team of the resolution. The Aventri Support Team independently verified the resolution and then informed our clients, at which point the ticket was closed.

Why it happened

As part of the investigation, the DevOps Team performed a root cause analysis and determined that the issue was caused by an overly restrictive database max connections setting. This setting, which limits the number of simultaneous connections, is used to optimize the performance of our database. When the database came under heavy load, the maximum number of connections was reached and the database began rejecting additional connections, resulting in the errors seen by our clients.
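The failure mode described above can be illustrated with a small toy model. This is only a sketch: the class and function names are hypothetical, it does not talk to a real database, and it simply assumes that every open connection (sleeping or active) counts toward the ceiling.

```python
class ConnectionLimitExceeded(Exception):
    """Raised when a new connection would exceed the configured ceiling."""


class ToyDatabase:
    """Toy model of a server with a fixed connection ceiling (not a real driver)."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = set()
        self._next_id = 1

    def connect(self):
        # Assumption: sleeping and active sessions both count toward the ceiling.
        if len(self.open_connections) >= self.max_connections:
            raise ConnectionLimitExceeded("too many connections")
        conn_id = self._next_id
        self._next_id += 1
        self.open_connections.add(conn_id)
        return conn_id

    def close(self, conn_id):
        # Closing a connection frees one slot under the ceiling.
        self.open_connections.discard(conn_id)


def simulate_burst(db, attempts):
    """Try to open `attempts` connections; return (accepted, rejected) counts."""
    accepted = rejected = 0
    for _ in range(attempts):
        try:
            db.connect()
            accepted += 1
        except ConnectionLimitExceeded:
            rejected += 1
    return accepted, rejected
```

With a ceiling of 3, a burst of 5 connection attempts is split into accepted and rejected requests; the rejected ones correspond to the errors clients saw.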

What we did about it

As the database approached the max connections setting value, the Aventri DevOps Team was alerted and began diagnosing the issue. By the time clients began seeing and reporting the issue, the Aventri DevOps Team had already diagnosed it and begun resolving it, which accounts for the quick resolution time. The Aventri DevOps Team began by killing sleeping connections to bring the overall connection count under the max connections threshold. The Aventri DevOps Team also raised the max connections setting. Once a sufficient number of sleeping connections had been killed and the updated max connections setting took effect, the platform began running properly. The Aventri DevOps Team then reassigned team members to actively watch the database connection and active process levels to confirm that the issue was resolved and did not recur.
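The step of killing sleeping connections can be sketched as follows. This is not the team's actual tooling. It assumes a MySQL-compatible database (the report does not name the database engine), with rows shaped like the output of SHOW PROCESSLIST; the function name and the target_connections parameter are hypothetical.

```python
def plan_sleeping_kills(processlist, target_connections):
    """Return KILL statements for the longest-idle sleeping sessions.

    `processlist` is a list of dicts with "Id", "Command" and "Time" keys,
    modeled on the columns of MySQL's SHOW PROCESSLIST (an assumption).
    Only as many sessions as needed to reach `target_connections` are
    selected, and sessions that are not sleeping are never selected.
    """
    excess = len(processlist) - target_connections
    if excess <= 0:
        return []  # already at or under the target, nothing to kill
    sleeping = [row for row in processlist if row["Command"] == "Sleep"]
    # Longest-idle sessions first: they are the least likely to be in use.
    sleeping.sort(key=lambda row: row["Time"], reverse=True)
    return ["KILL {}".format(int(row["Id"])) for row in sleeping[:excess]]
```

The function only builds the statements; an operator would review and run them. On MySQL, the ceiling itself can be raised at runtime with SET GLOBAL max_connections, though the values used during this incident are not given in the report.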

Corrective and Preventative Measures

As a result of the root cause analysis associated with this issue, Aventri is making a series of changes to the platform database environment.

  • The Aventri DevOps Team has made the amended max connections setting permanent in all server environments.
  • The Aventri DevOps Team adjusted the connection wait timeout to reduce the number of sleeping connections at any given time.
  • The Aventri DevOps Team added monitoring and alerting around the number of active processes on the platform database as an additional safeguard to make sure the team identifies and resolves any potential future issues before they impact clients.
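The kind of threshold check behind such alerting might look like the sketch below. The function name and the 75% and 90% thresholds are illustrative assumptions, not Aventri's actual configuration; the same check could be applied to either the connection count or the active process count.

```python
def classify_usage(current, limit, warn_fraction=0.75, critical_fraction=0.9):
    """Classify a gauge (e.g. open connections) against its configured limit.

    Returns "ok", "warning" or "critical". The default thresholds are
    placeholders chosen for illustration only.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = current / limit
    if usage >= critical_fraction:
        return "critical"
    if usage >= warn_fraction:
        return "warning"
    return "ok"
```

Alerting at a warning level well below the limit gives the team time to act before connections start being rejected.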
Posted Oct 15, 2019 - 21:44 EDT

Resolved
After extensive monitoring we are resolving this incident. A post-mortem will be published shortly.
Posted Oct 15, 2019 - 12:35 EDT
Update
We are updating this incident status to Monitoring as we've seen all metrics running at expected operational levels. We will remain in this state for a little while longer while we continue to keep an eye on things.
Posted Oct 14, 2019 - 11:17 EDT
Monitoring
Affected customers should now be able to access all parts of the system. We are changing status to Degraded Performance while we continue to monitor.
Posted Oct 14, 2019 - 11:09 EDT
Investigating
We're experiencing an elevated level of errors for [API, Mobile, Registration, Admin Tools] and are currently looking into the issue. We will continue to post updates throughout the process.
Posted Oct 14, 2019 - 10:59 EDT
This incident affected: Web and Public API.