Partial Service Disruption
Incident Report for Aventri
Postmortem

It has been a difficult week for many of our customers due to several partial outages of our platform, and we recognize the disruption this can create for your business. We pride ourselves on our 99.99%+ uptime over the past three years and know that this stability is what you have come to expect and deserve; please accept our sincere apologies for not meeting that standard this week.

Below we have provided a more technical explanation of what occurred, how we addressed it, and some lessons learned about how we can both avoid these types of issues in the future and communicate with you better during times like this. For any questions or concerns, please don't hesitate to engage our technical support and account management teams, who are committed to your success.

Summary

On Wednesday, Sept 24, at approximately 8:30am Eastern Time (all times referenced in this document are US Eastern Time) we experienced the first of a handful of individual service disruptions that may have impacted customers hosted in our North America regions (EMEA and APAC regions were unaffected throughout). At that time our DevOps team had received, and was already responding to, automated monitoring alerts triggered because one of our server clusters was reporting high CPU utilization. Our standard operating procedure for these situations is to gather telemetry data from the servers where possible (telemetry = system stats, for the non-technical readers) and to ensure they are rebooted, automatically or manually if needed. These reboots occur in a rolling manner over a short period rather than all at once.

During this reboot process we noticed that the servers in this cluster were not restarting successfully, so we immediately opened a ticket with our hosting provider (we use public cloud providers). We subscribe to the provider's highest level of Enterprise Support so that in cases like this, when we are dependent on them for a resolution, we can be sure we are getting the fastest response available. They advised us that the file systems for that cluster of servers were corrupted and assisted us with remounting the drives and repairing the corruption. While this was in progress, other members of the DevOps team began to analyze usage logs, since there had been no changes to our software or infrastructure leading up to this point. After all servers had been brought back online, our monitoring reported no issues and all operations and services were restored to normal, so attention turned to discovering the cause of the CPU spikes.
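
For readers who want a bit more detail on the reboot procedure, the following is a minimal sketch of what a rolling reboot with health checks looks like. The host names and helper functions here are illustrative placeholders, not our actual tooling; the real process runs through our internal automation and the cloud provider's APIs.

    import time

    # Illustrative placeholders only -- the host names and helpers below are
    # not our actual tooling; the real SOP runs through internal automation
    # and the cloud provider's APIs.
    CLUSTER_HOSTS = ["app-na-01", "app-na-02", "app-na-03", "app-na-04"]

    def collect_telemetry(host):
        # Placeholder: gather CPU, memory, and disk stats before the reboot.
        return {"host": host, "cpu_pct": 97.0}

    def reboot(host):
        # Placeholder: issue the reboot via SSH or the provider's API.
        print(f"rebooting {host} ...")

    def is_healthy(host):
        # Placeholder: check that the host answers its health endpoint again.
        return True

    def rolling_reboot(hosts, health_timeout=600, poll_interval=15):
        """Reboot one host at a time, waiting for each to pass health checks
        before moving on, so the cluster keeps serving traffic throughout."""
        for host in hosts:
            collect_telemetry(host)
            reboot(host)
            deadline = time.time() + health_timeout
            while not is_healthy(host):
                if time.time() > deadline:
                    # This is the failure mode we hit on Sept 24: a host that
                    # never comes back, which stops the run and is escalated.
                    raise RuntimeError(f"{host} did not recover after reboot")
                time.sleep(poll_interval)

    rolling_reboot(CLUSTER_HOSTS)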

About 4.5 hours later that same day, our monitoring alerts were triggered once again and the problem presented itself in the same manner, with overutilization of some server CPUs. This time the reboot process functioned as expected and all servers were successfully brought back online after a short time. Since we had not been able to discover any abnormal log patterns from the first incident, we suspected there could be a systemic reason for these disruptions. This is when we discovered that the root cause was a database middleware component that, ironically, we use to increase performance and reliability, and that had been running successfully in production for over a year. Since this is a third-party component, we immediately contacted the vendor and opened a ticket for them to help us understand why it was causing the overutilization.

Over the next two days we were unable to fully resolve the problem: the steps provided by the vendor appeared to help, but the same issues kept recurring. When a third incident occurred, we made the decision to remove the third-party software entirely and reconfigure all of our servers to bypass it. Once that was done we began to see normal operating behavior from the platform, and we felt that while we hadn't resolved the root cause, we had stabilized the platform enough to avoid future occurrences.
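
To give a sense of what "bypassing" the component means in practice: the middleware sits between our application servers and the database, so removing it is essentially a configuration change that points connections at the database directly. The sketch below is illustrative only; the hostnames, ports, and environment variable are hypothetical stand-ins, and the real change was made through our configuration management rather than application code.

    import os

    # Illustrative only: the hostnames, ports, and environment variable are
    # hypothetical stand-ins; the real change was made through configuration
    # management, not application code.
    MIDDLEWARE_ENDPOINT = "db-proxy.internal:6033"   # the third-party middleware
    DIRECT_DB_ENDPOINT = "db-primary.internal:3306"  # the database itself

    def database_endpoint():
        """Return the endpoint the application should connect to.

        Setting BYPASS_DB_MIDDLEWARE=1 routes connections straight to the
        database, taking the middleware out of the request path entirely.
        """
        if os.environ.get("BYPASS_DB_MIDDLEWARE") == "1":
            return DIRECT_DB_ENDPOINT
        return MIDDLEWARE_ENDPOINT

    host, port = database_endpoint().split(":")
    print(f"connecting to {host} on port {port}")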

Unfortunately, this isn't where the story ends, because there was another brief service disruption the following day (yesterday) exhibiting the very same symptoms as before, only this time with the third-party component completely removed from the equation, or so we thought... After running through the same server reboot/recovery process we noticed the component had unexpectedly been reinstalled. This happened because, in the time since the component was manually removed by the DevOps team, an automated build had been triggered as part of our continuous integration (CI) process, and it reinstalled the software and reconfigured the servers to their previous state. This was a total miss on our part and has highlighted some improvements we need to make in our CI process to account for abnormal environmental conditions.
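
One of the CI improvements we are making is a guard that prevents an automated build from reinstalling a component an operator has deliberately disabled. The sketch below illustrates the idea with a hypothetical marker file and component name; the actual safeguard will live in our CI/CD tooling rather than in a standalone script like this.

    import pathlib
    import sys

    # Hypothetical marker path and component name, for illustration only; the
    # actual safeguard will live in our CI/CD tooling rather than a script.
    DISABLED_MARKER = pathlib.Path("/etc/disabled-components/db-middleware")

    def install_db_middleware():
        # Placeholder for the provisioning step the automated build normally runs.
        print("installing database middleware ...")

    def main():
        if DISABLED_MARKER.exists():
            # An operator deliberately removed this component during an incident;
            # the automated build must not silently put it back.
            print("db-middleware is administratively disabled; skipping install")
            return 0
        install_db_middleware()
        return 0

    if __name__ == "__main__":
        sys.exit(main())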

Today we are certain that the CPU overutilization that led to the service disruptions was caused by the database middleware, as evidenced by the sequence of events described above. Even though the problem repeatedly affected only some of our servers, we have removed the component from all servers until we understand, with the software vendor's help, exactly what happened. We do not anticipate the temporary removal of this component causing any issues, as it is not a critical piece of our platform's operation but exists only to enhance it.

During this process we saw clearly how we can improve our service and, equally important, how we can improve our communications if incidents like this ever occur again. We are grateful that some of our customers have given us feedback that they appreciated our efforts to resolve the problem; however, it is clear that we should have communicated much better while it was occurring. To this end, our executive team is immediately working cross-departmentally to improve our status reporting with more timely and meaningful updates. The DevOps team is also working on an immediate plan to change some of our automated processes and notifications to better equip us to handle these types of incidents with confidence and expedience.

Posted Sep 27, 2019 - 16:53 EDT

Resolved
We are closing this incident and will be posting the postmortem in a few minutes. Once again, please accept our deepest apologies and thank you for your patience with us during this very frustrating time.
Posted Sep 27, 2019 - 16:45 EDT
Update
All systems are functioning as expected; we will continue to monitor and provide updates.
Posted Sep 27, 2019 - 13:18 EDT
Update
We are receiving notifications from clients who are experiencing difficulties accessing the system. Our teams are investigating; we apologize for any inconvenience.
Posted Sep 27, 2019 - 13:06 EDT
Update
The team has disabled the faulty component across the affected servers, and impacted customers should now be operating normally. As an added measure we are also disabling the same component across all other clusters, even if they were not impacted. We are updating the status to Operational but will continue to work on the issue and provide a full postmortem of these recent related incidents.
Posted Sep 26, 2019 - 17:01 EDT
Monitoring
The DevOps team is currently bringing services back online. We will continue to post updates every 10 minutes until services are restored.
Posted Sep 26, 2019 - 16:39 EDT
Investigating
Unfortunately, we continue to experience similar partial outages for our North American customers. The DevOps team is actively working to bring the affected services and customers back online.
Posted Sep 26, 2019 - 16:33 EDT
Monitoring
Affected systems are now operational. We are monitoring status and working with our hosting provider to confirm the root cause.
Posted Sep 25, 2019 - 13:18 EDT
Identified
We are currently working with our hosting provider to resolve connectivity issues that some of our customers in North America are experiencing. If you are unaffected by this incident you may disregard it, as it is impacting only a fixed segment of users.
Posted Sep 25, 2019 - 11:42 EDT
This incident affected: Web and Public API.