It has been a difficult week for many of our customers due to several partial outages of our systems, and we recognize the disruption this can create for your business. We pride ourselves on more than 99.99% uptime over the past three years and know that this stability is what you have come to expect and deserve. Please accept our sincere apologies for not meeting that standard this week.
We have provided below a more technical explanation of what occurred, how we addressed it, and some lessons learned on how we can both avoid these types of issues in the future and communicate with you better during times like this. For any questions or concerns, please don't hesitate to engage our technical support and account management teams, who are committed to your success.
On Wednesday, Sept 24, at approximately 8:30am Eastern Time (all times referenced in this document are US Eastern Time), we experienced the first of a handful of individual service disruptions that could have impacted customers hosted in our North America regions (EMEA and APAC regions were unaffected throughout). At that time our DevOps team had received, and was already responding to, automated monitoring alerts triggered because one of our server clusters was reporting high CPU utilization. Our standard operating procedure for these situations is to gather telemetry data from the servers if possible (telemetry = system stats, for the non-technical readers) and to ensure they are rebooted, either automatically or manually if needed. These reboots occur in a rolling manner over a short time period rather than all at once. During this reboot process we noticed that the servers in this cluster were not restarting successfully, so we immediately opened a ticket with our hosting provider (we use public cloud providers). We subscribe to the provider's highest level of Enterprise Support so that in cases like this, when we depend on them for a resolution, we can be sure we are getting the fastest response available. They advised us that the file systems for that cluster of servers were corrupted and assisted us with remounting the drives and repairing the corruption. While this was in progress, other members of the DevOps team began analyzing usage logs, since there had been no changes to our software or infrastructure leading up to this point. After all servers had been brought back online, our monitoring reported no issues, all operations and services returned to normal, and our attention turned to discovering the cause of the CPU spikes.
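For our more technical readers, here is a minimal shell sketch of what a rolling reboot looks like in practice: servers are restarted one at a time with a pause between each, rather than all at once. The host names, pause length, and function name are illustrative assumptions, not our actual tooling.

```shell
# A minimal, hypothetical sketch of a rolling reboot: restart servers one
# at a time, pausing between each, so the cluster never goes down all at once.
# Host names and the pause length are illustrative assumptions.
rolling_reboot() {
  for host in "$@"; do
    echo "rebooting $host"
    # ssh "$host" sudo reboot   # the real reboot step, omitted in this sketch
    sleep 1                     # brief pause before moving to the next server
  done
}

rolling_reboot app-01 app-02 app-03
```

The pause between hosts is what gives the remaining servers time to absorb traffic while each one restarts.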
About 4.5 hours later that same day, our monitoring alerts were triggered once again, and the problem presented itself in the same manner, with overutilization of some server CPUs. This time the reboot process functioned as expected and all servers were successfully brought back online after a short time. Since we had not been able to discover any abnormal log patterns from the first incident, we knew there could be a systemic reason for these disruptions. This is when we discovered that the root cause was a database middleware component that, ironically, we use to increase performance and reliability, and that had been running successfully in production for over a year. Since this is a third-party component, we immediately opened a ticket with the vendor to help us understand why it was causing the overutilization.
Over the next two days we were unable to completely fix the problem: the remediation steps the vendor gave us appeared to help, but the same issues kept recurring. When a third incident occurred, we decided to completely remove the third-party software and reconfigure all of our servers to bypass it. Once that was done, the platform returned to normal operating behavior, and we felt that while we hadn't resolved the root cause, we had stabilized the platform enough to avoid future occurrences.
Unfortunately, this isn't where the story ends, because there was another brief service disruption the following day (yesterday), exhibiting the very same symptoms as before, only this time with the third-party component completely removed from the equation, or so we thought... After running through the same server reboot/recovery process, we noticed the component had been unexpectedly reinstalled! This happened because, in the time since the component was manually removed by the DevOps team, an automated build had been triggered as part of our continuous integration (CI) process, and it automatically reinstalled the software and reconfigured the servers to their old state. This was a total miss on our part and has highlighted some improvements we need to make in our CI process to account for abnormal environmental conditions.
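As one concrete example of the kind of CI safeguard we mean, the shell sketch below gates the middleware install behind an explicit flag, so a deliberate manual removal is not silently undone by the next automated build. The flag, package, and function names here are hypothetical, not our actual build configuration.

```shell
# Hypothetical CI guard: install the database middleware only when an
# explicit flag says to, so a deliberate manual removal is not silently
# reversed by the next automated build. Flag and package names are
# illustrative assumptions.
INSTALL_DB_MIDDLEWARE="${INSTALL_DB_MIDDLEWARE:-false}"

install_middleware() {
  if [ "$INSTALL_DB_MIDDLEWARE" = "true" ]; then
    echo "installing db-middleware"
    # package-manager install db-middleware   # real install step omitted
  else
    echo "skipping db-middleware (disabled by flag)"
  fi
}

install_middleware
```

Defaulting the flag to "false" means the build must be told explicitly to put the component back, rather than restoring it as a side effect.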
Today we are certain that the CPU overutilization that led to the service disruptions was caused by the database middleware, as evidenced by the sequence of events that occurred. Even though it repeatedly affected only some of our servers, we have removed the component from all servers until we understand, with the software vendor's help, what happened. We do not anticipate that temporarily removing this component will cause any issues, as it is not a critical piece of our platform's operation; it is there only to enhance it.
During this process we saw clearly how we can improve our service and, equally important, how we can improve our communications if incidents like this ever occur. We are grateful that some of our customers have given us feedback that they appreciated our efforts to resolve the problem; however, it was clear that we should have communicated much better while it was occurring. To this end, our executive team is working cross-departmentally, effective immediately, to improve our status reporting with more timely and meaningful updates. The DevOps team is also working on an immediate plan to change some of our automated processes and notifications to better equip us to handle these types of incidents with confidence and expedience.