We would like to extend our very sincere apologies for the disruption of service that our public API suffered for prolonged periods on 7/24 and 7/25. We realize this outage had a major impact on customers and vendors that rely on our public API for integrations and on-site activities, and it was the longest outage of its type that we have experienced in many years. The good news is that there was no impact on the application for business users or registrants, and there was no data loss, corruption, or other issue that could severely affect our customers’ businesses. Please find below a detailed analysis, remedy, and lessons learned from our technical team:
Issue Recognition and Analysis
At etouches we pride ourselves on delivering a high level of service availability. One of the steps we take to avoid scheduled downtime for the vast majority of our new releases is to prepare and execute certain database scripts ahead of the release when we believe they could take an extended amount of time to run. On July 24th, 2017 at 4:08 PM EST, our team executed one such database script, which was meant to prepare the system for an upcoming enhancement to our API. This script was successfully tested in our Development, QA, and Stage environments; otherwise we would not have executed it in Production. However, we failed to account for the difference in usage patterns between our production environment and the pre-production environments. These usage patterns caused the script to behave in an unexpected manner and led to the following behavior:
Many API endpoints began to fail, starting with authentication. Clients would receive an empty response instead of the expected response.
Because of the high volume of API requests we service, our error logs were flooded with messages, and the third-party tool we use to access these logs began to lag behind, which made diagnosis difficult.
Data replication to our read-only reporting database began to experience continually increasing replication lag.
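For integrations that call the API, the empty-response symptom described above can be detected on the client side. The following is a minimal sketch only, not part of the etouches API or any official client: the function names, the retry count, the backoff delays, and the assumption that a healthy response is a non-empty JSON body are all hypothetical.

```python
import json
import time


class EmptyResponseError(Exception):
    """Raised when the API returns a body with no content."""


def parse_body(body):
    # Assumption: a healthy response is a non-empty JSON document.
    if body is None or not body.strip():
        raise EmptyResponseError("empty response body")
    return json.loads(body)


def call_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it yields a non-empty JSON body.

    fetch is any zero-argument callable returning the raw response body
    (hypothetical; wrap whatever HTTP client the integration already uses).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return parse_body(fetch())
        except EmptyResponseError as error:
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, ...
                sleep(base_delay * (2 ** attempt))
    # Every attempt returned an empty body; surface the failure to the caller.
    raise last_error
```

Raising a distinct error for an empty body, rather than passing it on to the JSON parser, lets an integration tell this kind of outage apart from a malformed response and alert on it separately.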
Once the root cause was identified, we first attempted to roll back just the data changes; this attempt failed at about 8:00 PM EST on July 24.
The first team assigned to the issue worked to resolve it, and we believed they had resolved the problem at 11:00 PM EST on July 24 by deploying a code hotfix (patch). While authentication requests and some other endpoints were executing successfully again, not all endpoints had recovered.
Because a shift change had occurred after the initial fix was believed to be in place, a new team was assigned to the issue once the additional failing endpoints were identified. The first team handed over its findings, and the new team began troubleshooting the remaining endpoints.
On July 25, at approximately 11:00 AM EST, the team concluded that the quickest way to resolve the remaining issues was to deploy the scheduled July 25th release, and did so with the help of our release management team.
After this deploy, we monitored the error logs, which were responsive again, to confirm that all issues were resolved, but we continued to see several errors from the API.
At this point, the team took the final action that resolved the issue: completely reverting the new API enhancement code (ET-24969). The revert was completed and deployed at 2:00 PM EST on July 25th.
Remedy
The reverted code from ET-24969 is currently back in QA while we determine the best way to test the upgrade process against Production API usage patterns.
Once that approach is determined, we will follow our normal process and deploy to Stage and then back to Production.
We will notify our customers if we determine that the safest way to perform this upgrade is to take the system offline for a scheduled maintenance window.
Lessons Learned
We are going to introduce a more formal process that gathers additional data from our production logs, so that we can evaluate database changes and their impact under production usage patterns more thoroughly.
Once this procedure is created, we will be holding a series of training workshops for our development staff to ensure they are versed in this and all other critical aspects of the deployment process.
In the meantime, our Director of Development, VP of Technology, or Quick Response Team Development Manager will review all migration scripts prior to Stage and Production deployment.
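As an illustration of the kind of production-log analysis described above, the sketch below tallies API requests per endpoint so that a pre-production test could be weighted toward the traffic mix actually seen in Production. It is not our internal tooling: the log line format, the sample lines, the placeholder paths, and the function name are all assumptions made for the example.

```python
from collections import Counter


def endpoint_mix(log_lines):
    """Return each endpoint's share of total requests.

    Assumes a hypothetical log format: "<timestamp> <METHOD> <path> <status>".
    Lines that do not match this shape are skipped.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        _, method, path, _ = parts
        # Drop the query string so requests group by endpoint.
        counts[(method, path.split("?", 1)[0])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}


# Made-up sample lines with placeholder paths, for illustration only.
sample = [
    "2017-07-24T16:08:01 GET /example/auth?key=abc 200",
    "2017-07-24T16:08:02 GET /example/auth?key=def 200",
    "2017-07-24T16:08:02 GET /example/items 200",
    "malformed line",
]
for (method, path), share in sorted(endpoint_mix(sample).items()):
    print(f"{method} {path}: {share:.0%}")
```

A summary like this only captures request proportions. Reproducing concurrency and timing, which also affect how a long-running database script behaves, would need a fuller replay of the traffic.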