On March 19, 2018, at 8:04 PM EST, the etouches application became unavailable in North America to both front-end and back-end users for approximately 1 hour and 14 minutes. Back-end users could see the login page but were unable to proceed beyond it. Registrants were redirected to an error page and could not begin registration.
All customers in our North America region were impacted by this outage. All customers in APAC and EMEA regions were unaffected.
The root cause of this outage was identified as a catastrophic, unrecoverable failure of the storage resources backing our North American application database cluster at our hosting provider.
Once the Ops team identified the issue, a high-priority ticket was opened with our hosting provider for assistance in resolving the database cluster storage failure. Because our hosting provider could not provide an ETA for recovery of the storage system, our team decided to cut over to our failover database cluster and began the switchover process. The failover database cluster was activated but took longer than expected to come online. On investigation, we found that the cluster was in the middle of a scheduled backup job and could not be promoted to primary until that job completed. The backup job was terminated, the failover cluster became active, and it began taking requests from our application.
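For readers curious about the mechanism, the promotion delay can be sketched as follows. This is a simplified, hypothetical model (the class and method names are illustrative, not our actual tooling): a standby cluster refuses promotion while a backup job holds it, so the job must be terminated before the standby can take over as primary.

```python
class FailoverCluster:
    """Minimal model of a standby cluster that cannot be promoted mid-backup."""

    def __init__(self):
        self.backup_in_progress = False
        self.role = "standby"

    def start_backup(self):
        self.backup_in_progress = True

    def cancel_backup(self):
        self.backup_in_progress = False

    def promote(self):
        # Promotion is refused while a backup job is running; this mirrors
        # the delay encountered during the incident.
        if self.backup_in_progress:
            raise RuntimeError("backup in progress; cancel it before promoting")
        self.role = "primary"


cluster = FailoverCluster()
cluster.start_backup()
try:
    cluster.promote()          # fails: scheduled backup still running
except RuntimeError:
    cluster.cancel_backup()    # terminate the backup job first
    cluster.promote()          # now succeeds

print(cluster.role)  # prints "primary"
```

In practice, checking for (and if necessary terminating) scheduled jobs on the standby before promotion is the step we will build into our failover procedure.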
With the failover cluster online, the Ops team continued working with our hosting provider to repair and restore the primary database cluster. Once the team was confident that the primary cluster was back to normal, we scheduled a 30-minute maintenance window to switch back to our normal infrastructure.
There was no data loss or exposure during the entire incident.
Ultimately, our monitoring system, failover infrastructure, and Ops team performed as expected under our disaster recovery procedure. While the outcome of this major outage was ultimately positive, we learned valuable lessons, chiefly that scheduled backup jobs on the failover cluster can delay promotion, which we will use to make any future failover switch faster.
We sincerely apologize to all of our affected customers.