Logins Failing in North America Region
Incident Report for Aventri
Postmortem
Report Date: 2018/03/20
Issue Summary

On March 19, 2018 at 8:04 PM EDT, the etouches application became unavailable in North America to both front-end and back-end users for approximately 1 hour and 14 minutes. Back-end users could see the login page but could not proceed beyond it. Registrants were redirected to an error page and were unable to begin registration.

Affected Customers

All customers in our North America region were impacted by this outage. All customers in APAC and EMEA regions were unaffected.

Timeline (all times Eastern Time)
  • March 19 @ 08:04 PM: Application became unavailable to all users in North America.
  • March 19 @ 08:05 PM: Our Ops team was notified of the outage by our automated monitoring system.
  • March 19 @ 09:18 PM: Ops team made the application available once again to all affected customers.
Root Cause

The root cause of this outage was a catastrophic, unrecoverable failure of the storage resources backing our North American application database cluster at our hosting provider.

Resolution and recovery

Once the Ops team identified the issue, we opened a high-priority ticket with our hosting provider for assistance in resolving the database cluster storage failure. Because the hosting provider could not provide an ETA for recovery of the storage system, our team decided to cut over to our failover database cluster and began the switchover process. The failover database cluster was activated but took longer than expected to come online. Investigation showed that the cluster was in the middle of a scheduled backup job and could not be promoted to primary until that job completed. The team terminated the backup job, the failover cluster became active, and it began taking requests from our application.
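
The promotion delay described above can be illustrated with a minimal sketch. It assumes a purely hypothetical in-memory `Cluster` class and a hypothetical `fail_over` helper. Neither reflects our actual tooling or any real database API. The sketch only shows the idea of checking for a running backup job before attempting promotion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a database cluster (not a real API)."""
    name: str
    role: str = "standby"          # "standby" or "primary"
    backup_running: bool = False   # True while a scheduled backup job is active
    log: List[str] = field(default_factory=list)

    def cancel_backup(self) -> None:
        # Stop the (simulated) backup job so that promotion can proceed.
        self.backup_running = False
        self.log.append("backup cancelled")

    def promote(self) -> None:
        # Mirrors the behaviour described above: no promotion mid-backup.
        if self.backup_running:
            raise RuntimeError("cannot promote while a backup job is running")
        self.role = "primary"
        self.log.append("promoted to primary")


def fail_over(standby: Cluster, cancel_running_backup: bool = True) -> Cluster:
    """Promote the standby, clearing a blocking backup job first if allowed."""
    if standby.backup_running:
        if not cancel_running_backup:
            raise RuntimeError("backup in progress; promotion would stall")
        standby.cancel_backup()
    standby.promote()
    return standby


if __name__ == "__main__":
    # Hypothetical cluster name, used for illustration only.
    failover = Cluster(name="na-failover", backup_running=True)
    fail_over(failover)
    print(failover.role, failover.log)
```

Checking for a blocking job before promotion, rather than discovering it afterwards, is one way a switchover of this kind could avoid the wait described above.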

Once the application was back online on the failover cluster, the Ops team continued to work with our hosting provider to repair and restore the primary database cluster. Once the team was confident that the primary database cluster was back to normal, we scheduled a 30-minute maintenance window to switch back to our normal infrastructure.

There was no data loss or exposure during the entire incident.

Corrective and Preventative Measures

Our monitoring system, failover infrastructure, and Ops team performed as expected according to our disaster recovery procedure. While we achieved a positive outcome for a major outage event, we also learned valuable lessons that we believe will make a failover switch quicker if one is ever needed in the future.

We sincerely apologize to all of our affected customers.

Posted Mar 20, 2018 - 22:40 EDT

Resolved
We have completed our scheduled maintenance and everything is back to normal.
Posted Mar 20, 2018 - 04:09 EDT
Update
To minimize the impact on some of our clients, we have scheduled the maintenance window to bring our master databases back online for Mar 20, 2018 at 3am EST.

During this time, etouches may be unavailable in North America. We expect the maintenance to take less than 30 minutes.
Posted Mar 19, 2018 - 22:44 EDT
Update
We have been monitoring and have seen normal activity for the past 30 minutes. We still need to bring our master databases back online and are scheduling a maintenance window, tentatively set for 11:59pm EST. This window could involve up to 30 minutes of planned downtime, so we are coordinating with clients running mission-critical activities at that time. We will update this page if the maintenance window timing changes as a result.
Posted Mar 19, 2018 - 21:51 EDT
Monitoring
Our failover instances are online and customers should now be seeing normal performance. We are continuing to monitor and are working to bring our master databases back online.
Posted Mar 19, 2018 - 21:20 EDT
Update
Our failover databases are coming online more slowly than expected. We are working with our hosting provider to resolve this as quickly as possible.
Posted Mar 19, 2018 - 21:15 EDT
Identified
This appears to be a hard-down failure of our master database cluster, so we are switching to a failover instance and expect to be online again in a few minutes.
Posted Mar 19, 2018 - 20:16 EDT
Investigating
We are currently investigating why our North American instance is not allowing new user logins.
Posted Mar 19, 2018 - 20:12 EDT
This incident affected: Web, Email, and Public API.