The incident started at 23:30 UTC when one of the hardware servers we host our infrastructure on died. Generally, we're ready for this kind of disaster because we have hot standby nodes for all our essential services, and the switch is made automatically with no need to involve anyone. Unfortunately, in this case, a broken non-essential service caused a chain reaction.
That hardware server hosted the master node of our PostgreSQL database, the authentication service, and a bunch of non-essential internal services. At first, everything went as expected. It took about 30 seconds to switch the database master to the standby replica node automatically. The auth service also switched to its standby node successfully.
Regarding non-essential services, one of them was our instance of Sentry (error reporting). Losing Sentry is bad, but it is not the end of the world: it is not necessary for everything else to work, and we could always recover it from the backup.
The problem was that the hardware node didn't die entirely, and Sentry's VM was left in a "half-dead" state: all network connections to Sentry were now timing out rather than being refused. This behavior exposed a bug in our code that broke the authentication service.
Here's what happened:
It took us some time to figure that out and implement the changes necessary to resolve that issue.
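The write-up does not say what the bug was, so the following is an added illustration, not the code from the incident: it models the failure class the description matches, an essential request path (here, a toy login check) that calls a non-essential telemetry service and waits for a reply with no bound on the wait. The "half-dead" endpoint, the `ErrorReporter` client, the `authenticate` function and the sample user table are all constructed for this example; the hardened client uses a short socket timeout plus a simple circuit breaker so a dead reporter cannot stall authentication.

```python
import socket
import threading
import time


def start_half_dead_endpoint():
    """Constructed stand-in for a half-dead telemetry host.

    The socket listens, so the kernel completes the TCP handshake, but nothing
    ever accepts or answers. A client that waits for a reply hangs unless it
    bounds the wait, which is the behaviour the write-up attributes to Sentry.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(16)
    return srv  # deliberately never accept()ed


class ErrorReporter:
    """Hypothetical error-reporting client (not Sentry's SDK).

    timeout=None reproduces the unbounded wait; a small timeout plus a circuit
    breaker keeps a dead reporter from stalling its callers.
    """

    def __init__(self, addr, timeout=None, failure_threshold=3, cooldown_s=30.0):
        self.addr = addr
        self.timeout = timeout
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.open_until = 0.0  # breaker is open (reports skipped) until this time
        self.dropped = 0       # reports skipped while the breaker was open

    def report(self, message):
        now = time.monotonic()
        if now < self.open_until:
            self.dropped += 1  # fail open: drop telemetry, never block the caller
            return False
        try:
            with socket.create_connection(self.addr, timeout=self.timeout) as s:
                s.sendall(message.encode() + b"\n")
                s.recv(1)  # wait for an ack; blocks forever when timeout is None
            self.consecutive_failures = 0
            return True
        except OSError:  # covers socket.timeout and connection resets
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = now + self.cooldown_s
                self.consecutive_failures = 0
            return False


def authenticate(username, password, users, reporter):
    """Toy credential check. The report is incidental to the decision, so a
    slow or dead reporter must never be able to delay the answer."""
    ok = users.get(username) == password
    if not ok:
        reporter.report(f"failed login for {username}")
    return ok


def failed_login_seconds(reporter, users):
    """Wall-clock seconds a failed login takes with the given reporter."""
    t0 = time.monotonic()
    authenticate("alice", "wrong", users, reporter)
    return time.monotonic() - t0


def login_still_blocked_after(reporter, users, wait_s=0.5):
    """Run a failed login on a thread and report whether it is still stuck
    after wait_s seconds. True means the reporter stalled the auth path."""
    t = threading.Thread(
        target=authenticate, args=("alice", "wrong", users, reporter), daemon=True
    )
    t.start()
    t.join(wait_s)
    return t.is_alive()


if __name__ == "__main__":
    srv = start_half_dead_endpoint()
    users = {"alice": "correct-horse-battery-staple"}  # constructed sample user
    naive = ErrorReporter(srv.getsockname(), timeout=None)
    hardened = ErrorReporter(srv.getsockname(), timeout=0.2, failure_threshold=2)
    print("naive login still stuck after 0.5s:", login_still_blocked_after(naive, users))
    print("hardened failed login took %.2fs" % failed_login_seconds(hardened, users))
    srv.close()  # resets the queued connections and releases the stuck thread
```

The two knobs matter independently: the timeout bounds how long a single report can hold a request, and the breaker stops every request from paying that timeout once the reporter is known to be down. Without the breaker, a 200 ms timeout on a busy login endpoint still turns a dead telemetry host into a latency problem for authentication.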
Here's what we're going to do to make sure that this won't happen again: