The incident started at 23:30 UTC, when one of the hardware servers that hosts our infrastructure died. Generally, we're prepared for this kind of disaster: we have hot standby nodes for all our essential services, and the switchover happens automatically without anyone needing to get involved. Unfortunately, in this case, a broken non-essential service caused a chain reaction.
That hardware server hosted the master node of our PostgreSQL database, the authentication service, and a bunch of non-essential internal services. At first, everything went as expected: it took about 30 seconds to switch the database master to the standby replica node automatically, and the authentication service also switched over to its standby node successfully.
As for the non-essential services, one of them was our instance of Sentry (error reporting). Losing Sentry is bad, but it is not the end of the world: nothing else needs it in order to work, and we could always restore it from a backup.
The problem was that the hardware node didn't die entirely, and Sentry's VM ended up in a "half-dead" state: all network connections to Sentry now timed out. This behavior exposed a bug in our code that broke the authentication service.
Here's what happened:
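To illustrate the failure mode, here is a minimal sketch in Python. It is not our actual code: the endpoint, the `requests`-based HTTP client, and helpers like `report_error` and `check_credentials` are all hypothetical, and the assumption that errors were reported to Sentry synchronously, inside the authentication request path and without a timeout, is ours. It does show, though, how an endpoint that hangs instead of failing fast can stall every request that tries to report an error:

```python
"""Simplified sketch of the failure mode -- hypothetical names, not our real service code."""
import requests

SENTRY_URL = "https://sentry.internal.example/api/store/"  # hypothetical endpoint


def report_error(event: dict) -> None:
    # BUG: no timeout. When Sentry's host is half-dead and connections hang,
    # this call blocks until the low-level network timeout fires, which can
    # take minutes -- and it blocks whatever request triggered the report.
    requests.post(SENTRY_URL, json=event)


def report_error_fixed(event: dict) -> None:
    # Bounded and non-fatal: a dead Sentry can delay us by at most a couple
    # of seconds and can never turn an error report into an outage.
    try:
        requests.post(SENTRY_URL, json=event, timeout=2)
    except requests.RequestException:
        pass  # dropping an error report is acceptable; blocking auth is not


def check_credentials(username: str, password: str) -> bool:
    # Stub standing in for the real credential check; here it simply fails.
    raise RuntimeError("database connection lost during failover")


def authenticate(username: str, password: str) -> bool:
    try:
        return check_credentials(username, password)
    except Exception as exc:
        # Reporting the error synchronously inside the request path means the
        # hung call to Sentry stalls every authentication attempt that errors.
        report_error({"message": str(exc), "user": username})
        raise
```

The fixed variant bounds the call and treats a failed report as non-fatal, so losing Sentry can no longer drag the authentication service down with it.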
It took us some time to track this down and to implement the changes needed to resolve the issue.
Here’s what we're going to do to make sure that this won't happen again: