Limited availability of our services

Incident Report for AdGuard

Postmortem

Summary
Our internal infrastructure experienced approximately one hour of degraded availability due to a failure in our network edge. One of our upstream provider’s routers in the datacenter became unreachable. As a result, one of our two edge routers lost upstream connectivity. While this is a relatively common failure scenario — and one we are explicitly architected to tolerate — our redundancy mechanism did not operate as expected.

Architecture Overview
Our edge routing stack is designed for high availability. It consists of two physical routers configured to act as a single logical gateway using a shared IP address. From the perspective of connected systems, this setup (often referred to as MLAG-style L3 redundancy or a "virtual router") appears as a single device.
Inbound traffic is typically distributed across both routers based on hashing (ECMP or per-flow load balancing), and under normal circumstances, either router can forward traffic upstream.
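To illustrate per-flow hashing, here is a minimal Python sketch. It is not our routers' actual algorithm: the router names, the choice of SHA-256, and the `pick_router` helper are all assumptions made for the example. It only shows the general idea that a flow's 5-tuple deterministically selects one member of the group.

```python
import hashlib

# Hypothetical names for the two physical edge routers.
ROUTERS = ["edge-a", "edge-b"]


def pick_router(src_ip, src_port, dst_ip, dst_port, proto, routers=ROUTERS):
    """Map a flow's 5-tuple to one router.

    The same 5-tuple always hashes to the same router, so packets of a
    single flow stay on one path while different flows spread across both.
    """
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # modulo the number of routers in the group.
    return routers[int.from_bytes(digest[:8], "big") % len(routers)]


if __name__ == "__main__":
    for port in (40000, 40001, 40002, 40003):
        print(port, pick_router("198.51.100.7", port, "203.0.113.10", 443, "tcp"))
```

The important property for this incident is that the hash only looks at the packet headers. It knows nothing about whether the selected router can actually reach the upstream.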

What Went Wrong
When one of the upstream links failed:

  • The affected router remained active in the logical group and continued accepting traffic.
  • The hash-based forwarding mechanism continued to assign flows to both routers, including the one that had no upstream connectivity.
  • As a result, approximately half of the traffic was routed to a black hole: it was silently dropped by the router with no upstream.
This manifested externally as intermittent availability — services appeared "flaky" or unreachable in ~50% of cases depending on which path the packet was hashed to.
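The roughly 50% failure rate follows directly from hashing flows evenly across two routers when only one of them can forward. The short simulation below is a sketch under simplified assumptions (synthetic flow IDs, a SHA-256 based hash, and the hypothetical helpers `router_index` and `delivered_fraction`); it does not model our real traffic.

```python
import hashlib
from typing import Sequence


def router_index(flow_id: str, n_routers: int) -> int:
    """Deterministically map a flow ID to a router index in [0, n_routers)."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_routers


def delivered_fraction(n_flows: int, upstream_ok: Sequence[bool]) -> float:
    """Fraction of flows that land on a router with working upstream.

    upstream_ok[i] says whether router i can forward traffic upstream.
    Every router in the sequence is still a member of the logical group,
    which is the failure mode described in this report.
    """
    delivered = sum(
        1
        for i in range(n_flows)
        if upstream_ok[router_index(f"flow-{i}", len(upstream_ok))]
    )
    return delivered / n_flows


if __name__ == "__main__":
    print("both upstreams up: ", delivered_fraction(10_000, [True, True]))
    print("one upstream down:", delivered_fraction(10_000, [True, False]))
```

With both upstreams healthy every flow is delivered. With one upstream down and the router still in the group, about half of the flows hash to the dead path, which matches what users observed.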

Mitigation
The immediate resolution involved manually removing the non-functional router from the logical group. This is a non-trivial operation, as simply powering down the router can have unintended side effects. The process took longer than expected due to its operational complexity.
Once the faulty node was excluded, all traffic was successfully routed through the healthy router, and services stabilized.

Next Steps
We are currently investigating why automatic failover did not trigger as expected. Our routers are designed to detect upstream failures and withdraw from the logical group accordingly, but that mechanism failed silently.
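For readers unfamiliar with upstream tracking, the expected behavior can be sketched as a small state machine. This is an illustration only: the `UpstreamTracker` class, its thresholds, and the probe model are assumptions for the example and do not describe our routers' actual configuration or any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class UpstreamTracker:
    """Tracks upstream health and decides group membership.

    The router leaves the logical group after `fail_threshold` consecutive
    failed probes and rejoins after `recover_threshold` consecutive
    successful ones. Both values are hypothetical.
    """

    fail_threshold: int = 3
    recover_threshold: int = 5
    in_group: bool = True
    _fails: int = 0
    _oks: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current membership."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.in_group and self._oks >= self.recover_threshold:
                self.in_group = True   # upstream is stable again: rejoin
        else:
            self._fails += 1
            self._oks = 0
            if self.in_group and self._fails >= self.fail_threshold:
                self.in_group = False  # upstream is down: withdraw
        return self.in_group


if __name__ == "__main__":
    tracker = UpstreamTracker()
    for result in [True, False, False, False, True, True, True, True, True]:
        print(result, "->", "in group" if tracker.observe(result) else "withdrawn")
```

During the incident, the equivalent of the "withdraw" transition never happened, so the router without upstream connectivity kept receiving its share of flows.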
As a follow-up, we will:

  • Reproduce the failure scenario in a controlled environment
  • Validate and adjust failover and tracking logic
  • Improve observability for edge failover behavior
  • Develop faster manual intervention playbooks
Posted Jul 11, 2025 - 13:46 UTC

Resolved

This incident has been resolved.
Posted Jul 11, 2025 - 13:43 UTC

Update

We are continuing to monitor for any further issues.
Posted Jul 11, 2025 - 13:09 UTC

Monitoring

The fix has been applied successfully. We're currently monitoring our services for any lingering impact.
Posted Jul 11, 2025 - 13:09 UTC

Identified

We've identified the issue and are currently implementing the fix.
Posted Jul 11, 2025 - 12:11 UTC

Investigating

We are currently investigating a network issue that is causing degraded availability of our websites, AdGuard VPN, and AdGuard DNS.
Posted Jul 11, 2025 - 11:52 UTC
This incident affected: Website & services, AdGuard DNS, and AdGuard VPN.