Summary
Our internal infrastructure experienced approximately one hour of degraded availability due to a failure in our network edge. One of our upstream provider’s routers in the datacenter became unreachable. As a result, one of our two edge routers lost upstream connectivity. While this is a relatively common failure scenario — and one we are explicitly architected to tolerate — our redundancy mechanism did not operate as expected.
Architecture Overview
Our edge routing stack is designed for high availability. It consists of two physical routers configured to act as a single logical gateway using a shared IP address. From the perspective of connected systems, this setup (often referred to as MLAG-style L3 redundancy or a "virtual router") appears as a single device.
Inbound traffic is typically distributed across both routers based on hashing (ECMP or per-flow load balancing), and under normal circumstances, either router can forward traffic upstream.
What Went Wrong
When one of the upstream links failed:The affected router remained active in the logical group and continued accepting traffic.
The hash-based forwarding mechanism continued to assign flows to both routers, including the one that had no upstream connectivity.
As a result, approximately half of the traffic was routed to a black hole — silently dropped by the router with no upstream.
This manifested externally as intermittent availability — services appeared "flaky" or unreachable in ~50% of cases depending on which path the packet was hashed to.
Mitigation
The immediate resolution involved manually removing the non-functional router from the logical group. This is a non-trivial operation, as simply powering down the router can have unintended side effects. The process took longer than expected due to its operational complexity.
Once the faulty node was excluded, all traffic was successfully routed through the healthy router, and services stabilized.
Next Steps
We are currently investigating why automatic failover did not trigger as expected. Our routers are designed to detect upstream failures and withdraw from the logical group accordingly, but that mechanism failed silently.
As a follow-up, we will: