Yesterday one of our standby database nodes stopped accepting PostgreSQL WAL files, which led to replication going out of sync. Here’s how it works: WAL files from the main node are transferred to 2 standby nodes in order to maintain near-real-time replication and failover. When replication on one of the standby nodes goes out of sync, WAL files are no longer deleted from the main node. Although we have monitoring in place, we thought that we had enough time to address the issue the next day.
Unfortunately, this was a mistake. WAL files filled all the free space on the main database node, and by 08:00 UTC the problem had become critical. We had to urgently shut down the main node, restore replication with the standby nodes, and bring it back up. This led to about 30 minutes of total downtime. To avoid this in the future, we raised the priority of the relevant alerts so that we react to similar problems faster.
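To illustrate the kind of escalation rule this implies, here is a minimal sketch in Python. It is not our actual alerting configuration: the names `WalStatus`, `hours_until_full`, and `alert_priority`, the 12-hour paging threshold, and the assumption that retained WAL grows at a roughly constant rate are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class WalStatus:
    """Hypothetical snapshot of the main node, as a monitoring agent might report it."""
    free_bytes: int                    # free space left on the volume holding WAL
    wal_growth_bytes_per_hour: float   # observed growth rate of retained WAL
    standby_in_sync: bool              # False if any standby has fallen behind


def hours_until_full(status: WalStatus) -> float:
    """Estimate how long until retained WAL fills the disk.

    Assumes a constant growth rate. Returns infinity when WAL is not
    accumulating (standbys in sync, or no measurable growth).
    """
    if status.standby_in_sync or status.wal_growth_bytes_per_hour <= 0:
        return float("inf")
    return status.free_bytes / status.wal_growth_bytes_per_hour


def alert_priority(status: WalStatus, page_below_hours: float = 12.0) -> str:
    """Map the status to a priority: 'ok', 'warning', or 'page'.

    An out-of-sync standby is always at least a warning; it becomes a
    page when the disk is projected to fill within `page_below_hours`.
    """
    if status.standby_in_sync:
        return "ok"
    if hours_until_full(status) < page_below_hours:
        return "page"
    return "warning"
```

The point of the sketch is that an out-of-sync standby is not judged by replication lag alone: the projected time until the main node runs out of disk decides whether the alert can wait until the next day.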