Production primary filled its data volume at 03:47 on a Thursday. PostgreSQL stopped accepting writes. The replication dashboard showed no lag. The connection count was normal. CPU was at 12%. Every metric that was monitored showed healthy. The one metric that was not monitored had been accumulating WAL for six days.
The Alert
Alert: /var/lib/postgresql/data at 98% capacity. Disk fills on a PostgreSQL primary are catastrophic — the database will stop writing. The data directory had 2TB. The database itself used 1.4TB. The remaining 600GB had disappeared over six days without any monitored metric changing.
First Hypothesis: Table Bloat
Dead tuples accumulating after a high-delete workload can grow a table's physical size significantly. We checked for bloated tables.
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size, n_dead_tup, n_live_tup, ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct, last_autovacuum FROM pg_stat_user_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 20;
Tables looked normal. No significant bloat. Autovacuum was current. The disk space was not in the tables.
The Discovery
-- Check WAL directory size SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal; -- The key query: check all replication slots SELECT slot_name, slot_type, active, active_pid, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;
Result: two active replication slots, both with minimal lag. One inactive slot: analytics_replica_slot. Created six days ago. Active: false. Active PID: null. Retained WAL: 547GB.
The analytics team had created a logical replication slot for a new reporting replica, connected once to test it, and then the replica had been taken offline for reconfiguration. The slot had sat inactive for six days, retaining every WAL segment generated since its creation. PostgreSQL will never remove WAL that a slot has not consumed.
Incident Timeline
| Time | Event |
|---|---|
| Day -6, 14:22 | analytics_replica_slot created. Replica connected briefly, then offline. |
| Day -6 to Day 0 | 547GB of WAL retained. No alert. No monitoring. |
| Day 0, 03:47 | Disk at 98%. Alert fires. |
| 03:52 | On-call engineer connects. Tables look healthy. WAL directory: 547GB. |
| 03:55 | pg_replication_slots queried. Inactive slot with 547GB retained WAL found. |
| 03:56 | Slot confirmed as abandoned (analytics team) |
| 03:57 | SELECT pg_drop_replication_slot('analytics_replica_slot') |
| 03:58 | 547GB of WAL removed. Disk at 23%. |
| 03:58 | PostgreSQL writes resume. Incident resolved. |
Root Cause
PostgreSQL guarantees that a replication slot consumer will never miss WAL segments — even if the consumer is offline for days or weeks. This is by design and is the correct behaviour. The gap was monitoring: lag in bytes and seconds measures active slots. An inactive slot with no consumer appears with zero lag because it has no active sender. pg_replication_slots is the only place where inactive slot retention is visible.
The Fix
-- Drop the abandoned slot immediately SELECT pg_drop_replication_slot('analytics_replica_slot'); -- Monitor all slots — active and inactive — every 5 minutes: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024/1024/1024 AS retained_gb FROM pg_replication_slots WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- > 1GB
Prevention
Any slot inactive for more than 1 hour with more than 5GB retained WAL now fires an alert. Any slot created must have an owner — a team name in a comment — and an expected maximum offline duration. Slots exceeding their offline window are dropped automatically after an alert and a 30-minute grace period.