QueryTuning.org
Database Query Performance Reference
PostgreSQLReplicationWAL

The Replication Slot That Filled the Disk While We Watched the Wrong Dashboards

MR
Marcus Reid
Staff DBA · PostgreSQL · 14 years
March 28, 2026
11 min read
PostgreSQL 10+

Production primary filled its data volume at 03:47 on a Thursday. PostgreSQL stopped accepting writes. The replication dashboard showed no lag. The connection count was normal. CPU was at 12%. Every metric that was monitored showed healthy. The one metric that was not monitored had been accumulating WAL for six days.

The Alert

Alert: /var/lib/postgresql/data at 98% capacity. Disk fills on a PostgreSQL primary are catastrophic — the database will stop writing. The data directory had 2TB. The database itself used 1.4TB. The remaining 600GB had disappeared over six days without any monitored metric changing.

First Hypothesis: Table Bloat

Dead tuples accumulating after a high-delete workload can grow a table's physical size significantly. We checked for bloated tables.

Find the largest tables and their bloat estimates
PostgreSQL
SELECT
  schemaname, tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  n_dead_tup,
  n_live_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
  last_autovacuum
FROM   pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

Tables looked normal. No significant bloat. Autovacuum was current. The disk space was not in the tables.

The Discovery

Check where the disk space actually went
PostgreSQL
-- Check WAL directory size
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal;

-- The key query: check all replication slots
SELECT
  slot_name,
  slot_type,
  active,
  active_pid,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  restart_lsn,
  confirmed_flush_lsn
FROM pg_replication_slots;

Result: two active replication slots, both with minimal lag. One inactive slot: analytics_replica_slot. Created six days ago. Active: false. Active PID: null. Retained WAL: 547GB.

The analytics team had created a logical replication slot for a new reporting replica, connected once to test it, and then the replica had been taken offline for reconfiguration. The slot had sat inactive for six days, retaining every WAL segment generated since its creation. PostgreSQL will never remove WAL that a slot has not consumed.

Incident Timeline

TimeEvent
Day -6, 14:22analytics_replica_slot created. Replica connected briefly, then offline.
Day -6 to Day 0547GB of WAL retained. No alert. No monitoring.
Day 0, 03:47Disk at 98%. Alert fires.
03:52On-call engineer connects. Tables look healthy. WAL directory: 547GB.
03:55pg_replication_slots queried. Inactive slot with 547GB retained WAL found.
03:56Slot confirmed as abandoned (analytics team)
03:57SELECT pg_drop_replication_slot('analytics_replica_slot')
03:58547GB of WAL removed. Disk at 23%.
03:58PostgreSQL writes resume. Incident resolved.

Root Cause

PostgreSQL guarantees that a replication slot consumer will never miss WAL segments — even if the consumer is offline for days or weeks. This is by design and is the correct behaviour. The gap was monitoring: lag in bytes and seconds measures active slots. An inactive slot with no consumer appears with zero lag because it has no active sender. pg_replication_slots is the only place where inactive slot retention is visible.

The Fix

Drop the abandoned slot and add monitoring
PostgreSQL
-- Drop the abandoned slot immediately
SELECT pg_drop_replication_slot('analytics_replica_slot');

-- Monitor all slots — active and inactive — every 5 minutes:
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024/1024/1024 AS retained_gb
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- > 1GB

Prevention

Any slot inactive for more than 1 hour with more than 5GB retained WAL now fires an alert. Any slot created must have an owner — a team name in a comment — and an expected maximum offline duration. Slots exceeding their offline window are dropped automatically after an alert and a 30-minute grace period.

MR
Marcus Reid
Staff DBA · PostgreSQL · 14 years · QueryTuning
Marcus has been managing large PostgreSQL deployments for 14 years. He writes about WAL management, replication, and the monitoring blind spots that fill disks at 3 AM.