The Replication Slot That Filled the Disk While We Watc

Production primary filled its data volume at 03:47 on a Thursday. PostgreSQL stopped accepting writes. The replication dashboard showed no lag. The connection count was normal. CPU was at 12%. Every metric that was monitored showed healthy. The one metric that was not monitored had been accumulating WAL for six days.

The Alert

Alert: /var/lib/postgresql/data at 98% capacity. Disk fills on a PostgreSQL primary are catastrophic — the database will stop writing. The data directory had 2TB. The database itself used 1.4TB. The remaining 600GB had disappeared over six days without any monitored metric changing.

First Hypothesis: Table Bloat

Dead tuples accumulating after a high-delete workload can grow a table's physical size significantly. We checked for bloated tables.

Find the largest tables and their bloat estimates

PostgreSQL

SELECT
  schemaname, tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  n_dead_tup,
  n_live_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
  last_autovacuum
FROM   pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

Tables looked normal. No significant bloat. Autovacuum was current. The disk space was not in the tables.

The Discovery

Check where the disk space actually went

PostgreSQL

-- Check WAL directory size
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal;

-- The key query: check all replication slots
SELECT
  slot_name,
  slot_type,
  active,
  active_pid,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  restart_lsn,
  confirmed_flush_lsn
FROM pg_replication_slots;

Result: two active replication slots, both with minimal lag. One inactive slot: analytics_replica_slot. Created six days ago. Active: false. Active PID: null. Retained WAL: 547GB.

The analytics team had created a logical replication slot for a new reporting replica, connected once to test it, and then the replica had been taken offline for reconfiguration. The slot had sat inactive for six days, retaining every WAL segment generated since its creation. PostgreSQL will never remove WAL that a slot has not consumed.

Incident Timeline

Time	Event
Day -6, 14:22	analytics_replica_slot created. Replica connected briefly, then offline.
Day -6 to Day 0	547GB of WAL retained. No alert. No monitoring.
Day 0, 03:47	Disk at 98%. Alert fires.
03:52	On-call engineer connects. Tables look healthy. WAL directory: 547GB.
03:55	pg_replication_slots queried. Inactive slot with 547GB retained WAL found.
03:56	Slot confirmed as abandoned (analytics team)
03:57	SELECT pg_drop_replication_slot('analytics_replica_slot')
03:58	547GB of WAL removed. Disk at 23%.
03:58	PostgreSQL writes resume. Incident resolved.

Root Cause

PostgreSQL guarantees that a replication slot consumer will never miss WAL segments — even if the consumer is offline for days or weeks. This is by design and is the correct behaviour. The gap was monitoring: lag in bytes and seconds measures active slots. An inactive slot with no consumer appears with zero lag because it has no active sender. pg_replication_slots is the only place where inactive slot retention is visible.

The Fix

Drop the abandoned slot and add monitoring

PostgreSQL

-- Drop the abandoned slot immediately
SELECT pg_drop_replication_slot('analytics_replica_slot');

-- Monitor all slots — active and inactive — every 5 minutes:
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024/1024/1024 AS retained_gb
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- > 1GB

Prevention

Any slot inactive for more than 1 hour with more than 5GB retained WAL now fires an alert. Any slot created must have an owner — a team name in a comment — and an expected maximum offline duration. Slots exceeding their offline window are dropped automatically after an alert and a 30-minute grace period.

Marcus Reid

Staff DBA · PostgreSQL · 14 years · QueryTuning

Marcus has been managing large PostgreSQL deployments for 14 years. He writes about WAL management, replication, and the monitoring blind spots that fill disks at 3 AM.

The Replication Slot That Filled the Disk While We Watched the Wrong Dashboards

The Alert

First Hypothesis: Table Bloat

The Discovery

Incident Timeline

Root Cause

The Fix

Prevention

The Same Problem on Other Engines