The archive_command had been returning exit code 0 for eleven days. Exit code 0 means success. PostgreSQL had been marking every WAL segment as successfully archived. But the archive destination was not receiving them. Eleven days of WAL had accumulated in pg_wal/ with no copy anywhere else. A single disk failure would have meant complete data loss for eleven days.
The Alert
Not an automated alert — a manual check during a routine DR test. The archive destination had 11 days of missing WAL segments. Attempting to restore from backup + WAL failed at the day-11 boundary. The archive_command in postgresql.conf was a shell script that had been silently swallowing errors since a certificate renewal had broken the remote archive endpoint.
The Discovery
SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, now() - last_archived_time AS time_since_last_archive FROM pg_stat_archiver; -- Alert if last archive was more than 5 minutes ago during active hours SELECT last_archived_wal, EXTRACT(EPOCH FROM (now() - last_archived_time)) / 60 AS minutes_since_archive, failed_count FROM pg_stat_archiver WHERE EXTRACT(EPOCH FROM (now() - last_archived_time)) > 300;
Incident Timeline
| Time | Event |
|---|---|
| Day -11 | TLS certificate renewed on archive server. rsync begins failing with exit 23. |
| Day -11 | archive_command script returns exit 0 regardless. PostgreSQL sees success. |
| Day -11 to 0 | 11 days of WAL accumulates in pg_wal/. No alert fires. |
| Day 0 | DR test run. Restore fails. Archive gap discovered. |
| +2 hours | Certificate issue identified and fixed. Backfill of 11 days of WAL initiated. |
Root Cause
The archive_command script explicitly returned 0 on all errors to prevent log flooding. PostgreSQL trusts the archive_command return code completely. A certificate error caused rsync to return exit 23, which the script swallowed and converted to 0, telling PostgreSQL every segment had archived successfully.
The Fix
#!/bin/bash # /usr/local/bin/wal_archive.sh %p %f WAL_FILE="$1" WAL_NAME="$2" DEST="archive-server:/wal-archive/" rsync -az --checksum "$WAL_FILE" "${DEST}${WAL_NAME}" EXIT_CODE=$? if [ $EXIT_CODE -ne 0 ]; then logger -t postgresql "WAL archive FAILED: $WAL_NAME exit=$EXIT_CODE" exit $EXIT_CODE # Never swallow — PostgreSQL must know fi exit 0
Prevention
pg_stat_archiver is now queried every 5 minutes. Any gap of more than 10 minutes fires a critical alert. The DR restore test runs monthly. archive_command scripts are reviewed for exit code handling before any infrastructure change.
Confirming the Archive Gap and Measuring It
Before fixing the archive, you need to know exactly how many WAL segments are missing and whether any are still in pg_wal/ — on disk — so they can be copied manually to the archive destination. Once WAL segments are recycled by PostgreSQL, they are permanently lost.
-- Count unarchived WAL segments currently in pg_wal/: -- Run this from the OS (as postgres user): -- ls $PGDATA/pg_wal/ | wc -l -- Each WAL segment is 16MB by default (wal_segment_size). -- 100 segments = 1.6GB of data not yet archived. -- From SQL — check how far behind the archive is: SELECT last_archived_wal, last_archived_time, now() - last_archived_time AS archive_lag, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver; -- The LSN gap between last archived and current: SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal, last_archived_wal, pg_size_pretty( pg_wal_lsn_diff( pg_current_wal_lsn(), ('x' || substring(last_archived_wal, 9))::bit(32)::bigint::text || '/' || ('x' || substring(last_archived_wal, 17))::bit(32)::bigint * 16777216 ) ) AS unarchived_wal_estimate FROM pg_stat_archiver;
## From the OS, as the postgres user, test the exact command: ## Replace %p with the full path to a WAL file and %f with its filename # Find a current WAL file to test with: # ls $PGDATA/pg_wal/000000010000* | head -1 # Run the archive command manually: # /usr/local/bin/wal_archive.sh /var/lib/postgresql/14/main/pg_wal/000000010000000100000001 000000010000000100000001 # echo $? ← must be 0 for archive to be working