QueryTuning.org
Database Query Performance Reference
PostgreSQLWAL ArchivingDisk

Our archive_command Was Silently Returning 0 While Failing — WAL Filled the Disk in Six Days

MR
Marcus Reid
Staff DBA · PostgreSQL · 14 years
March 7, 2026
9 min read
PostgreSQL 9.4+

The archive_command had been returning exit code 0 for eleven days. Exit code 0 means success. PostgreSQL had been marking every WAL segment as successfully archived. But the archive destination was not receiving them. Eleven days of WAL had accumulated in pg_wal/ with no copy anywhere else. A single disk failure would have meant complete data loss for eleven days.

The Alert

Not an automated alert — a manual check during a routine DR test. The archive destination had 11 days of missing WAL segments. Attempting to restore from backup + WAL failed at the day-11 boundary. The archive_command in postgresql.conf was a shell script that had been silently swallowing errors since a certificate renewal had broken the remote archive endpoint.

The Discovery

Check archive status — the most important monitoring query for PITR setups
PostgreSQL
SELECT
  archived_count,
  last_archived_wal,
  last_archived_time,
  failed_count,
  last_failed_wal,
  last_failed_time,
  now() - last_archived_time AS time_since_last_archive
FROM   pg_stat_archiver;

-- Alert if last archive was more than 5 minutes ago during active hours
SELECT
  last_archived_wal,
  EXTRACT(EPOCH FROM (now() - last_archived_time)) / 60 AS minutes_since_archive,
  failed_count
FROM pg_stat_archiver
WHERE EXTRACT(EPOCH FROM (now() - last_archived_time)) > 300;

Incident Timeline

TimeEvent
Day -11TLS certificate renewed on archive server. rsync begins failing with exit 23.
Day -11archive_command script returns exit 0 regardless. PostgreSQL sees success.
Day -11 to 011 days of WAL accumulates in pg_wal/. No alert fires.
Day 0DR test run. Restore fails. Archive gap discovered.
+2 hoursCertificate issue identified and fixed. Backfill of 11 days of WAL initiated.

Root Cause

The archive_command script explicitly returned 0 on all errors to prevent log flooding. PostgreSQL trusts the archive_command return code completely. A certificate error caused rsync to return exit 23, which the script swallowed and converted to 0, telling PostgreSQL every segment had archived successfully.

The Fix

A correct archive_command that never swallows errors
PostgreSQL
#!/bin/bash
# /usr/local/bin/wal_archive.sh %p %f
WAL_FILE="$1"
WAL_NAME="$2"
DEST="archive-server:/wal-archive/"

rsync -az --checksum "$WAL_FILE" "${DEST}${WAL_NAME}"
EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
  logger -t postgresql "WAL archive FAILED: $WAL_NAME exit=$EXIT_CODE"
  exit $EXIT_CODE  # Never swallow — PostgreSQL must know
fi
exit 0

Prevention

pg_stat_archiver is now queried every 5 minutes. Any gap of more than 10 minutes fires a critical alert. The DR restore test runs monthly. archive_command scripts are reviewed for exit code handling before any infrastructure change.

Confirming the Archive Gap and Measuring It

Before fixing the archive, you need to know exactly how many WAL segments are missing and whether any are still in pg_wal/ — on disk — so they can be copied manually to the archive destination. Once WAL segments are recycled by PostgreSQL, they are permanently lost.

Measure the gap — how many segments, how much data
PostgreSQL
-- Count unarchived WAL segments currently in pg_wal/:
-- Run this from the OS (as postgres user):
-- ls $PGDATA/pg_wal/ | wc -l
-- Each WAL segment is 16MB by default (wal_segment_size).
-- 100 segments = 1.6GB of data not yet archived.

-- From SQL — check how far behind the archive is:
SELECT
  last_archived_wal,
  last_archived_time,
  now() - last_archived_time                         AS archive_lag,
  failed_count,
  last_failed_wal,
  last_failed_time
FROM   pg_stat_archiver;

-- The LSN gap between last archived and current:
SELECT
  pg_walfile_name(pg_current_wal_lsn())    AS current_wal,
  last_archived_wal,
  pg_size_pretty(
    pg_wal_lsn_diff(
      pg_current_wal_lsn(),
      ('x' || substring(last_archived_wal, 9))::bit(32)::bigint::text || '/'
        || ('x' || substring(last_archived_wal, 17))::bit(32)::bigint * 16777216
    )
  ) AS unarchived_wal_estimate
FROM   pg_stat_archiver;
Test the archive command manually before relying on PostgreSQL to retry
PostgreSQL
## From the OS, as the postgres user, test the exact command:
## Replace %p with the full path to a WAL file and %f with its filename

# Find a current WAL file to test with:
# ls $PGDATA/pg_wal/000000010000* | head -1

# Run the archive command manually:
# /usr/local/bin/wal_archive.sh /var/lib/postgresql/14/main/pg_wal/000000010000000100000001 000000010000000100000001
# echo $?   ← must be 0 for archive to be working
MR
Marcus Reid
Staff DBA · PostgreSQL · 14 years · QueryTuning
Marcus has been managing large PostgreSQL deployments for 14 years. He writes about WAL management, archiving strategy, and the operational blind spots that fill disks at 3 AM.