Six weeks of phantom slowdowns. Every night between 02:00 and 06:00, query latency across all tables would climb 3-5x. CPU was flat. Connections were normal. Disk I/O was pegged at 100% — but no query in pg_stat_activity was doing anything obviously expensive. The cause was autovacuum, running with no I/O budget, consuming every disk operation the server had.
The Alert
Overnight monitoring showed disk read/write latency spiking every night in the same window. The pattern was too regular to be random queries. Something scheduled was running. The autovacuum process does not appear in pg_stat_activity with a query — it shows as a maintenance process — and for six weeks it had not been on the investigation list.
First Hypothesis: Overnight Batch Jobs
Cron jobs, maintenance scripts, or analytics queries running overnight were the first assumption. We pulled pg_stat_statements for the high-I/O window.
SELECT substring(query, 1, 80) AS query_preview, calls, round(total_exec_time::numeric, 2) AS total_ms, shared_blks_read, shared_blks_written, local_blks_read + temp_blks_read AS temp_reads FROM pg_stat_statements ORDER BY shared_blks_read + shared_blks_written DESC LIMIT 20;
No smoking gun. The I/O was not from user queries. It had to be a background process. We looked at pg_stat_progress_vacuum.
The Discovery
SELECT schemaname, relname, phase, heap_blks_total, heap_blks_scanned, round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 2) AS pct_done, index_vacuum_count, num_dead_tuples FROM pg_stat_progress_vacuum; -- Also check the cost settings: SHOW autovacuum_vacuum_cost_delay; -- default: 2ms SHOW autovacuum_vacuum_cost_limit; -- default: 200
autovacuum_vacuum_cost_delay was 2ms. autovacuum_vacuum_cost_limit was 200. These are the default settings. They allow autovacuum to consume approximately 100MB/second of I/O with minimal throttling. The events table — 800GB, 120 million rows, with 30% dead tuple ratio — was being vacuumed every night for 4 hours with no I/O cap, saturating the disk while every other query waited for I/O.
Incident Timeline
| Time | Event |
|---|---|
| 6 weeks ago | Events table grows past autovacuum trigger threshold. Nightly vacuums begin. |
| Every night 02:00 | Autovacuum starts on events table. I/O saturates immediately. |
| Every night 02:00-06:00 | All queries competing for disk I/O. 3-5x latency increase. |
| Week 6 | Investigation begins. pg_stat_progress_vacuum checked for first time. |
| Investigation day | autovacuum_vacuum_cost_delay=2ms identified as the unrestricted setting |
| +30 minutes | Per-table cost delay set to 20ms on events table |
| Next night | Vacuum runs 10x slower. I/O remains within budget. No latency spike. |
Root Cause
Default autovacuum cost settings allow vacuum to run with minimal I/O throttling. On small tables with low dead tuple rates this is fine. On an 800GB table with 30% dead tuples, a 4-hour vacuum at 100MB/second saturates any shared disk subsystem. The default settings are correct for tables up to a few GB. They require tuning on large tables.
The Fix
-- Apply a strict I/O budget to the large table -- cost_delay=20ms + cost_limit=200 = roughly 10MB/s max I/O from vacuum ALTER TABLE events SET ( autovacuum_vacuum_cost_delay = 20, autovacuum_vacuum_cost_limit = 200 ); -- For the global setting, a more conservative default: -- autovacuum_vacuum_cost_delay = 10ms # in postgresql.conf -- autovacuum_vacuum_cost_limit = 200 -- Monitor vacuum progress and I/O impact: SELECT relname, phase, heap_blks_scanned, heap_blks_total, round(100.0*heap_blks_scanned/nullif(heap_blks_total,0),1) AS pct FROM pg_stat_progress_vacuum;
Prevention
Any table over 50GB now has explicit autovacuum cost settings as part of its DDL definition. A weekly report checks all tables over 10GB that are using default cost settings and flags them for review.