Autovacuum Ate Our Entire IOPS Budget for Six Weeks

Six weeks of phantom slowdowns. Every night between 02:00 and 06:00, query latency across all tables would climb 3-5x. CPU was flat. Connections were normal. Disk I/O was pegged at 100% — but no query in pg_stat_activity was doing anything obviously expensive. The cause was autovacuum, running with no I/O budget, consuming every disk operation the server had.

The Alert

Overnight monitoring showed disk read/write latency spiking every night in the same window. The pattern was too regular to be random queries. Something scheduled was running. The autovacuum process does not appear in pg_stat_activity with a query — it shows as a maintenance process — and for six weeks it had not been on the investigation list.

First Hypothesis: Overnight Batch Jobs

Cron jobs, maintenance scripts, or analytics queries running overnight were the first assumption. We pulled pg_stat_statements for the high-I/O window.

Find high I/O queries from pg_stat_statements

PostgreSQL

SELECT
  substring(query, 1, 80) AS query_preview,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  shared_blks_read,
  shared_blks_written,
  local_blks_read + temp_blks_read AS temp_reads
FROM   pg_stat_statements
ORDER BY shared_blks_read + shared_blks_written DESC
LIMIT 20;

No smoking gun. The I/O was not from user queries. It had to be a background process. We looked at pg_stat_progress_vacuum.

The Discovery

Monitor active autovacuum workers

PostgreSQL

SELECT
  schemaname,
  relname,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 2) AS pct_done,
  index_vacuum_count,
  num_dead_tuples
FROM   pg_stat_progress_vacuum;

-- Also check the cost settings:
SHOW autovacuum_vacuum_cost_delay;   -- default: 2ms
SHOW autovacuum_vacuum_cost_limit;   -- default: 200

autovacuum_vacuum_cost_delay was 2ms. autovacuum_vacuum_cost_limit was 200. These are the default settings. They allow autovacuum to consume approximately 100MB/second of I/O with minimal throttling. The events table — 800GB, 120 million rows, with 30% dead tuple ratio — was being vacuumed every night for 4 hours with no I/O cap, saturating the disk while every other query waited for I/O.

Incident Timeline

Time	Event
6 weeks ago	Events table grows past autovacuum trigger threshold. Nightly vacuums begin.
Every night 02:00	Autovacuum starts on events table. I/O saturates immediately.
Every night 02:00-06:00	All queries competing for disk I/O. 3-5x latency increase.
Week 6	Investigation begins. pg_stat_progress_vacuum checked for first time.
Investigation day	autovacuum_vacuum_cost_delay=2ms identified as the unrestricted setting
+30 minutes	Per-table cost delay set to 20ms on events table
Next night	Vacuum runs 10x slower. I/O remains within budget. No latency spike.

Root Cause

Default autovacuum cost settings allow vacuum to run with minimal I/O throttling. On small tables with low dead tuple rates this is fine. On an 800GB table with 30% dead tuples, a 4-hour vacuum at 100MB/second saturates any shared disk subsystem. The default settings are correct for tables up to a few GB. They require tuning on large tables.

The Fix

Set per-table autovacuum cost limits for large tables

PostgreSQL

-- Apply a strict I/O budget to the large table
-- cost_delay=20ms + cost_limit=200 = roughly 10MB/s max I/O from vacuum
ALTER TABLE events SET (
  autovacuum_vacuum_cost_delay = 20,
  autovacuum_vacuum_cost_limit = 200
);

-- For the global setting, a more conservative default:
-- autovacuum_vacuum_cost_delay = 10ms   # in postgresql.conf
-- autovacuum_vacuum_cost_limit = 200

-- Monitor vacuum progress and I/O impact:
SELECT relname, phase, heap_blks_scanned, heap_blks_total,
       round(100.0*heap_blks_scanned/nullif(heap_blks_total,0),1) AS pct
FROM   pg_stat_progress_vacuum;

Prevention

Any table over 50GB now has explicit autovacuum cost settings as part of its DDL definition. A weekly report checks all tables over 10GB that are using default cost settings and flags them for review.

Sanjay Rao

Staff DBA · PostgreSQL · 11 years production · QueryTuning

Written by a working DBA from direct production experience. All queries in this article have been run on live database instances.

The Autovacuum That Quietly Ate Our Entire IOPS Budget for Six Weeks

The Alert

First Hypothesis: Overnight Batch Jobs

The Discovery

Incident Timeline

Root Cause

The Fix

Prevention

The Same Problem on Other Engines