Query Plan Changed Overnight with No Code Change

Tuesday morning standups had barely started when the first message landed in the database channel: the customer dashboard was loading in 18 seconds. Monday it had loaded in 400ms. No deployment overnight. No schema changes. The query was identical. The plan was not.

The Alert

Application performance monitoring: api/dashboard/customer p99 latency 18,400ms. Threshold: 2,000ms. The endpoint ran one primary query — a JOIN across three tables returning customer order summaries. That query had been stable at 380ms for four months. At some point between 23:00 and 08:00 it had switched plans.

First Hypothesis: Data Volume Changed

A large batch import or data load overnight could have shifted table statistics enough to change the planner's decision. We checked pg_stat_user_tables for recent changes.

Check for recent large inserts or updates

PostgreSQL

SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_analyze,
  last_autoanalyze,
  last_autovacuum,
  n_ins_since_vacuum
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'order_items');

autoanalyze had run on order_items at 02:14. 847,000 rows had been inserted by the overnight batch — enough to trigger analyze. The statistics had updated. But the question was whether the new statistics justified the new plan or had led the planner somewhere wrong.

The Discovery

Compare estimated vs actual rows using EXPLAIN ANALYZE

PostgreSQL

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.customer_id, c.name, COUNT(o.order_id), SUM(oi.total)
FROM   customers   c
JOIN   orders      o  ON o.customer_id = c.customer_id
JOIN   order_items oi ON oi.order_id = o.order_id
WHERE  c.account_status = 'active'
GROUP BY 1, 2;

The plan now used a Hash Join instead of a Nested Loop. The Hash Join was appropriate for large datasets but catastrophic for this query. Estimated rows from customers: 124,000. Actual: 12,400. The planner was estimating 10x more active customers than existed because the statistics on the account_status column had a histogram with only 10 buckets — not enough resolution to accurately estimate a low-selectivity filter on a skewed column.

Incident Timeline

Time	Event
02:14	Autoanalyze runs on order_items (847K new rows from batch)
02:14	Statistics updated — histogram rebuild with default 300 buckets
02:15	Plan cache for the dashboard query invalidated
~02:15	First execution with new plan — Hash Join chosen. 18s execution.
08:17	Alert fires. Dashboard p99 at 18,400ms.
08:45	EXPLAIN ANALYZE run. Bad cardinality estimate identified.
09:02	ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500 applied to account_status
09:03	ANALYZE customers run manually
09:04	Plan reverts to Nested Loop. Dashboard: 390ms.

Root Cause

The default statistics target (100 histogram buckets) was insufficient to accurately represent the distribution of account_status. After the batch insert changed the ratio of active to inactive customers, the low-resolution histogram produced a 10x overestimate of active customer count. The planner chose Hash Join — correct for 124,000 rows, catastrophic for 12,400.

The Fix

Increase statistics target on selective columns

PostgreSQL

-- Increase histogram resolution for this column
ALTER TABLE customers
ALTER COLUMN account_status SET STATISTICS 500;

-- Rebuild statistics immediately
ANALYZE customers;

-- Pin the plan if regressions keep occurring:
-- Use pg_hint_plan extension or SET enable_hashjoin = off for this session

Prevention

Any column used in a WHERE clause on a table over 1M rows now gets SET STATISTICS 500 as part of the schema migration checklist. A weekly query compares estimated vs actual row counts in pg_stat_statements for the top 50 queries by total time — any query with a 5x+ estimation error triggers a statistics audit.

Marcus Reid

Staff DBA · PostgreSQL Contributor · QueryTuning

Marcus has been managing large PostgreSQL deployments for 14 years — most recently as Staff DBA at a payments company running 40,000 transactions per second. He writes about query internals, autovacuum, and the incidents that actually teach you something.

Why Your PostgreSQL Query Plan Changed Overnight — and How to Pin It Back

The Alert

First Hypothesis: Data Volume Changed

The Discovery

Incident Timeline

Root Cause

The Fix

Prevention

The Same Problem on Other Engines