Tuesday morning standups had barely started when the first message landed in the database channel: the customer dashboard was loading in 18 seconds. Monday it had loaded in 400ms. No deployment overnight. No schema changes. The query was identical. The plan was not.
The Alert
Application performance monitoring: api/dashboard/customer p99 latency 18,400ms. Threshold: 2,000ms. The endpoint ran one primary query — a JOIN across three tables returning customer order summaries. That query had been stable at 380ms for four months. At some point between 23:00 and 08:00 it had switched plans.
First Hypothesis: Data Volume Changed
A large batch import or data load overnight could have shifted table statistics enough to change the planner's decision. We checked pg_stat_user_tables for recent changes.
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze, last_autovacuum, n_ins_since_vacuum FROM pg_stat_user_tables WHERE relname IN ('orders', 'customers', 'order_items');
autoanalyze had run on order_items at 02:14. 847,000 rows had been inserted by the overnight batch — enough to trigger analyze. The statistics had updated. But the question was whether the new statistics justified the new plan or had led the planner somewhere wrong.
The Discovery
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT c.customer_id, c.name, COUNT(o.order_id), SUM(oi.total) FROM customers c JOIN orders o ON o.customer_id = c.customer_id JOIN order_items oi ON oi.order_id = o.order_id WHERE c.account_status = 'active' GROUP BY 1, 2;
The plan now used a Hash Join instead of a Nested Loop. The Hash Join was appropriate for large datasets but catastrophic for this query. Estimated rows from customers: 124,000. Actual: 12,400. The planner was estimating 10x more active customers than existed because the statistics on the account_status column had a histogram with only 10 buckets — not enough resolution to accurately estimate a low-selectivity filter on a skewed column.
Incident Timeline
| Time | Event |
|---|---|
| 02:14 | Autoanalyze runs on order_items (847K new rows from batch) |
| 02:14 | Statistics updated — histogram rebuild with default 300 buckets |
| 02:15 | Plan cache for the dashboard query invalidated |
| ~02:15 | First execution with new plan — Hash Join chosen. 18s execution. |
| 08:17 | Alert fires. Dashboard p99 at 18,400ms. |
| 08:45 | EXPLAIN ANALYZE run. Bad cardinality estimate identified. |
| 09:02 | ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500 applied to account_status |
| 09:03 | ANALYZE customers run manually |
| 09:04 | Plan reverts to Nested Loop. Dashboard: 390ms. |
Root Cause
The default statistics target (100 histogram buckets) was insufficient to accurately represent the distribution of account_status. After the batch insert changed the ratio of active to inactive customers, the low-resolution histogram produced a 10x overestimate of active customer count. The planner chose Hash Join — correct for 124,000 rows, catastrophic for 12,400.
The Fix
-- Increase histogram resolution for this column ALTER TABLE customers ALTER COLUMN account_status SET STATISTICS 500; -- Rebuild statistics immediately ANALYZE customers; -- Pin the plan if regressions keep occurring: -- Use pg_hint_plan extension or SET enable_hashjoin = off for this session
Prevention
Any column used in a WHERE clause on a table over 1M rows now gets SET STATISTICS 500 as part of the schema migration checklist. A weekly query compares estimated vs actual row counts in pg_stat_statements for the top 50 queries by total time — any query with a 5x+ estimation error triggers a statistics audit.