Tuesday, 09:15. CPU hit 100% and stayed there for 90 seconds. Application latency tripled. The batch ETL job that loads overnight transaction data was still running — but it had been running without issues for months. The CPU spike was not from the ETL itself.
The Alert
SELECT TOP 20 r.session_id, r.status, r.command, r.wait_type, r.cpu_time, r.total_elapsed_time / 1000 AS elapsed_sec, SUBSTRING(st.text, (r.statement_start_offset/2)+1, 200) AS stmt FROM sys.dm_exec_requests r CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st WHERE r.session_id > 50 ORDER BY r.cpu_time DESC;
Dozens of sessions, all with command = SELECT, all showing wait_type = NULL (meaning they were actively burning CPU). Not one heavy query — many light queries all compiling simultaneously. We ran it three times in 10 seconds: the session list changed each time, but every snapshot showed 40+ sessions actively compiling. This was a recompilation storm.
First Hypothesis: Plan Cache Eviction
If the plan cache was under memory pressure, plans would be evicted and every query would need fresh compilation. We checked plan cache counters three consecutive times.
SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name IN ( 'SQL Compilations/sec', 'SQL Re-Compilations/sec', 'Cache Object Counts', 'Cache Pages' ) AND instance_name IN ('', 'SQL Plans', '_Total');
Cache Object Counts: 24,000 — healthy. Cache Pages: stable across all three checks. No eviction. But SQL Re-Compilations/sec: 22, then 19, then 24. Normal is under 2. Something was forcing mass recompilation without cache eviction — plans were being invalidated, not removed.
The Discovery
We checked what was triggering the recompilations.
SELECT TOP 10 OBJECT_NAME(st.objectid) AS proc_name, qs.plan_generation_num, qs.creation_time, qs.execution_count FROM sys.dm_exec_query_stats qs CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st WHERE qs.creation_time > DATEADD(MINUTE, -5, GETDATE()) ORDER BY qs.plan_generation_num DESC;
Every plan referencing dbo.daily_transactions had been recompiled. plan_generation_num was 8, 12, 15 on various plans. The ETL had inserted 2.1 million rows into that table, crossing the auto-update statistics threshold (500 + 20% of table rows). SQL Server triggered a synchronous statistics update, which invalidated every cached plan referencing the table. 2,000 queries recompiled within 90 seconds.
We verified this three times: checked sys.dm_db_stats_properties for last_updated timestamp on the table's statistics — all showed the exact same timestamp during the CPU spike window. Confirmed across three stats objects on the same table.
Incident Timeline
| Time | Event |
|---|---|
| 08:00 | ETL batch starts loading into daily_transactions |
| 09:14 | Row count crosses auto-update threshold (2.1M rows inserted) |
| 09:14 | Synchronous stats update fires on 6 statistics objects |
| 09:15 | 2,000+ cached plans invalidated. Mass recompile begins. CPU 100%. |
| 09:16:30 | Recompilations complete. Plans re-cached. CPU returns to 35%. |
Root Cause
The thundering-herd recompile is one of SQL Server's most underappreciated failure modes. Unlike blocking chains or memory pressure, it does not appear in any single DMV as a clear signal. CPU spikes to 100%, but the offending sessions are not long-running — each compilation takes 200–500ms. The damage comes from the sheer volume: 2,000 compilations at 400ms average, with 8 schedulers processing them, means roughly 100 seconds of sustained 100% CPU. During that window, every new query also needs to compile — its plan was also invalidated — extending the storm.
The ETL loaded enough rows to cross the auto-update statistics threshold on a hot table referenced by 2,000+ cached plans. Statistics update was synchronous (the default), meaning it happened inline with the next query execution. Every plan was invalidated simultaneously, causing the thundering-herd recompile. The 90-second CPU spike was the cost of 2,000 simultaneous compilations.
The Fix
-- Enable asynchronous statistics updates ALTER DATABASE ProdDB SET AUTO_UPDATE_STATISTICS_ASYNC ON; -- Pre-update stats after ETL completes, before app traffic UPDATE STATISTICS dbo.daily_transactions WITH FULLSCAN; -- Verify the setting SELECT name, is_auto_update_stats_async_on FROM sys.databases WHERE name = 'ProdDB';
Prevention
AUTO_UPDATE_STATISTICS_ASYNC is now ON — stale stats serve the current query while the update happens in the background, eliminating the thundering-herd recompile. The ETL job explicitly runs UPDATE STATISTICS WITH FULLSCAN on all loaded tables as its final step, before application traffic ramps up. Recompilations/sec is now a dashboard metric with an alert threshold at 10.
We also investigated why 2,000 plans were referencing a single table. Many were ad-hoc queries from the application ORM — each with slightly different parameter combinations generating unique plan cache entries. To reduce the blast radius of future stats updates, we enabled OPTIMIZE FOR AD HOC WORKLOADS, which stores a stub for first-time ad-hoc queries and only caches the full plan on second execution. This reduced the plan cache from 24,000 entries to 8,000, concentrating the entries on queries that actually repeat.
The underlying design issue was the ETL loading directly into the OLTP table. A better pattern — which we implemented the following sprint — stages data into a separate table, runs UPDATE STATISTICS on the stage table, then uses a partition switch to atomically move data into the production table. The partition switch does not trigger auto-update on the production statistics, eliminating the recompile storm entirely. — stale stats serve the current query while the update happens in the background. The ETL job explicitly runs UPDATE STATISTICS WITH FULLSCAN on all loaded tables as its final step, before application traffic ramps up. Recompilations/sec is now a dashboard metric with an alert threshold at 10.