Wednesday, 03:14. On-call alert: TempDB data files at 97% capacity. The server had 800GB allocated to TempDB. It was almost full. At 03:22 — eight minutes later — SQL Server ran out of TempDB space and started killing connections. The source: one query, running a sort that SQL Server had estimated would need 4GB but was actually processing 180GB of intermediate data.
The Alert
Monitoring fired on TempDB space at 97%. By the time the on-call engineer connected, disk writes to the TempDB volume were at 2.4GB/second sustained. A single query was writing continuously. CPU was 34% — low for the write volume, which indicated a sort operation, not a compute-heavy query.
First Hypothesis: Runaway Temp Table
A runaway process inserting into a temp table without bounds was the first assumption. We checked active temp table usage.
SELECT s.session_id, s.login_name, s.host_name, r.command, r.status, r.total_elapsed_time / 1000 AS elapsed_sec, tsu.user_objects_alloc_page_count * 8 / 1024 AS temp_table_mb, tsu.internal_objects_alloc_page_count * 8 / 1024 AS sort_spill_mb FROM sys.dm_exec_sessions s JOIN sys.dm_exec_requests r ON r.session_id = s.session_id JOIN sys.dm_db_task_space_usage tsu ON tsu.session_id = s.session_id WHERE tsu.internal_objects_alloc_page_count > 0 OR tsu.user_objects_alloc_page_count > 100000 ORDER BY (tsu.internal_objects_alloc_page_count + tsu.user_objects_alloc_page_count) DESC;
One session. SPID 284. user_objects_alloc_page_count: 0. internal_objects_alloc_page_count: 52,428,800 pages — 400GB. Not a temp table. A sort spill. The internal_objects column captures sort and hash join spills to TempDB.
The Discovery
The query behind SPID 284 — a monthly finance report, scheduled at 03:00 — was sorting 180GB of intermediate data. SQL Server had granted it 4GB of memory for the sort based on a cardinality estimate of 2 million rows. The actual intermediate dataset was 14 million rows. When the in-memory sort exhausted its 4GB grant, SQL Server began spilling to TempDB page by page, at 2.4GB per second, for 12 minutes.
SELECT qp.query_plan, qs.total_spills / qs.execution_count AS avg_spill_pages, qs.max_spills AS max_spill_pages, qs.total_grant_kb / qs.execution_count AS avg_grant_kb FROM sys.dm_exec_query_stats qs CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp WHERE st.text LIKE '%finance_monthly%' ORDER BY avg_spill_pages DESC;
The plan showed a Sort operator with a SpillLevel warning. Estimated rows: 2,100,000. Actual rows processed: 14,300,000. A 7x underestimate. The bad estimate came from a JOIN across three tables with correlated predicates — the cardinality estimation for correlated columns is notoriously inaccurate in SQL Server 2016 CE.
Incident Timeline
| Time | Event |
|---|---|
| 03:00:00 | Monthly finance report job starts |
| 03:00:08 | Query begins executing — memory grant: 4GB |
| 03:02:14 | In-memory sort exhausted. Spill begins to TempDB. |
| 03:14:20 | TempDB at 97% capacity. Alert fires. |
| 03:19:00 | On-call engineer identifies SPID 284 — 400GB spill, still growing |
| 03:22:30 | TempDB full. SQL Server kills connections. Outage begins. |
| 03:24:00 | KILL 284 executed. Spill stops. TempDB space released. |
| 03:26:00 | Normal connections resume. |
Root Cause
The cardinality estimator underestimated intermediate row counts by 7x due to correlated join predicates. The memory grant was sized for 2.1 million rows. The sort processed 14.3 million rows. The entire excess spilled to TempDB at disk I/O speed. TempDB had no size cap — it expanded until the volume was full.
The Fix
-- Immediate: kill the session KILL 284; -- Permanent: give the sort enough memory to avoid spilling -- MIN_GRANT_PERCENT forces a minimum grant regardless of estimate SELECT ... ORDER BY ... OPTION (MIN_GRANT_PERCENT = 10); -- 10% of max server memory -- Or use MAXDOP 1 + larger grant for serial sort: OPTION (MAXDOP 1, MIN_GRANT_PERCENT = 25);
-- Restrict max size on TempDB files -- Alert long before the cap is reached ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, MAXSIZE = 200GB); -- Monitor TempDB space from a job every 5 minutes: SELECT SUM(unallocated_extent_page_count) * 8 / 1024 AS free_mb, SUM(version_store_reserved_page_count) * 8 / 1024 AS version_store_mb, SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_obj_mb FROM sys.dm_db_file_space_usage;
Prevention
TempDB files now have a hard cap. The finance report runs with an explicit memory hint sized to its actual data volume. A pre-execution query estimates intermediate row count and aborts with an error if the estimate exceeds 5 million rows — requiring manual approval before the large sort runs in production.