Monday morning. 09:14. Twelve batch jobs fired simultaneously as they do every weekday — pulling weekend transaction data into reporting tables using stored procedures that had run without issue for two years. Within 90 seconds, our OLTP application slowed to a crawl. Order entry screens froze. API response times went from 40ms to 14 seconds. The on-call alert fired at 09:15.
CPU was at 6%. Memory was healthy. No blocking chains. No long-running queries in sys.dm_exec_requests. The dashboards showed a system that looked completely idle — on a server that was visibly on fire.
The Alert
The monitoring alert was application-side: order API p99 latency exceeded 10 seconds. Not a database alert. The database metrics — CPU, memory, connection count, PLE — all looked normal. That mismatch was the first signal that the problem was not in any metric we were watching.
Standard SQL Server monitoring covers CPU, memory, blocking, and long-running queries. It does not surface latch waits unless you specifically query sys.dm_os_wait_stats or sys.dm_os_latch_stats. This incident was invisible to our monitoring for exactly that reason.
First Look — Wait Stats
The right place to start when the system is slow but nothing looks obviously wrong is wait statistics. Not the monitoring dashboard — the raw DMV.
SELECT TOP 15 wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, CAST(wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS DECIMAL(10,2)) AS avg_wait_ms FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ( 'SLEEP_TASK', 'BROKER_TO_FLUSH', 'BROKER_EVENTHANDLER', 'CHECKPOINT_QUEUE', 'DBMIRROR_EVENTS_QUEUE', 'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_WAIT', 'WAITFOR', 'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'RESOURCE_QUEUE', 'SERVER_IDLE_CHECK', 'SLEEP_DBSTARTUP', 'SLEEP_DCOMSTARTUP', 'SLEEP_MASTERDBREADY', 'SLEEP_MASTERMDREADY', 'SLEEP_MASTERUPGRADED', 'SLEEP_MSDBSTARTUP', 'SLEEP_TEMPDBSTARTUP', 'SNI_HTTP_ACCEPT', 'SP_SERVER_DIAGNOSTICS_SLEEP', 'SQLTRACE_BUFFER_FLUSH', 'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', 'XE_TIMER_EVENT', 'BROKER_RECEIVE_WAITFOR' ) ORDER BY wait_time_ms DESC;
The result came back in under a second. The top wait type, by an enormous margin: PAGELATCH_EX. Wait time: 847,293ms accumulated since the last SQL Server restart. Waiting tasks: 4,218. Average wait per task: 200ms. For a latch that should release in microseconds.
The Discovery — TempDB Allocation Pages
PAGELATCH_EX means a session is waiting for an exclusive latch on a data page. The next question is which pages. Latch contention on random data pages points to one problem. Latch contention specifically on TempDB allocation pages points to exactly one problem: too few TempDB data files for the level of concurrency.
SELECT t1.resource_type, t1.resource_description, t1.request_mode, t1.request_status, t2.wait_type, t2.wait_duration_ms, t2.session_id, t2.blocking_session_id, t3.text AS sql_text FROM sys.dm_os_waiting_tasks t2 JOIN sys.dm_exec_requests t4 ON t4.session_id = t2.session_id CROSS APPLY sys.dm_exec_sql_text(t4.sql_handle) t3 LEFT JOIN sys.dm_os_lock_stats t1 ON t1.resource_associated_entity_id = t2.resource_address WHERE t2.wait_type LIKE 'PAGELATCH%' ORDER BY t2.wait_duration_ms DESC;
The resource_description column told the story: every session was waiting on pages 1:1, 1:2, and 1:3 of database ID 2. Database ID 2 is TempDB. Pages 1, 2, and 3 in TempDB are the global allocation map (GAM), shared global allocation map (SGAM), and page free space (PFS) pages — the allocation bitmap pages that SQL Server updates every time it allocates a new extent or page in TempDB.
Twelve stored procedures, each creating temporary tables and table variables, all hammering the same three allocation pages through a single TempDB data file. Every allocation serialised behind an exclusive latch. That was the 14-second API response time.
SQL Server's allocation pages — GAM, SGAM, PFS — are spaced every 64,000 pages (approximately 500MB) within each data file. With one TempDB file, every concurrent object creation waits for an exclusive latch on the same page. With multiple files, allocations are distributed across files by a round-robin algorithm, so concurrent sessions compete on different pages rather than the same one.
Root Cause
This server was a default SQL Server installation from three years ago. Default installation creates TempDB with a single data file. The stored procedures that ran every weekday morning all created temporary tables — correct and appropriate use of TempDB. The problem was never the stored procedures. The problem was a single TempDB file configuration that is wrong for any server handling more than a handful of concurrent connections.
The batch job concurrency had been gradually increasing over two years as the business grew. For most of that time, the contention existed but was not severe enough to trigger application-level symptoms. Monday morning's batch — twelve concurrent jobs where previously there had been eight — crossed the threshold where contention became visible as a production incident.
| TempDB files before | TempDB files after | Peak PAGELATCH_EX wait | API p99 at peak batch |
|---|---|---|---|
| 1 | 4 | 200ms avg per task | < 2ms avg per task |
| Single allocation path | Round-robin across 4 files | 14s API p99 | 42ms API p99 |
Immediate Fix
Adding TempDB data files does not require a restart and takes effect immediately. The number of files to add depends on the number of logical CPU cores — the recommendation is one file per core up to eight cores, then eight files maximum beyond that. This server had four cores.
-- First, confirm where your current TempDB files live SELECT name, physical_name, size * 8 / 1024 AS size_mb, growth * 8 / 1024 AS growth_mb FROM tempdb.sys.database_files WHERE type_desc = 'ROWS'; -- Add three more data files, matching size and growth of the existing file -- Replace D:\TempDB\ with your actual TempDB path ALTER DATABASE tempdb ADD FILE ( NAME = N'tempdev2', FILENAME = N'D:\TempDB\tempdb2.mdf', SIZE = 8192MB, FILEGROWTH = 512MB ); ALTER DATABASE tempdb ADD FILE ( NAME = N'tempdev3', FILENAME = N'D:\TempDB\tempdb3.mdf', SIZE = 8192MB, FILEGROWTH = 512MB ); ALTER DATABASE tempdb ADD FILE ( NAME = N'tempdev4', FILENAME = N'D:\TempDB\tempdb4.mdf', SIZE = 8192MB, FILEGROWTH = 512MB ); -- Verify all four files are now present SELECT name, physical_name, size * 8 / 1024 AS size_mb, state_desc FROM tempdb.sys.database_files WHERE type_desc = 'ROWS';
The PAGELATCH_EX waits dropped to near zero within 60 seconds of adding the files. API p99 returned to 42ms. The batch jobs completed in 4 minutes instead of timing out.
SQL Server's proportional fill algorithm allocates to files based on free space ratio. If files are different sizes, the larger file receives more allocations — defeating the purpose of multiple files. Set all TempDB data files to the same initial size and the same autogrowth increment.
Prevention — Permanent Configuration
Two additional changes shipped the same day to prevent recurrence and improve future diagnosability.
-- SQL Server 2014 and earlier: enable trace flag 1118 globally -- SQL Server 2016 and later: this is the default behaviour, no trace flag needed DBCC TRACEON(1118, -1); -- Verify it is active DBCC TRACESTATUS(1118);
-- Run every 60 seconds. Alert if avg_wait_ms exceeds 5ms for PAGELATCH waits on TempDB. -- If it does, the file count needs revisiting as concurrency grows. SELECT wait_type, waiting_tasks_count, wait_time_ms, CAST( wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS DECIMAL(10, 2) ) AS avg_wait_ms, max_wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type IN ('PAGELATCH_EX', 'PAGELATCH_SH', 'PAGELATCH_UP') AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC;
SQL Server 2016 and later ship with multiple TempDB files configured by default during setup — the installer prompts for file count based on detected CPU cores. If you inherited a pre-2016 instance or skipped that setup screen, you are likely running a single file regardless of SQL Server version. Check tempdb.sys.database_files on every instance you manage. Single-file TempDB on a server with more than four concurrent connections is a misconfiguration, not a scaling problem.
Incident Timeline
| Time | Event |
|---|---|
| 09:14:00 | 12 batch stored procedures fire simultaneously |
| 09:14:22 | PAGELATCH_EX waits begin accumulating on TempDB pages 1, 2, 3 |
| 09:14:50 | OLTP application latency starts climbing — order API at 800ms p99 |
| 09:15:03 | Alert fires — API p99 exceeded 10 seconds |
| 09:15:40 | On-call engineer checks CPU, memory, blocking — all clear |
| 09:17:10 | sys.dm_os_wait_stats queried — PAGELATCH_EX identified as top wait |
| 09:19:45 | TempDB allocation pages confirmed as contention source |
| 09:22:00 | Three additional TempDB data files added online — no restart |
| 09:23:05 | PAGELATCH_EX waits drop to near zero. API p99 returns to 42ms |
| 09:27:00 | Batch jobs complete normally. Incident closed. |