TempDB Allocation Contention Stalled Our OLTP Workload

Monday morning. 09:14. Twelve batch jobs fired simultaneously as they do every weekday — pulling weekend transaction data into reporting tables using stored procedures that had run without issue for two years. Within 90 seconds, our OLTP application slowed to a crawl. Order entry screens froze. API response times went from 40ms to 14 seconds. The on-call alert fired at 09:15.

CPU was at 6%. Memory was healthy. No blocking chains. No long-running queries in sys.dm_exec_requests. The dashboards showed a system that looked completely idle — on a server that was visibly on fire.

The Alert

The monitoring alert was application-side: order API p99 latency exceeded 10 seconds. Not a database alert. The database metrics — CPU, memory, connection count, PLE — all looked normal. That mismatch was the first signal that the problem was not in any metric we were watching.

The diagnostic gap

Standard SQL Server monitoring covers CPU, memory, blocking, and long-running queries. It does not surface latch waits unless you specifically query sys.dm_os_wait_stats or sys.dm_os_latch_stats. This incident was invisible to our monitoring for exactly that reason.

First Look — Wait Stats

The right place to start when the system is slow but nothing looks obviously wrong is wait statistics. Not the monitoring dashboard — the raw DMV.

Current wait stats — what is the server actually waiting on right now

SQL Server

SELECT TOP 15
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    CAST(wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS DECIMAL(10,2)) AS avg_wait_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type NOT IN (
    'SLEEP_TASK', 'BROKER_TO_FLUSH', 'BROKER_EVENTHANDLER',
    'CHECKPOINT_QUEUE', 'DBMIRROR_EVENTS_QUEUE', 'SQLTRACE_BUFFER_FLUSH',
    'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_WAIT',
    'WAITFOR', 'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE',
    'REQUEST_FOR_DEADLOCK_SEARCH', 'RESOURCE_QUEUE', 'SERVER_IDLE_CHECK',
    'SLEEP_DBSTARTUP', 'SLEEP_DCOMSTARTUP', 'SLEEP_MASTERDBREADY',
    'SLEEP_MASTERMDREADY', 'SLEEP_MASTERUPGRADED', 'SLEEP_MSDBSTARTUP',
    'SLEEP_TEMPDBSTARTUP', 'SNI_HTTP_ACCEPT', 'SP_SERVER_DIAGNOSTICS_SLEEP',
    'SQLTRACE_BUFFER_FLUSH', 'WAIT_XTP_OFFLINE_CKPT_NEW_LOG',
    'XE_TIMER_EVENT', 'BROKER_RECEIVE_WAITFOR'
)
ORDER BY wait_time_ms DESC;

The result came back in under a second. The top wait type, by an enormous margin: PAGELATCH_EX. Wait time: 847,293ms accumulated since the last SQL Server restart. Waiting tasks: 4,218. Average wait per task: 200ms. For a latch that should release in microseconds.

The Discovery — TempDB Allocation Pages

PAGELATCH_EX means a session is waiting for an exclusive latch on a data page. The next question is which pages. Latch contention on random data pages points to one problem. Latch contention specifically on TempDB allocation pages points to exactly one problem: too few TempDB data files for the level of concurrency.

Identify which pages the contention is on — confirms TempDB allocation pages

SQL Server

SELECT
    t1.resource_type,
    t1.resource_description,
    t1.request_mode,
    t1.request_status,
    t2.wait_type,
    t2.wait_duration_ms,
    t2.session_id,
    t2.blocking_session_id,
    t3.text          AS sql_text
FROM       sys.dm_os_waiting_tasks   t2
JOIN       sys.dm_exec_requests      t4  ON  t4.session_id   = t2.session_id
CROSS APPLY sys.dm_exec_sql_text(t4.sql_handle) t3
LEFT JOIN  sys.dm_os_lock_stats      t1  ON  t1.resource_associated_entity_id = t2.resource_address
WHERE  t2.wait_type LIKE 'PAGELATCH%'
ORDER BY  t2.wait_duration_ms DESC;

The resource_description column told the story: every session was waiting on pages 1:1, 1:2, and 1:3 of database ID 2. Database ID 2 is TempDB. Pages 1, 2, and 3 in TempDB are the global allocation map (GAM), shared global allocation map (SGAM), and page free space (PFS) pages — the allocation bitmap pages that SQL Server updates every time it allocates a new extent or page in TempDB.

Twelve stored procedures, each creating temporary tables and table variables, all hammering the same three allocation pages through a single TempDB data file. Every allocation serialised behind an exclusive latch. That was the 14-second API response time.

🚨 Why a single TempDB file serialises all allocation

SQL Server's allocation pages — GAM, SGAM, PFS — are spaced every 64,000 pages (approximately 500MB) within each data file. With one TempDB file, every concurrent object creation waits for an exclusive latch on the same page. With multiple files, allocations are distributed across files by a round-robin algorithm, so concurrent sessions compete on different pages rather than the same one.

Root Cause

This server was a default SQL Server installation from three years ago. Default installation creates TempDB with a single data file. The stored procedures that ran every weekday morning all created temporary tables — correct and appropriate use of TempDB. The problem was never the stored procedures. The problem was a single TempDB file configuration that is wrong for any server handling more than a handful of concurrent connections.

The batch job concurrency had been gradually increasing over two years as the business grew. For most of that time, the contention existed but was not severe enough to trigger application-level symptoms. Monday morning's batch — twelve concurrent jobs where previously there had been eight — crossed the threshold where contention became visible as a production incident.

TempDB files before	TempDB files after	Peak PAGELATCH_EX wait	API p99 at peak batch
1	4	200ms avg per task	< 2ms avg per task
Single allocation path	Round-robin across 4 files	14s API p99	42ms API p99

Immediate Fix

Adding TempDB data files does not require a restart and takes effect immediately. The number of files to add depends on the number of logical CPU cores — the recommendation is one file per core up to eight cores, then eight files maximum beyond that. This server had four cores.

Add three TempDB data files — adjust paths to match your TempDB location

SQL Server

-- First, confirm where your current TempDB files live
SELECT
    name,
    physical_name,
    size * 8 / 1024  AS size_mb,
    growth * 8 / 1024 AS growth_mb
FROM   tempdb.sys.database_files
WHERE  type_desc = 'ROWS';

-- Add three more data files, matching size and growth of the existing file
-- Replace D:\TempDB\ with your actual TempDB path
ALTER DATABASE tempdb
ADD FILE (
    NAME     = N'tempdev2',
    FILENAME = N'D:\TempDB\tempdb2.mdf',
    SIZE     = 8192MB,
    FILEGROWTH = 512MB
);

ALTER DATABASE tempdb
ADD FILE (
    NAME     = N'tempdev3',
    FILENAME = N'D:\TempDB\tempdb3.mdf',
    SIZE     = 8192MB,
    FILEGROWTH = 512MB
);

ALTER DATABASE tempdb
ADD FILE (
    NAME     = N'tempdev4',
    FILENAME = N'D:\TempDB\tempdb4.mdf',
    SIZE     = 8192MB,
    FILEGROWTH = 512MB
);

-- Verify all four files are now present
SELECT
    name,
    physical_name,
    size * 8 / 1024 AS size_mb,
    state_desc
FROM   tempdb.sys.database_files
WHERE  type_desc = 'ROWS';

The PAGELATCH_EX waits dropped to near zero within 60 seconds of adding the files. API p99 returned to 42ms. The batch jobs completed in 4 minutes instead of timing out.

Size all TempDB files equally

SQL Server's proportional fill algorithm allocates to files based on free space ratio. If files are different sizes, the larger file receives more allocations — defeating the purpose of multiple files. Set all TempDB data files to the same initial size and the same autogrowth increment.

Prevention — Permanent Configuration

Two additional changes shipped the same day to prevent recurrence and improve future diagnosability.

Enable trace flag 1118 — forces uniform extent allocation, reduces SGAM contention

SQL Server

-- SQL Server 2014 and earlier: enable trace flag 1118 globally
-- SQL Server 2016 and later: this is the default behaviour, no trace flag needed
DBCC TRACEON(1118, -1);

-- Verify it is active
DBCC TRACESTATUS(1118);

Monitoring query — add to your alerting stack, fires when PAGELATCH contention returns

SQL Server

-- Run every 60 seconds. Alert if avg_wait_ms exceeds 5ms for PAGELATCH waits on TempDB.
-- If it does, the file count needs revisiting as concurrency grows.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    CAST(
        wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0)
        AS DECIMAL(10, 2)
    )                    AS avg_wait_ms,
    max_wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type IN ('PAGELATCH_EX', 'PAGELATCH_SH', 'PAGELATCH_UP')
  AND  waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;

On TempDB file count and SQL Server 2016+

SQL Server 2016 and later ship with multiple TempDB files configured by default during setup — the installer prompts for file count based on detected CPU cores. If you inherited a pre-2016 instance or skipped that setup screen, you are likely running a single file regardless of SQL Server version. Check tempdb.sys.database_files on every instance you manage. Single-file TempDB on a server with more than four concurrent connections is a misconfiguration, not a scaling problem.

Incident Timeline

Time	Event
09:14:00	12 batch stored procedures fire simultaneously
09:14:22	PAGELATCH_EX waits begin accumulating on TempDB pages 1, 2, 3
09:14:50	OLTP application latency starts climbing — order API at 800ms p99
09:15:03	Alert fires — API p99 exceeded 10 seconds
09:15:40	On-call engineer checks CPU, memory, blocking — all clear
09:17:10	`sys.dm_os_wait_stats` queried — PAGELATCH_EX identified as top wait
09:19:45	TempDB allocation pages confirmed as contention source
09:22:00	Three additional TempDB data files added online — no restart
09:23:05	PAGELATCH_EX waits drop to near zero. API p99 returns to 42ms
09:27:00	Batch jobs complete normally. Incident closed.

James Kowalski

Senior DBA · SQL Server · 12 years · QueryTuning

James has spent 12 years managing SQL Server for financial services and e-commerce companies. He specialises in blocking analysis, plan cache investigation, and the kind of incidents that are obvious in retrospect but invisible during the outage.

TempDB Allocation Contention Stalled Our Entire OLTP Workload — One Configuration Line Fixed It

The Alert

First Look — Wait Stats

The Discovery — TempDB Allocation Pages

Root Cause

Immediate Fix

Prevention — Permanent Configuration

Incident Timeline

The Same Problem on Other Engines