Thursday, 14:22. Application monitoring flagged response times exceeding 30 seconds across 14 microservices. The database had 128GB RAM, 96GB allocated to SQL Server. Memory at 94% — normal for a warm buffer pool. But the wait stats told a completely different story.
The Alert
SELECT wait_type, waiting_tasks_count, wait_time_ms / 1000 AS wait_sec FROM sys.dm_os_wait_stats WHERE wait_type IN ('RESOURCE_SEMAPHORE', 'RESOURCE_SEMAPHORE_QUERY_COMPILE') ORDER BY wait_time_ms DESC;
RESOURCE_SEMAPHORE: 340 waiting tasks. We ran the same query again thirty seconds later — 347 tasks. Again sixty seconds after that — 353. The count was climbing steadily. Something was holding the entire memory grant pool hostage.
First Hypothesis: Buffer Pool Pressure
Maybe a scan-heavy query was evicting pages. We checked buffer pool health three times, 30 seconds apart.
SELECT object_name, counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name IN ( 'Page life expectancy', 'Buffer cache hit ratio', 'Memory Grants Pending', 'Memory Grants Outstanding' );
Buffer cache hit ratio: 99.7% — all three runs. Page life expectancy: 48,000 seconds — stable. Buffer pool was fine. But Memory Grants Pending: 340, 344, 351. Memory Grants Outstanding: 1. One session holding a grant so large nobody else could get memory. Hypothesis ruled out — buffer pool healthy, workspace pool was the chokepoint.
Second Hypothesis: Runaway Parallel Query
SELECT TOP 5 mg.session_id, mg.granted_memory_kb / 1024 AS granted_mb, mg.used_memory_kb / 1024 AS used_mb, mg.max_used_memory_kb / 1024 AS max_used_mb, mg.dop, r.total_elapsed_time / 1000 AS elapsed_sec, SUBSTRING(st.text, (r.statement_start_offset/2)+1, ((CASE r.statement_end_offset WHEN -1 THEN DATALENGTH(st.text) ELSE r.statement_end_offset END - r.statement_start_offset)/2)+1 ) AS current_stmt FROM sys.dm_exec_query_memory_grants mg JOIN sys.dm_exec_requests r ON r.session_id = mg.session_id CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st WHERE mg.grant_time IS NOT NULL ORDER BY mg.granted_memory_kb DESC;
SPID 412. Granted: 14,336 MB. Used: 3,200 MB. DOP: 1. Not parallel — serial. 14GB granted, only 3.2GB in use. The remaining 11GB locked in the grant, unavailable to everyone. We verified three times at 30-second intervals — granted_mb stayed at 14,336 each time, used_mb bounced between 3,100 and 3,300. Enormous and stable.
The query was a nightly reconciliation report rescheduled to 14:00 after a cron change the previous week. At night it had the whole server. During peak hours, it starved everything else.
The Discovery
SPID 412 had requested 14GB based on a cardinality estimate of 90 million rows. Actual data: 4.2 million rows. The optimizer estimated a hash join intermediate at 90 million because of a many-to-many join with no FK constraint and statistics 45 days stale. The query was spilling to TempDB despite using only 3.2GB of its 14GB grant — it could have finished with 2GB, but the grant system locked 14GB until query completion.
We verified three times: pulled the actual plan from sys.dm_exec_query_plan, checked estimated vs actual rows on the hash join operator, confirmed statistics last-updated date. Same results. 45-day-old stats, 21x overestimate, 14GB grant for a 3GB workload.
The mechanics of the damage deserve explanation. SQL Server's memory grant system works on a reservation model — when a query requests workspace memory for sorts, hashes, or other data-intensive operations, the engine reserves the full estimated amount at query start. That memory is subtracted from the workspace pool's available budget. Other queries needing grants must wait until enough budget is free. The grant is held for the full query lifetime regardless of actual consumption. In this case, 11GB of the 14GB grant sat completely unused — reserved but never touched — for 47 minutes while 340 sessions queued behind it. The workspace pool on this server was 24GB total. One query had locked 58% of it based on a wrong estimate.
We also investigated why the statistics were 45 days stale. Auto-update statistics was enabled, but the modification counter on the join column had not crossed the update threshold. The table had 180 million rows. SQL Server's auto-update threshold at that time was 500 + 20% of row count — meaning 36 million modifications were needed to trigger an update. The ETL had only modified 8 million rows over 45 days. The statistics were technically not "stale" by the engine's definition, but the data distribution had shifted enough to make the cardinality estimate wildly wrong.
Incident Timeline
| Time | Event |
|---|---|
| 14:00 | Reconciliation report starts — rescheduled from 02:00 |
| 14:00 | 14GB memory grant approved |
| 14:05 | Application queries start queueing on RESOURCE_SEMAPHORE |
| 14:22 | Alert: 340 sessions waiting, response times above 30s |
| 14:29 | KILL 412. Grant released. Queue drains in 11 seconds. |
Root Cause
Three conditions combined. A report rescheduled from overnight to peak hours. Statistics 45 days stale causing a 21x row overestimate. Resource Governor in default configuration — no cap on individual memory grants. The 14GB grant was the damage, not the query. The query only needed 3GB. Once granted, memory is reserved until completion regardless of usage.
The Fix
KILL 412; ALTER WORKLOAD GROUP [reporting] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 5); ALTER RESOURCE GOVERNOR RECONFIGURE; UPDATE STATISTICS dbo.transactions (IX_transactions_account_date) WITH FULLSCAN;
Prevention
Resource Governor now caps reporting queries at 5% of workspace memory. The report runs at 02:00 again. Statistics auto-update with ASYNC enabled, and a nightly job runs UPDATE STATISTICS WITH FULLSCAN on the ten highest-churn tables. Memory grants pending triggers a Grafana alert if it exceeds 10 for more than 60 seconds.
We also added a pre-execution gate. Before any report query runs, a lightweight estimation checks the largest join's expected intermediate rows. If the estimate exceeds 10 million, the report is flagged for manual review rather than submitted directly. The broader lesson: a query that runs safely at 02:00 with zero contention can be catastrophic at 14:00 with 340 concurrent sessions. Scheduling changes that move heavy queries to peak hours now require a memory impact assessment before approval.