Black Friday. 11:23 AM. Checkout API p99 jumped from 180ms to over 90 seconds. Orders queuing. The engineering channel lit up. Forty-two alerts in under two minutes.
At 11:24 AM — 58 seconds after the first alert — the blocking cleared on its own. Performance returned to baseline instantly. The developer had closed his SSMS tab.
The Alert
Monitoring system fired at 11:23:05. Threshold: checkout p99 > 10 seconds. By the time the on-call engineer had the first query open, p99 was sitting at 92 seconds and 847 checkout requests were queued. The database CPU was flat at 4%. Memory was fine. Connections were within normal range. Nothing looked obviously wrong at the infrastructure level.
checkout_api p99_latency_ms = 92,400 — every checkout request was waiting. Not failing. Waiting. That distinction mattered for the diagnosis.
First Hypothesis: Application Deployment
A deployment had gone out at 10:45 AM — 38 minutes before the incident. First assumption: the deployment had introduced a slow query. We pulled the slow query log on the primary. Nothing. No queries over 500ms in the last 10 minutes. We checked the deployment diff. No database schema changes. No stored procedure modifications.
That ruled out the deployment. The problem was not in the queries. The queries were not running — they were waiting.
Second Hypothesis: Lock Contention
The wait metric showed LCK_M_X — exclusive lock waits — across every queued session. Something was holding a lock on the orders table that 847 application threads needed to update. We ran the blocking chain query.
SELECT r.session_id AS blocked_spid, r.blocking_session_id AS blocker_spid, r.wait_type, r.wait_time / 1000.0 AS wait_seconds, s.login_name, s.host_name, s.program_name, s.status AS blocker_status, DATEDIFF(second, s.last_request_start_time, GETDATE()) AS blocker_idle_seconds, SUBSTRING(st.text, 1, 200) AS blocker_last_sql FROM sys.dm_exec_requests r JOIN sys.dm_exec_sessions s ON s.session_id = r.blocking_session_id CROSS APPLY sys.dm_exec_sql_text(s.most_recent_sql_handle) st WHERE r.blocking_session_id > 0 ORDER BY r.wait_time DESC;
The Discovery
The result came back in milliseconds. One row dominated: SPID 73. Status: sleeping. Idle for 37 seconds. Login: a developer's Windows account. Host: a developer workstation. Program: Microsoft SQL Server Management Studio. Last SQL: SELECT * FROM orders WHERE order_status = 'pending'.
A developer had opened SSMS directly on production to check something, run a SELECT on the orders table, and then — based on the timestamp — had been distracted by a Slack message.
SSMS wraps every query in an implicit transaction. The SELECT acquired shared locks on the orders table pages. The developer walked away. For 37 seconds before we found it, every UPDATE from the checkout flow was queuing behind those shared locks.
Incident Timeline
| Time | Event |
|---|---|
| 11:22:47 | Developer opens SSMS on production, runs SELECT on orders table |
| 11:22:47 | Implicit transaction begins — shared locks held on orders pages |
| 11:22:49 | First checkout UPDATE queues — cannot acquire exclusive lock |
| 11:23:00 | p99 exceeds alert threshold. Monitoring fires. |
| 11:23:05 | On-call alert sent |
| 11:23:18 | On-call engineer opens blocking chain query |
| 11:23:22 | SPID 73 identified — developer SSMS session, idle 37 seconds |
| 11:23:48 | Developer notices SSMS tab, closes it — transaction auto-rolls back |
| 11:23:49 | All 847 queued checkouts release simultaneously |
| 11:24:00 | p99 returns to 180ms. Total impact: 58 seconds. |
Root Cause
Two independent failures combined. First: SSMS implicit transaction mode — every query window holds an open transaction until explicitly committed or the window is closed. A SELECT that touches a table holds shared locks for the entire duration. Second: no idle-in-transaction timeout on the production database. A session could hold locks indefinitely with no automatic cleanup.
Neither condition alone causes an outage. The deployment-day developer access combined with no timeout protection produced the incident.
SSMS implicit transactions are documented behaviour. The gap was allowing direct developer access to a production database with write-capable credentials during peak traffic. The timeout is the safety net. The access policy is the real control.
What We Did Immediately
-- Confirm what you are killing first SELECT session_id, login_name, host_name, program_name, status, open_transaction_count FROM sys.dm_exec_sessions WHERE session_id = 73; -- Kill it — rolls back the open transaction and releases all locks KILL 73; -- Verify the chain cleared SELECT COUNT(*) AS still_blocked FROM sys.dm_exec_requests WHERE blocking_session_id > 0;
What We Changed
Two changes, shipped the same afternoon.
-- Add to monitoring: fires if any session is sleeping with an open transaction -- for more than 30 seconds. Runs every 30 seconds. SELECT s.session_id, s.login_name, s.host_name, s.program_name, DATEDIFF(second, s.last_request_start_time, GETDATE()) AS idle_seconds, tst.open_transaction_count FROM sys.dm_exec_sessions s JOIN sys.dm_tran_session_transactions tst ON tst.session_id = s.session_id WHERE s.status = 'sleeping' AND tst.open_transaction_count > 0 AND DATEDIFF(second, s.last_request_start_time, GETDATE()) > 30; -- Any row returned at 30 seconds = alert. At 60 seconds = page on-call engineer.
-- At the application connection level (set on open, before first query) SET LOCK_TIMEOUT 5000; -- 5 seconds. After this, throw error 1205 and let app retry. -- The application must handle error 1205 and retry with exponential backoff. -- But: 5 seconds of failed retries is better than 60 seconds of silent queuing.
The third change — restricting direct SSMS access to production during peak hours — took three weeks to get approved. It was the most important one. The two technical controls are the safety net around a gap that should not exist.
LOCK_TIMEOUT makes statements fail fast when they cannot acquire a lock within the timeout window. It does not prevent the blocker from holding its lock indefinitely — it just stops the victims from queuing silently. The victim sessions fail with error 1205 and the application must retry. This is correct behaviour: a failed transaction the application knows about is better than a silent queue that grows until the service times out. Ensure your application handles 1205 before enabling this in production.