58 Seconds. That's How Long Our Checkout Was

Black Friday. 11:23 AM. Checkout API p99 jumped from 180ms to over 90 seconds. Orders queuing. The engineering channel lit up. Forty-two alerts in under two minutes.

At 11:24 AM — 58 seconds after the first alert — the blocking cleared on its own. Performance returned to baseline instantly. The developer had closed his SSMS tab.

The Alert

Monitoring system fired at 11:23:05. Threshold: checkout p99 > 10 seconds. By the time the on-call engineer had the first query open, p99 was sitting at 92 seconds and 847 checkout requests were queued. The database CPU was flat at 4%. Memory was fine. Connections were within normal range. Nothing looked obviously wrong at the infrastructure level.

What the alert showed

checkout_api p99_latency_ms = 92,400 — every checkout request was waiting. Not failing. Waiting. That distinction mattered for the diagnosis.

First Hypothesis: Application Deployment

A deployment had gone out at 10:45 AM — 38 minutes before the incident. First assumption: the deployment had introduced a slow query. We pulled the slow query log on the primary. Nothing. No queries over 500ms in the last 10 minutes. We checked the deployment diff. No database schema changes. No stored procedure modifications.

That ruled out the deployment. The problem was not in the queries. The queries were not running — they were waiting.

Second Hypothesis: Lock Contention

The wait metric showed LCK_M_X — exclusive lock waits — across every queued session. Something was holding a lock on the orders table that 847 application threads needed to update. We ran the blocking chain query.

The query that found the blocker in under 10 seconds

SQL Server

SELECT
  r.session_id             AS blocked_spid,
  r.blocking_session_id    AS blocker_spid,
  r.wait_type,
  r.wait_time / 1000.0     AS wait_seconds,
  s.login_name,
  s.host_name,
  s.program_name,
  s.status                 AS blocker_status,
  DATEDIFF(second, s.last_request_start_time, GETDATE()) AS blocker_idle_seconds,
  SUBSTRING(st.text, 1, 200) AS blocker_last_sql
FROM       sys.dm_exec_requests   r
JOIN       sys.dm_exec_sessions   s  ON s.session_id = r.blocking_session_id
CROSS APPLY sys.dm_exec_sql_text(s.most_recent_sql_handle) st
WHERE  r.blocking_session_id > 0
ORDER BY r.wait_time DESC;

The Discovery

The result came back in milliseconds. One row dominated: SPID 73. Status: sleeping. Idle for 37 seconds. Login: a developer's Windows account. Host: a developer workstation. Program: Microsoft SQL Server Management Studio. Last SQL: SELECT * FROM orders WHERE order_status = 'pending'.

A developer had opened SSMS directly on production to check something, run a SELECT on the orders table, and then — based on the timestamp — had been distracted by a Slack message.

SSMS wraps every query in an implicit transaction. The SELECT acquired shared locks on the orders table pages. The developer walked away. For 37 seconds before we found it, every UPDATE from the checkout flow was queuing behind those shared locks.

Incident Timeline

Time	Event
11:22:47	Developer opens SSMS on production, runs SELECT on orders table
11:22:47	Implicit transaction begins — shared locks held on orders pages
11:22:49	First checkout UPDATE queues — cannot acquire exclusive lock
11:23:00	p99 exceeds alert threshold. Monitoring fires.
11:23:05	On-call alert sent
11:23:18	On-call engineer opens blocking chain query
11:23:22	SPID 73 identified — developer SSMS session, idle 37 seconds
11:23:48	Developer notices SSMS tab, closes it — transaction auto-rolls back
11:23:49	All 847 queued checkouts release simultaneously
11:24:00	p99 returns to 180ms. Total impact: 58 seconds.

Root Cause

Two independent failures combined. First: SSMS implicit transaction mode — every query window holds an open transaction until explicitly committed or the window is closed. A SELECT that touches a table holds shared locks for the entire duration. Second: no idle-in-transaction timeout on the production database. A session could hold locks indefinitely with no automatic cleanup.

Neither condition alone causes an outage. The deployment-day developer access combined with no timeout protection produced the incident.

🚨 The actual root cause is access policy, not SSMS behaviour

SSMS implicit transactions are documented behaviour. The gap was allowing direct developer access to a production database with write-capable credentials during peak traffic. The timeout is the safety net. The access policy is the real control.

What We Did Immediately

Kill the blocking session

SQL Server

-- Confirm what you are killing first
SELECT session_id, login_name, host_name, program_name,
       status, open_transaction_count
FROM   sys.dm_exec_sessions
WHERE  session_id = 73;

-- Kill it — rolls back the open transaction and releases all locks
KILL 73;

-- Verify the chain cleared
SELECT COUNT(*) AS still_blocked
FROM   sys.dm_exec_requests
WHERE  blocking_session_id > 0;

What We Changed

Two changes, shipped the same afternoon.

1 — Alert on idle sessions holding locks

SQL Server

-- Add to monitoring: fires if any session is sleeping with an open transaction
-- for more than 30 seconds. Runs every 30 seconds.
SELECT
  s.session_id,
  s.login_name,
  s.host_name,
  s.program_name,
  DATEDIFF(second, s.last_request_start_time, GETDATE()) AS idle_seconds,
  tst.open_transaction_count
FROM   sys.dm_exec_sessions             s
JOIN   sys.dm_tran_session_transactions  tst ON tst.session_id = s.session_id
WHERE  s.status      = 'sleeping'
  AND  tst.open_transaction_count > 0
  AND  DATEDIFF(second, s.last_request_start_time, GETDATE()) > 30;
-- Any row returned at 30 seconds = alert. At 60 seconds = page on-call engineer.

2 — Set LOCK_TIMEOUT so waiting queries fail fast instead of queuing forever

SQL Server

-- At the application connection level (set on open, before first query)
SET LOCK_TIMEOUT 5000;  -- 5 seconds. After this, throw error 1205 and let app retry.

-- The application must handle error 1205 and retry with exponential backoff.
-- But: 5 seconds of failed retries is better than 60 seconds of silent queuing.

The third change — restricting direct SSMS access to production during peak hours — took three weeks to get approved. It was the most important one. The two technical controls are the safety net around a gap that should not exist.

📝 On LOCK_TIMEOUT vs killing sessions

LOCK_TIMEOUT makes statements fail fast when they cannot acquire a lock within the timeout window. It does not prevent the blocker from holding its lock indefinitely — it just stops the victims from queuing silently. The victim sessions fail with error 1205 and the application must retry. This is correct behaviour: a failed transaction the application knows about is better than a silent queue that grows until the service times out. Ensure your application handles 1205 before enabling this in production.

James Kowalski

Senior DBA · SQL Server · 12 years · QueryTuning

James has spent 12 years managing SQL Server for financial services and e-commerce companies. He specialises in blocking analysis, plan cache investigation, and the kind of incidents that are obvious in retrospect but invisible during the outage.

58 Seconds. That's How Long Our Checkout Was Completely Frozen on Black Friday

The Alert

First Hypothesis: Application Deployment

Second Hypothesis: Lock Contention

The Discovery

Incident Timeline

Root Cause

What We Did Immediately

What We Changed

The Same Problem on Other Engines