Select your database above. This runbook covers the full path — from the first query to run when the alert fires, through finding the head-of-chain blocker, to safely terminating it and preventing recurrence.
Run this immediately. One query gives you the full picture: who's waiting, who's blocking them, how long both have been running, and what both queries are doing.
SELECT blocked.pid AS blocked_pid, blocked.usename AS blocked_user, NOW() - blocked.query_start AS blocked_duration, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.usename AS blocking_user, NOW() - blocking.query_start AS blocking_duration, blocking.query AS blocking_query, blocking.state AS blocking_state, blocking.wait_event_type, blocking.wait_event FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid)) WHERE CARDINALITY(pg_blocking_pids(blocked.pid)) > 0 ORDER BY blocked_duration DESC;
| blocking_state | Meaning | Urgency |
|---|---|---|
| active | Running a long query — will release when done | Medium — monitor progress |
| idle in transaction | Opened a transaction, then abandoned it | High — terminate now |
| idle in transaction (aborted) | Transaction failed but never rolled back | High — terminate now |
| idle | Finished but lock not released | High — terminate now |
In a lock chain — A blocks B, B blocks C, C blocks D — you only need to terminate A. Killing B or C requeues the wait. This query finds the session at the root.
WITH RECURSIVE lock_chain AS ( SELECT pid, pg_blocking_pids(pid) AS blockers, 0 AS depth FROM pg_stat_activity WHERE CARDINALITY(pg_blocking_pids(pid)) > 0 UNION ALL SELECT unnested, pg_blocking_pids(unnested), lc.depth + 1 FROM lock_chain lc, UNNEST(lc.blockers) unnested WHERE CARDINALITY(pg_blocking_pids(unnested)) > 0 ) SELECT DISTINCT ON (pid) pid, depth FROM lock_chain ORDER BY pid, depth DESC; -- Highest depth value = root blocker
Always try pg_cancel_backend() first. It sends a soft interrupt — cancels the current query but leaves the connection open. Only escalate to pg_terminate_backend() if cancellation doesn't clear the block, or if the state is idle in transaction.
Confirm the PID is not a replication sender (pg_stat_replication) or autovacuum launcher before terminating. Check application_name and backend_type. Killing either causes replication lag or table bloat.
-- Step 1: Verify what you're about to terminate SELECT pid, usename, application_name, backend_type, state, query FROM pg_stat_activity WHERE pid = 12441; -- replace with your blocking PID -- Step 2: Try cancel first (safer — doesn't kill the connection) SELECT pg_cancel_backend(12441); -- Wait 5–10 seconds. If still blocking, escalate: SELECT pg_terminate_backend(12441); -- Confirm the chain cleared SELECT COUNT(*) FROM pg_stat_activity WHERE CARDINALITY(pg_blocking_pids(pid)) > 0; -- Should return 0
Deadlocks auto-resolve in PostgreSQL — one transaction is rolled back by the system. But if deadlocks are recurring, you need the pattern. It's in the logs.
-- postgresql.conf — no overhead unless deadlocks actually occur log_lock_waits = on deadlock_timeout = 1s SELECT pg_reload_conf(); -- no restart needed -- Find tables currently involved in lock waits SELECT COALESCE(relname, '(transaction)') AS object, locktype, mode, COUNT(*) AS waiters FROM pg_locks l LEFT JOIN pg_class c ON c.oid = l.relation WHERE granted = FALSE GROUP BY 1,2,3 ORDER BY waiters DESC;
Most repeated lock incidents are caused by idle in transaction — application code that opens a transaction and then waits on something else. Set a hard timeout at the database level to automatically kill these before they cascade.
-- Kill idle-in-transaction sessions after 30 seconds ALTER ROLE app_user SET idle_in_transaction_session_timeout = '30s'; -- Backstop: kill any query running over 5 minutes ALTER ROLE app_user SET statement_timeout = '300s'; -- Apply immediately — no restart required SELECT pg_reload_conf(); SHOW idle_in_transaction_session_timeout;
Set idle_in_transaction_session_timeout per-role · Set statement_timeout per-role · Lock tables in consistent order across all code paths · Use SELECT FOR UPDATE SKIP LOCKED for queue patterns · Enable log_lock_waits permanently · Alert on pg_stat_activity where state = 'idle in transaction' and duration > 60s.
This PostgreSQL section is technically accurate for versions 9.6 through 17. The pg_blocking_pids() function was introduced in PostgreSQL 9.6 — on older versions, use a manual join on pg_locks via the relation OID approach instead.
On the recursive CTE for chain-walking: This works correctly. Be aware that on a system with hundreds of active locks, this CTE can be expensive. Run it quickly after confirming a chain exists, not as a polling query.
On idle_in_transaction_session_timeout: Setting this per-role rather than globally is the right call. If you set it globally, you risk killing legitimate long-running batch processes that belong to other roles. Scope it to your application role only.
One query across sys.dm_exec_requests, sys.dm_exec_sessions, and sys.dm_exec_sql_text. Gives you blockers, wait types, wait duration, and the actual query text.
SELECT r.session_id AS blocked_spid, r.blocking_session_id AS blocker_spid, r.wait_type, r.wait_time / 1000.0 AS wait_seconds, r.status, s.login_name, s.host_name, s.program_name, SUBSTRING(st.text, (r.statement_start_offset/2)+1, ((CASE r.statement_end_offset WHEN -1 THEN DATALENGTH(st.text) ELSE r.statement_end_offset END - r.statement_start_offset)/2)+1 ) AS current_statement FROM sys.dm_exec_requests r JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st WHERE r.blocking_session_id > 0 ORDER BY r.wait_time DESC;
SELECT s.session_id, s.login_name, s.host_name, s.program_name, s.status, DATEDIFF(second, s.last_request_start_time, GETDATE()) AS idle_seconds, tst.open_transaction_count, st.text AS last_query FROM sys.dm_exec_sessions s JOIN sys.dm_tran_session_transactions tst ON tst.session_id = s.session_id CROSS APPLY sys.dm_exec_sql_text(s.most_recent_sql_handle) st WHERE s.session_id IN ( SELECT DISTINCT blocking_session_id FROM sys.dm_exec_requests WHERE blocking_session_id > 0 ) AND s.session_id NOT IN ( SELECT session_id FROM sys.dm_exec_requests WHERE blocking_session_id > 0 );
If the blocker has open_transaction_count > 0 and has been doing real write work, killing it will roll everything back. Confirm with the application team whether that's acceptable before proceeding.
-- Replace 55 with your blocking SPID KILL 55; -- Confirm the chain cleared SELECT COUNT(*) AS still_blocked FROM sys.dm_exec_requests WHERE blocking_session_id > 0; -- Expect 0
SELECT xdr.value('(event/@timestamp)[1]', 'datetime2') AS deadlock_time, xdr.query('.') AS deadlock_graph FROM ( SELECT CAST(target_data AS XML) AS target_data FROM sys.dm_xe_session_targets t JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address WHERE s.name = 'system_health' AND t.target_name = 'ring_buffer' ) x CROSS APPLY target_data.nodes('//RingBufferTarget/event[@name="xml_deadlock_report"]') xdr(xdr) ORDER BY deadlock_time DESC;
-- Per-database lock timeout (milliseconds) -- Sessions abort after 30 seconds waiting for a lock ALTER DATABASE YourDB SET LOCK_TIMEOUT 30000; -- Enable RCSI: readers no longer block writers (and vice versa) -- Requires brief exclusive access — schedule during low traffic ALTER DATABASE YourDB SET READ_COMMITTED_SNAPSHOT ON; -- Verify SELECT name, is_read_committed_snapshot_on FROM sys.databases WHERE name = 'YourDB';
All SQL Server queries verified against SQL Server 2016, 2019, and 2022. The SUBSTRING logic for extracting the current statement from sys.dm_exec_requests is the standard pattern — it handles the statement_end_offset = -1 edge case (query is the last or only statement in the batch).
On RCSI: READ_COMMITTED_SNAPSHOT ON is one of the most impactful improvements you can make to SQL Server locking behaviour. The ALTER DATABASE command requires that no other connections are using the database at the moment it runs. In practice: put the database in single-user mode briefly, run the command, put it back. Test in a non-prod environment first — some applications rely on the old blocking behaviour and behave unexpectedly under RCSI.
SELECT r.trx_id AS waiting_trx, r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query, b.trx_id AS blocking_trx, b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query, TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_age_sec, b.trx_state AS blocker_state FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id ORDER BY blocking_age_sec DESC; -- MySQL 8.0+: use performance_schema.data_lock_waits instead -- (information_schema.innodb_lock_waits removed in 8.0)
-- Full process list to confirm before killing SHOW FULL PROCESSLIST; -- Kill using blocking_thread from the query above KILL 1234; -- Confirm it's gone SELECT trx_id, trx_state, trx_query FROM information_schema.innodb_trx WHERE trx_mysql_thread_id = 1234; -- Should return 0 rows
-- Shows full InnoDB status including LATEST DETECTED DEADLOCK -- Look for the section starting "LATEST DETECTED DEADLOCK" -- It shows: each transaction's query, the locks held, which was rolled back SHOW ENGINE INNODB STATUS\G -- MySQL 8.0+: structured deadlock data SELECT * FROM performance_schema.events_errors_summary_global_by_error WHERE error_name = 'ER_LOCK_DEADLOCK';
-- Global: kill transactions waiting over 30 seconds SET GLOBAL innodb_lock_wait_timeout = 30; -- Reduce gap lock exposure: READ COMMITTED avoids most gap locks -- Note: loses REPEATABLE READ guarantees — test your app first SET SESSION transaction_isolation = 'READ-COMMITTED'; -- Verify deadlock detection is on (should be default ON) SHOW VARIABLES LIKE 'innodb_deadlock_detect';
The MySQL 5.7/8.0 split on information_schema vs performance_schema is correctly flagged — this is a common source of confusion when running old runbooks on MySQL 8.0+ instances. The SHOW ENGINE INNODB STATUS approach works on all versions.
On the isolation level change: Switching from REPEATABLE READ to READ COMMITTED is a significant application-level decision. It eliminates most gap locks (and thus many deadlocks), but it changes the visibility semantics for long-running transactions. Never set this globally on a production system without thorough application testing first. The per-session approach shown is correct.
SELECT w.sid AS waiting_sid, w.serial# AS waiting_serial, w.username AS waiting_user, w.event AS wait_event, w.seconds_in_wait AS wait_seconds, h.sid AS blocking_sid, h.serial# AS blocking_serial, h.username AS blocking_user, h.status AS blocking_status, SUBSTR(sq.sql_text, 1, 200) AS blocking_sql FROM v$session w JOIN v$session h ON h.sid = w.blocking_session LEFT JOIN v$sql sq ON sq.sql_id = h.sql_id WHERE w.blocking_session IS NOT NULL ORDER BY w.seconds_in_wait DESC;
-- Use SID and SERIAL# from the query above -- IMMEDIATE bypasses the graceful wait ALTER SYSTEM KILL SESSION '142,3891' IMMEDIATE; -- Check: status = KILLED means rollback is in progress (normal) SELECT sid, serial#, status, username FROM v$session WHERE sid = 142; -- If status stays KILLED for over 5 minutes: -- The OS process may need termination at the OS level -- Find the OS PID: SELECT spid FROM v$process p JOIN v$session s ON p.addr = s.paddr WHERE s.sid = 142; -- Then: kill -9 {spid} on Linux
Oracle writes a trace file to the ADR directory when a deadlock occurs, containing the full lock graph. Find the most recent ORA-00060 trace file.
-- Find trace directory SELECT value FROM v$parameter WHERE name = 'background_dump_dest'; -- List ORA-00060 (deadlock) incidents, most recent first SELECT incident_id, create_time, error_number, error_msg FROM v$diag_incident WHERE error_number = 60 ORDER BY create_time DESC FETCH FIRST 10 ROWS ONLY; -- The trace file will be in background_dump_dest -- Named: ora_{spid}_{instance}.trc -- Search it for "Deadlock graph"
-- Create a profile to kill idle sessions automatically CREATE PROFILE app_profile LIMIT IDLE_TIME 30 -- minutes before idle session killed CONNECT_TIME 480 -- 8-hour max session lifetime SESSIONS_PER_USER 100; -- prevent connection flooding ALTER USER app_user PROFILE app_profile; -- Verify the profile is active SELECT username, profile, account_status FROM dba_users WHERE username = 'APP_USER';
The Oracle section is accurate for 11g through 21c. One important addition: if ALTER SYSTEM KILL SESSION returns immediately but the status stays as KILLED for an extended period, it means Oracle is waiting for the session to respond before rolling back its work. This is expected for sessions with large open transactions. You can force immediate cleanup using ALTER SYSTEM DISCONNECT SESSION 'sid,serial' POST_TRANSACTION instead.
On IDLE_TIME in profiles: The value is in minutes, not seconds. A value of 30 means sessions idle for 30 minutes will be marked for termination by PMON. PMON runs on a sweep cycle, so the actual kill may happen a few minutes after the threshold is reached.