For three days, the afternoon batch run slowed to a crawl. Every day at 14:00 when the reconciliation process kicked in, 200+ sessions piled up behind a blocker. By 14:30 the blocking cleared on its own as sessions timed out and retried. Nobody could find the root cause because by the time anyone investigated, the chain had dissolved.
The Alert
On day four, we set up a monitoring loop to capture the blocking chain the moment it formed.
SELECT r.session_id AS blocked_spid, r.blocking_session_id AS blocker_spid, r.wait_type, r.wait_time / 1000 AS wait_sec, s.login_name, s.host_name, s.program_name, s.status, DATEDIFF(DAY, s.login_time, GETDATE()) AS session_age_days FROM sys.dm_exec_requests r JOIN sys.dm_exec_sessions s ON s.session_id = r.blocking_session_id WHERE r.blocking_session_id > 0 ORDER BY r.wait_time DESC;
Caught it. Every blocked session pointed to SPID 87 as the root blocker. SPID 87's status: sleeping. Session age: 11 days. Login: a service account for the linked-server integration. We ran the query three more times over 60 seconds: SPID 87 was always the root, always sleeping, always 11 days old. This session had been sitting there since the previous Tuesday.
First Hypothesis: Idle Session with Open Transaction
A sleeping session blocking others usually means an uncommitted transaction. We checked.
DBCC OPENTRAN; SELECT s.session_id, s.status, s.login_time, t.transaction_id, t.name AS txn_name, t.transaction_begin_time, DATEDIFF(DAY, t.transaction_begin_time, GETDATE()) AS txn_age_days, dt.database_transaction_log_bytes_reserved FROM sys.dm_tran_session_transactions st JOIN sys.dm_exec_sessions s ON s.session_id = st.session_id JOIN sys.dm_tran_active_transactions t ON t.transaction_id = st.transaction_id LEFT JOIN sys.dm_tran_database_transactions dt ON dt.transaction_id = t.transaction_id WHERE s.session_id = 87;
Transaction name: user_transaction. Transaction begin time: 11 days ago. Log bytes reserved: 84KB. The transaction had started 11 days ago, done almost nothing, and never committed. We verified three times — same transaction, same timestamp, same 11-day age. But then we saw the transaction type: it was enlisted in a distributed transaction. This was a DTC (Distributed Transaction Coordinator) transaction.
The Discovery
The linked-server integration ran a distributed query every night that pulled data from a remote Oracle instance. Eleven days ago, the Oracle side had thrown an ORA-12571 (TNS:packet writer failure) mid-transaction. The remote side rolled back. But the local SQL Server side never received the abort notification — the DTC transaction remained open, holding an exclusive IX lock on the target table. The lock escalated to a table-level IX lock because the original query had touched more than 5,000 rows.
We confirmed by checking sys.dm_tran_active_transactions — the transaction type was 2 (distributed). We also checked the DTC admin console, which showed the transaction in "In Doubt" state. Verified all three findings three times: distributed type, In Doubt status, 11-day age.
Incident Timeline
| Time | Event |
|---|---|
| Day 1, 02:00 | Linked server query starts distributed transaction |
| Day 1, 02:03 | Oracle side throws ORA-12571. Remote rollback. |
| Day 1, 02:03 | SQL Server DTC transaction enters "In Doubt" state. Lock held. |
| Day 2–4 | Afternoon blocking appears and self-resolves via timeouts |
| Day 4, 14:05 | Monitoring captures SPID 87 as root blocker. 11 days old. |
| Day 4, 14:10 | KILL 87. DTC transaction resolved. Blocking clears. |
Root Cause
A network failure between SQL Server and a linked Oracle instance left a distributed transaction in "In Doubt" state. The DTC protocol requires both sides to agree on commit or rollback — when the Oracle side became unreachable, the local transaction stayed open indefinitely. SQL Server's default behavior is to hold locks on In Doubt transactions until manual resolution, because committing or rolling back unilaterally could cause data inconsistency.
The Fix
-- Kill the orphaned session KILL 87; -- Configure DTC to auto-resolve after timeout -- In Component Services > DTC > Properties > Transaction Manager: -- Set "Transaction Timeout" to 300 seconds (5 minutes) -- Monitor for In Doubt transactions SELECT transaction_id, name, transaction_begin_time, DATEDIFF(HOUR, transaction_begin_time, GETDATE()) AS age_hours FROM sys.dm_tran_active_transactions WHERE transaction_type = 2 AND DATEDIFF(HOUR, transaction_begin_time, GETDATE()) > 1;
Prevention
DTC transaction timeout is now set to 300 seconds. A nightly monitoring job queries sys.dm_tran_active_transactions for distributed transactions older than 1 hour and alerts immediately. The linked-server integration now uses explicit TRY/CATCH with a rollback on failure, and the linked server timeout is set to 60 seconds rather than the default of 0 (infinite).
We also removed the distributed transaction entirely. The integration now uses a message queue pattern: the SQL Server side writes to a staging table, a service reads from the staging table and writes to Oracle separately, and a reconciliation check verifies both sides match. This eliminates the two-phase commit dependency and makes the integration tolerant of Oracle outages. If Oracle is unreachable, the staging table queues rows until connectivity returns — no orphaned transactions, no locks held across an unreliable network link.
The 11-day lock went undetected because the blocking only manifested during the afternoon batch — the rest of the day, queries that needed the locked table happened to use different partitions or took optimistic read paths. Intermittent, time-of-day-dependent blocking is the hardest pattern to diagnose because it self-resolves before anyone looks. The monitoring loop we deployed — capturing blocking chains every 30 seconds into a persistent table — would have caught this on day one instead of day four. A nightly monitoring job queries sys.dm_tran_active_transactions for distributed transactions older than 1 hour and alerts immediately. The linked-server integration now uses explicit TRY/CATCH with a rollback on failure, and the linked server timeout is set to 60 seconds rather than the default of 0 (infinite).