Saturday, 11:40. Every query on the production database was running at 3–10x normal duration. Read latency: 200ms across every data file. The storage array was at saturation. But no large queries were running. Active session count was normal. CPU at 8%.
The operations team had completed a scheduled data purge at 11:00 — deleting 90 million rows from a transaction history table. The delete finished cleanly. Forty minutes later, the I/O storm began.
The Alert
Monitoring showed sustained 15,000 IOPS — the exact rated maximum of the storage array — but sys.dm_exec_requests showed only 40 active sessions, none doing anything heavy. The I/O was coming from somewhere inside the engine, not from user queries.
First Hypothesis: Checkpoint Flooding
After a mass delete, dirty pages need flushing. We assumed checkpoint was the source. Checked three times, thirty seconds apart.
SELECT r.session_id, r.command, r.status, r.wait_type, r.reads, r.writes FROM sys.dm_exec_requests r WHERE r.command LIKE '%CHECKPOINT%' OR r.command LIKE '%GHOST%' OR r.command LIKE '%LAZY%';
No checkpoint running. No lazy writer. But one entry: command = GHOST CLEANUP. Reads climbing. We ran it again — ghost cleanup still there, reads up by 400,000. Third check: same process, reads up another 400,000. The ghost cleanup task was consuming every available IOP.
The Discovery
When SQL Server deletes rows from an index, it marks them as ghost records rather than immediately reclaiming page space. A background task — ghost cleanup — revisits those pages later to physically remove the records. On 90 million deleted rows across a clustered index with 7 nonclustered indexes, the ghost count was staggering.
SELECT OBJECT_NAME(ips.object_id) AS table_name, i.name AS index_name, SUM(ips.ghost_record_count) AS ghost_records, SUM(ips.page_count) AS total_pages FROM sys.dm_db_index_physical_stats( DB_ID(), OBJECT_ID('dbo.txn_history'), NULL, NULL, 'DETAILED' ) ips JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id GROUP BY ips.object_id, i.name ORDER BY ghost_records DESC;
To understand the scale: ghost cleanup is a single-threaded background process that walks each index's B-tree, finds pages containing ghost records, acquires a brief latch on each page, removes the ghost entries, and deallocates empty pages. On a well-indexed table with 7 nonclustered indexes, every row deletion creates one ghost per index — so 90 million deletes produced 90 million × 8 = 720 million ghost records initially. By the time we checked, cleanup had already processed 77 million, leaving 643 million.
The I/O pattern was particularly destructive. Ghost cleanup does not read pages sequentially — it walks the B-tree in logical order, which maps to random physical I/O on disk. Random reads on spinning disks are limited by seek time; on SSDs, they are limited by queue depth. Either way, ghost cleanup's pattern saturates the device's random IOPS capacity long before bandwidth is exhausted. Our storage showed 15,000 IOPS at only 180 MB/s — well below its sequential throughput of 2 GB/s.
We ran the count three times, five minutes apart. First: 643M. Second: 598M. Third: 551M. Ghost cleanup was working — removing 45 million every five minutes — but at the cost of saturating the storage array.
Incident Timeline
| Time | Event |
|---|---|
| 10:00 | Scheduled purge begins — DELETE WHERE created_date < '2023-01-01' |
| 11:02 | Delete completes. 90M rows removed. |
| 11:05 | Ghost cleanup activates. Begins visiting pages. |
| 11:40 | Read latency alert: 200ms across all data files |
| 12:00 | Ghost cleanup disabled via TF 661. Latency drops to 4ms in 90 seconds. |
| 12:30 | Ghost cleanup re-enabled. Completes over 6 hours at low priority. |
Root Cause
Mass delete of 90M rows created 643M ghost records across 8 indexes. Ghost cleanup is single-threaded and unthrottled by default. It performed random I/O across all index B-trees simultaneously, consuming the storage array's full 15,000 random IOPS. Every user query was starved of I/O.
The Fix
-- Disable ghost cleanup immediately DBCC TRACEON(661, -1); -- Verify latency dropping SELECT DB_NAME(database_id) AS db, io_stall_read_ms / NULLIF(num_of_reads, 0) AS avg_read_ms FROM sys.dm_io_virtual_file_stats(NULL, NULL); -- Re-enable during off-peak DBCC TRACEOFF(661, -1);
Prevention
Mass deletes now batched: 500,000 rows per transaction with a 10-second WAITFOR between batches. This gives ghost cleanup time to process each batch before the next arrives, spreading IOPS over hours instead of spiking in one burst. The batch approach also keeps the transaction log under control — a single 90-million-row delete generates a massive log record that can fill the transaction log; batches of 500,000 each generate manageable log entries that get truncated between batches.
Purges run in the 02:00–05:00 window. A monitoring query checks ghost record counts after every purge and alerts if the total exceeds 50 million. We also added a pre-purge checklist: verify the maintenance window is long enough, confirm no other I/O-heavy jobs are scheduled, and ensure the storage array has at least 40% IOPS headroom before starting the delete. with a 10-second WAITFOR between batches. This gives ghost cleanup time to process each batch, spreading IOPS over hours. Purges run in the 02:00–05:00 window. A monitoring query checks ghost record counts after every purge and alerts if the total exceeds 50 million.