Monday after a weekend upgrade from SQL Server 2014 to 2016. The application team reported "periodic freezes" — every 30 seconds, every query on the server stalled for exactly 10 seconds, then resumed. CPU graph showed sharp square-wave dips. No errors in the SQL error log. No blocking. No deadlocks. The pattern was metronome-precise.
The Alert
We captured wait stats during one of the 10-second freezes using a loop that sampled every second.
WHILE 1 = 1 BEGIN INSERT INTO #wait_capture SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ('SLEEP_TASK','WAITFOR','LAZYWRITER_SLEEP'); WAITFOR DELAY '00:00:01'; END;
During each freeze: WRITELOG waits spiked, PAGEIOLATCH_UP appeared, and all worker threads paused simultaneously. We captured three consecutive freeze cycles — all identical. WRITELOG spike at exactly 30-second intervals, lasting exactly 10 seconds. The precision ruled out application-level causes.
First Hypothesis: Disk I/O Bottleneck
WRITELOG suggested the log drive was stalling. We checked storage latency three times during freeze and non-freeze windows.
SELECT DB_NAME(vfs.database_id) AS db_name, mf.physical_name, vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms, vfs.num_of_writes FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs JOIN sys.master_files mf ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id WHERE mf.type_desc = 'LOG';
During non-freeze windows: write latency 0.4ms — excellent. During freeze windows: write latency jumped to 8,200ms. But the storage array showed no corresponding I/O spike. The writes were being held inside SQL Server, not delayed by the storage. We verified this three times: perfmon showed disk idle during the freeze, while SQL Server reported multi-second write stalls. The stall was inside the engine.
Second Hypothesis: Log Flush Contention
We suspected VLF fragmentation or auto-growth on the transaction log. Checked VLF count and auto-growth events three times.
VLF count: 12 — healthy. No auto-growth events in the past month. Log file pre-sized at 40GB with 8GB increments. This was not a log management issue.
The Discovery
We noticed the TARGET_RECOVERY_TIME on the primary database was set to 60 seconds — the new default in SQL Server 2016. In SQL Server 2014, the default was 0 (conventional checkpoint). SQL Server 2016 changed the default to 60 seconds, enabling indirect checkpoint automatically on upgrade.
SELECT name, target_recovery_time_in_seconds, log_reuse_wait_desc FROM sys.databases WHERE name = 'ProdDB'; -- Check checkpoint events in ring buffer SELECT TOP 20 record.value('(Record/@time)[1]', 'bigint') AS time_ms, record.value('(Record/Scheduler/@SchedulerID)[1]', 'int') AS scheduler FROM ( SELECT CAST(record AS XML) AS record FROM sys.dm_os_ring_buffers WHERE ring_buffer_type = 'RING_BUFFER_CHECKPOINT' ) x;
TARGET_RECOVERY_TIME: 60. We verified three times. This was a known SQL Server 2016 bug — the indirect checkpoint implementation had a defect where, under high write workloads, the checkpoint process would stall all log flushing while it computed dirty page targets. Microsoft documented this in KB4040276.
Incident Timeline
| Time | Event |
|---|---|
| Sat 22:00 | SQL Server upgraded from 2014 to 2016 |
| Mon 08:00 | Application team reports periodic 10-second freezes |
| Mon 09:30 | Square-wave pattern confirmed — 30-second cycle |
| Mon 11:00 | Disk ruled out — storage idle during freezes |
| Mon 13:00 | Indirect checkpoint identified as root cause |
| Mon 13:05 | TARGET_RECOVERY_TIME set to 0. Freezes stop immediately. |
Root Cause — Product Bug
This was a confirmed product bug in SQL Server 2016 RTM and early CUs — KB4040276. The indirect checkpoint algorithm, when computing dirty page flush targets on databases with high write throughput, acquired an internal latch that blocked all concurrent log flushes for the duration of its calculation. The calculation took up to 10 seconds on databases with large buffer pools. Microsoft fixed this in CU6 for SQL Server 2016 SP1.
The upgrade from 2014 to 2016 silently changed TARGET_RECOVERY_TIME from 0 to 60, activating the buggy code path. No entry in the error log. No warning during upgrade. The bug only manifested under sustained write workloads — exactly what this OLTP database ran 24/7.
The Fix
-- Immediate: revert to conventional checkpoint ALTER DATABASE ProdDB SET TARGET_RECOVERY_TIME = 0 SECONDS; -- Permanent: apply CU6+ which fixes the latch contention -- After patching, re-enable indirect checkpoint: ALTER DATABASE ProdDB SET TARGET_RECOVERY_TIME = 60 SECONDS; -- Verify freeze pattern is gone SELECT wait_type, waiting_tasks_count FROM sys.dm_os_wait_stats WHERE wait_type = 'WRITELOG';
Prevention
Every major version upgrade now includes a checklist of changed defaults. TARGET_RECOVERY_TIME is explicitly set during the upgrade script rather than inheriting the new default. Cumulative updates are applied within 30 days of upgrade, not deferred to the next maintenance window.
The 30-second monitoring loop that caught this pattern is now a permanent canary — any periodic stall triggers an immediate investigation. We built a lightweight agent that captures wait stats every second, computes the delta, and writes to a ring buffer. When any single wait type exceeds 5 seconds of accumulated wait in a 10-second window, it fires an alert with the wait type and a stack sample. This would have caught the indirect checkpoint stall on the first occurrence rather than waiting until Monday morning when the application team noticed.
The broader lesson from this incident is that database engine upgrades are not just schema compatibility checks. Internal algorithms change between versions — checkpoint behavior, cardinality estimation, locking strategies, memory management. A successful upgrade validates not just that queries return correct results, but that the engine's background processes behave within acceptable performance bounds under production workload. TARGET_RECOVERY_TIME is explicitly set during the upgrade script rather than inheriting the new default. Cumulative updates are applied within 30 days of upgrade, not deferred to the next maintenance window. The 30-second monitoring loop that caught this is now a permanent canary — any periodic stall pattern triggers an immediate investigation.