SQL Server 2016 Froze Every 30 Seconds

Monday after a weekend upgrade from SQL Server 2014 to 2016. The application team reported "periodic freezes" — every 30 seconds, every query on the server stalled for exactly 10 seconds, then resumed. CPU graph showed sharp square-wave dips. No errors in the SQL error log. No blocking. No deadlocks. The pattern was metronome-precise.

The Alert

We captured wait stats during one of the 10-second freezes using a loop that sampled every second.

Capture waits during the freeze window

SQL Server

WHILE 1 = 1
BEGIN
  INSERT INTO #wait_capture
  SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms
  FROM sys.dm_os_wait_stats
  WHERE wait_type NOT IN ('SLEEP_TASK','WAITFOR','LAZYWRITER_SLEEP');
  WAITFOR DELAY '00:00:01';
END;

During each freeze: WRITELOG waits spiked, PAGEIOLATCH_UP appeared, and all worker threads paused simultaneously. We captured three consecutive freeze cycles — all identical. WRITELOG spike at exactly 30-second intervals, lasting exactly 10 seconds. The precision ruled out application-level causes.

First Hypothesis: Disk I/O Bottleneck

WRITELOG suggested the log drive was stalling. We checked storage latency three times during freeze and non-freeze windows.

Check log file latency

SQL Server

SELECT
  DB_NAME(vfs.database_id) AS db_name,
  mf.physical_name,
  vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms,
  vfs.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf ON mf.database_id = vfs.database_id
  AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';

During non-freeze windows: write latency 0.4ms — excellent. During freeze windows: write latency jumped to 8,200ms. But the storage array showed no corresponding I/O spike. The writes were being held inside SQL Server, not delayed by the storage. We verified this three times: perfmon showed disk idle during the freeze, while SQL Server reported multi-second write stalls. The stall was inside the engine.

Second Hypothesis: Log Flush Contention

We suspected VLF fragmentation or auto-growth on the transaction log. Checked VLF count and auto-growth events three times.

VLF count: 12 — healthy. No auto-growth events in the past month. Log file pre-sized at 40GB with 8GB increments. This was not a log management issue.

The Discovery

We noticed the TARGET_RECOVERY_TIME on the primary database was set to 60 seconds — the new default in SQL Server 2016. In SQL Server 2014, the default was 0 (conventional checkpoint). SQL Server 2016 changed the default to 60 seconds, enabling indirect checkpoint automatically on upgrade.

Check indirect checkpoint configuration

SQL Server

SELECT
  name, target_recovery_time_in_seconds,
  log_reuse_wait_desc
FROM sys.databases
WHERE name = 'ProdDB';

-- Check checkpoint events in ring buffer
SELECT TOP 20
  record.value('(Record/@time)[1]', 'bigint') AS time_ms,
  record.value('(Record/Scheduler/@SchedulerID)[1]', 'int') AS scheduler
FROM (
  SELECT CAST(record AS XML) AS record
  FROM sys.dm_os_ring_buffers
  WHERE ring_buffer_type = 'RING_BUFFER_CHECKPOINT'
) x;

TARGET_RECOVERY_TIME: 60. We verified three times. This was a known SQL Server 2016 bug — the indirect checkpoint implementation had a defect where, under high write workloads, the checkpoint process would stall all log flushing while it computed dirty page targets. Microsoft documented this in KB4040276.

Incident Timeline

Time	Event
Sat 22:00	SQL Server upgraded from 2014 to 2016
Mon 08:00	Application team reports periodic 10-second freezes
Mon 09:30	Square-wave pattern confirmed — 30-second cycle
Mon 11:00	Disk ruled out — storage idle during freezes
Mon 13:00	Indirect checkpoint identified as root cause
Mon 13:05	TARGET_RECOVERY_TIME set to 0. Freezes stop immediately.

Root Cause — Product Bug

This was a confirmed product bug in SQL Server 2016 RTM and early CUs — KB4040276. The indirect checkpoint algorithm, when computing dirty page flush targets on databases with high write throughput, acquired an internal latch that blocked all concurrent log flushes for the duration of its calculation. The calculation took up to 10 seconds on databases with large buffer pools. Microsoft fixed this in CU6 for SQL Server 2016 SP1.

The upgrade from 2014 to 2016 silently changed TARGET_RECOVERY_TIME from 0 to 60, activating the buggy code path. No entry in the error log. No warning during upgrade. The bug only manifested under sustained write workloads — exactly what this OLTP database ran 24/7.

The Fix

Disable indirect checkpoint and apply the CU

SQL Server

-- Immediate: revert to conventional checkpoint
ALTER DATABASE ProdDB SET TARGET_RECOVERY_TIME = 0 SECONDS;

-- Permanent: apply CU6+ which fixes the latch contention
-- After patching, re-enable indirect checkpoint:
ALTER DATABASE ProdDB SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- Verify freeze pattern is gone
SELECT wait_type, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';

Prevention

Every major version upgrade now includes a checklist of changed defaults. TARGET_RECOVERY_TIME is explicitly set during the upgrade script rather than inheriting the new default. Cumulative updates are applied within 30 days of upgrade, not deferred to the next maintenance window.

The 30-second monitoring loop that caught this pattern is now a permanent canary — any periodic stall triggers an immediate investigation. We built a lightweight agent that captures wait stats every second, computes the delta, and writes to a ring buffer. When any single wait type exceeds 5 seconds of accumulated wait in a 10-second window, it fires an alert with the wait type and a stack sample. This would have caught the indirect checkpoint stall on the first occurrence rather than waiting until Monday morning when the application team noticed.

The broader lesson from this incident is that database engine upgrades are not just schema compatibility checks. Internal algorithms change between versions — checkpoint behavior, cardinality estimation, locking strategies, memory management. A successful upgrade validates not just that queries return correct results, but that the engine's background processes behave within acceptable performance bounds under production workload. TARGET_RECOVERY_TIME is explicitly set during the upgrade script rather than inheriting the new default. Cumulative updates are applied within 30 days of upgrade, not deferred to the next maintenance window. The 30-second monitoring loop that caught this is now a permanent canary — any periodic stall pattern triggers an immediate investigation.

SQL Server 2016 Froze Every 30 Seconds for Exactly 10 Seconds

The Alert

First Hypothesis: Disk I/O Bottleneck

Second Hypothesis: Log Flush Contention

The Discovery

Incident Timeline

Root Cause — Product Bug

The Fix

Prevention

The Same Problem on Other Engines