Monday morning, 08:17. Support tickets opened simultaneously across three enterprise accounts. The same stored procedure — usp_GetCustomerOrders — that had run in under 100ms on Friday was now timing out after 22 seconds in production. Dev showed 40ms. Staging showed 60ms. Production: 22 seconds. Nothing had deployed over the weekend.
The Alert
Application monitoring fired at 08:17:34 — stored procedure timeout errors on usp_GetCustomerOrders. CPU on the SQL Server was elevated at 68%. Memory pressure was normal. The procedure had not changed. No index changes. No statistics updates logged over the weekend. The on-call engineer opened the execution plan in production SSMS and saw something that did not match what dev showed at all.
First Hypothesis: Statistics Gone Stale
Stale statistics causing a bad cardinality estimate was the first assumption. Plausible — a weekend batch job could have shifted the data distribution. The fix would be straightforward: update stats and flush the plan.
SELECT s.name, s.last_updated, s.rows, s.rows_sampled FROM sys.dm_db_stats_properties(OBJECT_ID('dbo.Orders'), 1) s;
Statistics had updated Saturday night via the maintenance job. They were current. Hypothesis one ruled out.
Second Hypothesis: Plan Cache Corruption
Plan cache corruption after a restart or memory pressure event. We checked the plan cache directly.
SELECT cp.usecounts, cp.size_in_bytes, qp.query_plan, qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us, qs.total_logical_reads / qs.execution_count AS avg_reads FROM sys.dm_exec_cached_plans cp CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle WHERE st.text LIKE '%usp_GetCustomerOrders%';
Two cached plans for the same procedure. Average reads on one: 847. On the other: 4,200,000. Same procedure. Two wildly different plans. The cache was not corrupted — it was split.
The Discovery
Two plans for one procedure meant the procedure was being called with different SET options from different connections. The plan compiled for one SET option configuration was being used only by connections with that exact configuration. The 22-second plan had been compiled first — on Monday morning, by the application connection pool — and that plan was built around the first execution's parameter values.
SELECT cp.plan_handle, cp.usecounts, -- Decode the set_options bitmask CASE WHEN (qs.statement_sql_handle IS NOT NULL) THEN 'Has statement handle' END, cp.objtype, qp.query_plan.value('(//ParameterList/ColumnReference/@ParameterCompiledValue)[1]', 'nvarchar(100)') AS first_param_compiled FROM sys.dm_exec_cached_plans cp CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st WHERE st.objectid = OBJECT_ID('dbo.usp_GetCustomerOrders');
The compiled parameter value on the bad plan: @CustomerId = 1 — the internal test account, with 4.2 million orders. Every real customer had between 12 and 400 orders. The plan optimised for 4.2 million rows was being used for queries returning 50 rows.
Incident Timeline
| Time | Event |
|---|---|
| 08:00:01 | Application connection pool restarts — all cached plans cleared |
| 08:00:03 | First call to usp_GetCustomerOrders — parameter: @CustomerId = 1 (internal test) |
| 08:00:03 | Plan compiled for 4.2M row dataset. Clustered index scan. Nested loops. |
| 08:00:04 | All subsequent calls use the scan plan. 22 second execution per call. |
| 08:17:34 | Application timeout errors. Support tickets open. |
| 08:31:00 | On-call engineer identifies two plans in cache |
| 08:44:00 | Root cause confirmed — sniffed parameter @CustomerId=1 |
| 08:46:00 | EXEC sp_recompile applied. Bad plan flushed. |
| 08:47:00 | Procedure recompiles on next call with @CustomerId=10047. Plan optimised for real data. |
| 08:47:30 | p99 returns to 80ms. Incident resolved. |
Root Cause
SQL Server compiled the execution plan on the first call after the Monday morning application pool restart. That first call came from an internal test account — customer ID 1 — which has 4.2 million orders. The plan SQL Server built assumed every call would process millions of rows: clustered index scans, large memory grants, hash joins. Every subsequent call from real customers — with 12 to 400 orders — used that same plan. The scan that took 22 seconds on a 4.2M row dataset ran identically on a 50-row dataset.
Parameter sniffing is correct behaviour. SQL Server builds the best plan for the first parameters it sees. The problem was that the first parameters were statistically impossible for any real user.
SSMS has ARITHABORT ON by default. Most application drivers have ARITHABORT OFF. SQL Server treats these as different session contexts and caches separate plans. The DBA running the query in SSMS got the fast plan. The application driver got the slow one.
The Fix
-- Force recompile on next execution EXEC sp_recompile 'dbo.usp_GetCustomerOrders'; -- Verify the bad plan is gone SELECT cp.usecounts, qp.query_plan FROM sys.dm_exec_cached_plans cp CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st WHERE st.objectid = OBJECT_ID('dbo.usp_GetCustomerOrders');
ALTER PROCEDURE dbo.usp_GetCustomerOrders @CustomerId INT AS SELECT o.OrderId, o.OrderDate, o.Total FROM dbo.Orders o WHERE o.CustomerId = @CustomerId OPTION (OPTIMIZE FOR (@CustomerId UNKNOWN)); -- Forces SQL Server to use average statistics rather than sniffed parameter
Prevention
Two permanent changes. The test account (customer ID 1) was excluded from application pool warmup calls. And monitoring was added to alert on plan count per procedure — any procedure with more than 2 cached plans triggers an alert for investigation.
SELECT OBJECT_NAME(st.objectid) AS proc_name, COUNT(*) AS plan_count, SUM(cp.usecounts) AS total_executions, MAX(qs.total_elapsed_time / qs.execution_count) / 1000 AS max_avg_ms FROM sys.dm_exec_cached_plans cp CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle WHERE cp.objtype = 'Proc' GROUP BY st.objectid HAVING COUNT(*) > 2 ORDER BY plan_count DESC;