All Stories — QueryTuning

MySQL Query PerformancePost-Mortem NEW

The Subquery That Returned Correct Results in 5.7 and Wrong Results in 8.0

After upgrading from MySQL 5.7 to 8.0, a revenue report showed totals 23% higher than the payment processor. Same query, same data, different numbers. The optimizer's derived_merge silently eliminated a DISTINCT during subquery flattening — a confirmed product bug.

⏱ 11 minMySQL 8.0Jun 10, 2026

Read →

Oracle LockingPost-Mortem

DBMS_STATS Locked 200 Sessions Out for 8 Minutes

The nightly stats job overran into morning peak. Gathering statistics on a 400-partition table acquired exclusive dictionary locks that blocked every INSERT and UPDATE, cascading to 200 waiting sessions.

⏱ 10 minOracle 11g+Jun 10, 2026

Read →

Oracle Disk & I/OPost-Mortem

ORA-01652 Every Monday at 9:05 AM for Seven Weeks

Temp tablespace filled every Monday. The DBA added space each time. Seven weeks later, temp was 200GB. DEGREE DEFAULT was spawning 64 parallel slaves, each building its own hash table in temp.

⏱ 10 minOracle 11g+Jun 10, 2026

Read →

Oracle Query PerformancePost-Mortem

The Adaptive Plan That Chose Hash Join for 7 Rows

After upgrading to 12c, a core lookup randomly slowed from 2ms to 14 seconds. Adaptive plans were flipping join methods based on stale cardinality feedback from one outlier execution.

⏱ 10 minOracle 12c+Jun 10, 2026

Read →

Oracle CPUPost-Mortem

Result Cache Turned a 3-Second Query into a 45-Minute Latch Storm

A RESULT_CACHE hint — 150x faster in test. Under 200 concurrent sessions in production, every CPU core spun on a single latch. A confirmed Oracle 12c product bug.

⏱ 11 minOracle 12cJun 10, 2026

Read →

SQL Server LockingPost-Mortem

An Orphaned DTC Transaction Held a Lock for 11 Days

A linked-server query to Oracle failed mid-transaction. The remote side rolled back, but the local DTC transaction stayed open with an exclusive lock on a 400M-row table for 11 days.

⏱ 11 minSQL Server 2014+Jun 10, 2026

Read →

SQL Server Query PerformancePost-Mortem

Auto Stats Update Triggered 2,000 Recompilations in 90 Seconds

Synchronous auto stats fired on a hot table during ETL. Every cached plan was invalidated simultaneously — 2,000 recompilations in 90 seconds spiked CPU to 100%.

⏱ 11 minSQL Server 2016+Jun 10, 2026

Read →

SQL Server CPUPost-Mortem

SQL Server 2016 Froze Every 30 Seconds for Exactly 10 Seconds

After upgrading from 2014 to 2016, metronome-precise freezes. Storage was idle during the stalls. A confirmed product bug (KB4040276) in indirect checkpoint.

⏱ 10 minSQL Server 2016Jun 10, 2026

Read →

SQL Server Disk & I/OPost-Mortem

Ghost Cleanup Consumed All Disk IOPS After a 90M Row Delete

After a 90-million-row delete, read latency jumped to 200ms. No active queries writing. Ghost cleanup was reclaiming pages across 8 indexes at 15,000 IOPS — the full storage capacity.

⏱ 11 minSQL Server 2014+Jun 10, 2026

Read →

SQL Server MemoryPost-Mortem

Queries Waited 47 Minutes for Memory Grants That Never Came

340 sessions queued behind a single report granted 14GB and using 3GB. A 21x cardinality overestimate from 45-day-old statistics locked 58% of the workspace memory pool.

⏱ 12 minSQL Server 2016+Jun 10, 2026

Read →

SQL Server TempDBPost-Mortem

TempDB Allocation Contention Stalled Our Entire OLTP Workload — One Configuration Line Fixed It

Twelve stored procedures running concurrently triggered PAGELATCH_EX waits on TempDB allocation pages. CPU was at 6%, memory was fine, no blocking chains — the standard monitoring showed nothing. The fix was adding four TempDB data files. The root cause was a default SQL Server installation we had never questioned.

⏱ 7 minSQL Server 2014+Apr 16, 2026

Read →

SQL Server Query Performance Plan Cache

The SET Options Your Application Sets Without Telling You — and How They Silently Break Query Performance

Your DBA runs a query in SSMS in 200ms. The application runs the same SQL in 8 seconds. Same parameters, same server — different plan. The culprit is almost always ARITHABORT OFF on your application driver causing a plan cache split. Full diagnosis path: reading the bitmask in sys.dm_exec_cached_plans, finding split entries, the filtered index silent bypass, and the fix without touching application code.

⏱ 13 minSQL Server 2016+Apr 7, 2026

Read →

SQL Server Blocking Post-Mortem

58 Seconds. That's How Long Our Checkout Was Completely Frozen on Black Friday

One SPID. One uncommitted transaction. One developer who opened SSMS on production to 'just quickly check something.' The full post-mortem of a 58-second checkout freeze on the highest-traffic day of the year — the blocking chain that revealed it, why the lock held, and the idle-in-transaction timeout we now enforce on every production database.

⏱ 11 minSQL Server 2014+Mar 29, 2026

Read →

SQL Server Performance Parameter Sniffing

Same Stored Procedure. 40ms on Dev. 22 Seconds on Prod. Here's Exactly Why.

The stored procedure was identical on both environments. The index was identical. The execution plan was different. One environment had compiled the plan with a common parameter value — the other had cached a plan compiled for a rare edge-case value that forced a different join strategy. The full diagnosis, when OPTION(RECOMPILE) is the right answer, and when it isn't.

⏱ 11 minSQL Server 2014+Mar 25, 2026

Read →

SQL Server Concurrency RCSI

We Enabled Read-Committed Snapshot Isolation on a 2TB Live Database with Four Minutes of Impact

The documentation says enabling RCSI requires exclusive database access. Our database never had zero connections. Here is every step of the migration: how we staged it, how we monitored the version store growth during the switchover, what we tested first, and how four minutes of controlled impact affected 800 concurrent users less than we feared.

⏱ 13 minSQL Server 2016+Mar 22, 2026

Read →

SQL Server TempDB I/O

A Single Sort Query Wrote 400GB to TempDB in 12 Minutes and Took Down Reporting

The query had an ORDER BY. The execution plan showed a Sort operator with no memory grant warning in SSMS — because SSMS tested it with a small result set. In production, the sort received 4GB of data, spilled immediately to TempDB, and wrote 400GB of sort runs over 12 minutes. The sys.dm_io_virtual_file_stats query that caught it and the fix.

⏱ 11 minSQL Server 2016+Mar 18, 2026

Read →

SQL Server CPU Runbook

SQL Server CPU at 100%: The First Ten Queries to Run When the Alert Fires

Systematic runbook for CPU saturation in SQL Server. In order of likelihood: rogue query consuming CPU in a tight loop, missing index causing repeated table scans, blocking cascade where waiting sessions consume CPU spinning, runaway Agent job, or sudden connection surge. This walks all five paths with the exact DMV queries for each, in the order most likely to find your specific problem first.

⏱ 9 minSQL Server 2014+Mar 15, 2026

Read →

SQL Server Indexing Post-Mortem

We Had 47 Indexes on One Table. None of Them Were the Right One.

Years of automated index recommendations from sys.dm_db_missing_index_details had produced a table with 47 indexes and 800ms write latency. The audit process that revealed which indexes had zero reads in 90 days, which were redundant prefixes of wider indexes, and how we got from 47 to 12 — including the one index we had to add that DMVs had never suggested.

⏱ 14 minSQL Server 2014+Mar 11, 2026

Read →

SQL Server Deadlock Extended Events

Deadlock Between Two Stored Procedures — Diagnosed Using the Extended Events Deadlock Graph

Two stored procedures, both updating the orders and inventory tables, but in opposite order. Classic deadlock. The Extended Events system_health ring buffer captured the full XML deadlock graph. This walks through reading the graph, identifying the resource nodes, understanding which transaction was chosen as the victim and why, and the single code change that eliminated the deadlock permanently.

⏱ 12 minSQL Server 2012+Mar 5, 2026

Read →

SQL Server Locking Lock Escalation

Lock Escalation Was Quietly Converting Our Row Locks to Table Locks on Every Bulk Update

SQL Server automatically escalates from row locks to a single table lock when a transaction acquires more than 5,000 row locks. Our bulk update touched 6,000+ rows, triggered escalation, and blocked every read on the table for the duration. Why this happens, how to detect it in sys.dm_tran_locks, and the three ways to prevent it — including the one that doesn't require changing your query.

⏱ 10 minSQL Server 2008+Feb 28, 2026

Read →

PostgreSQL Performance Plan Regression

Why Your PostgreSQL Query Plan Changed Overnight — and How to Pin It Back

Nobody touched the schema. Nobody changed the query. But the planner picked a different execution path and p99 latency went from 18ms to 4.2 seconds overnight. This almost always comes down to one of three root causes: stale statistics, histogram skew on a key column, or a new index reshaping the planner's option set. Three diagnostic paths with exact queries at each step.

⏱ 12 minPostgreSQL 12+Apr 1, 2026

Read →

PostgreSQL Autovacuum IOPS

The Autovacuum That Quietly Ate Our Entire IOPS Budget for Six Weeks

Every night between 3 and 4 AM, database IOPS saturated and query latency spiked. By 5 AM it was gone with no error in the logs. Six weeks of investigation revealed autovacuum running with a cost_delay setting tuned for spinning disk on an NVMe SSD — throttled just enough to run for hours while barely making progress, consuming all available I/O the entire time.

⏱ 14 minPostgreSQL 10+Apr 2, 2026

Read →

PostgreSQL Replication WAL

The Replication Slot That Filled the Disk While We Watched the Wrong Dashboards

We were monitoring active replication senders — connection count, lag in bytes, lag in seconds. The idle slot that had been accumulating WAL for six days appeared on none of those dashboards. pg_replication_slots told a different story: one slot, zero active connections, 47GB of retained WAL and growing. The monitoring blind spot and how to close it permanently.

⏱ 11 minPostgreSQL 10+Mar 28, 2026

Read →

PostgreSQL Memory OOM

work_mem Set Too High: How We Triggered an OOM Kill on Every Reporting Query

Setting work_mem to 1GB for reporting queries felt conservative. Then three simultaneous reports each spawned 12 sort nodes. The per-process memory calculation nobody does before setting this parameter globally — and why the fix was ALTER ROLE reporter SET work_mem = '64MB' rather than touching the global setting that was affecting every session on the server.

⏱ 8 minPostgreSQL 9.4+Mar 21, 2026

Read →

PostgreSQL Connections PgBouncer

2,000 Idle Connections Were Consuming an Entire CPU Core Before We Found It

Each idle PostgreSQL connection holds a backend process. At 2,000 connections, Linux process scheduling overhead alone was consuming a full CPU core and adding 30ms of latency to every query. The diagnosis query, why raising max_connections made things worse, and the PgBouncer transaction-mode configuration that reduced effective connections from 2,000 to 40.

⏱ 10 minPostgreSQL 9.6+Mar 14, 2026

Read →

PostgreSQL WAL Archiving Disk

Our archive_command Was Silently Returning 0 While Failing — WAL Filled the Disk in Six Days

The archive_command shell wrapper was catching rsync errors and returning exit code 0 to PostgreSQL. PostgreSQL believed archiving was succeeding. WAL segments accumulated in pg_wal for six days until the disk hit 100%. The pg_stat_archiver query that would have caught this on day one, and the monitoring alert that now runs every 5 minutes.

⏱ 9 minPostgreSQL 9.4+Mar 7, 2026

Read →

PostgreSQL Indexing Partial Index

The Partial Index That Fixed a Reporting Query Nobody Could Crack for Two Years

A 200-million-row orders table, a filter on status = 'active' — which covered 0.3% of rows. A standard B-tree index on (status, created_at) was useless because the planner estimated it would need to scan too many index pages to reach the active rows. A partial index with WHERE status = 'active' contained only those 600,000 rows and dropped query time from 40 seconds to 180ms.

⏱ 9 minPostgreSQL 9.5+Feb 28, 2026

Read →

PostgreSQL Indexing Post-Mortem

The Index That Made Our Queries Slower — and Why the Planner Chose It Anyway

A composite index on (user_id, created_at, status) looked correct — it covered all the WHERE clause columns. But the planner chose it over a smaller partial index covering only active rows because the statistics made it appear cheaper. The full explanation of why the planner was wrong, how to read pg_stats to understand the bad estimate, and the fix.

⏱ 12 minPostgreSQL 10+Feb 21, 2026

Read →

PostgreSQL Vacuum Bloat

Table Bloat After a Mass Delete: What PostgreSQL Does with Your Disk Space

We deleted 70% of a 500GB table. Disk usage barely changed. PostgreSQL marks deleted rows as dead tuples but does not return the space to the OS immediately — and a standard VACUUM only makes the space available for reuse within the table. The explanation of why this is correct behaviour, when to use VACUUM FULL, and why pg_repack is almost always the better answer.

⏱ 10 minPostgreSQL 9.5+Feb 14, 2026

Read →

MySQL Replication Post-Mortem

Our Read Replica Was 14 Hours Behind — and No Alert Fired

Seconds_Behind_Master read 0 in SHOW SLAVE STATUS. But the replica was 14 hours behind. The metric lied because a stalled SQL thread and an idle I/O thread produce a 0 lag reading even when no replication is happening. The pt-heartbeat approach to accurate lag measurement, the alert threshold that should have fired, and how a single large transaction caused the delay.

⏱ 10 minMySQL 5.7+ / MariaDB 10Mar 26, 2026

Read →

MySQL Deadlock Gap Locks

InnoDB Gap Locks and the Deadlock Between Rows That Didn't Exist Yet

Two transactions. Different target rows. Neither was modifying existing data the other needed. Still deadlocked. REPEATABLE READ isolation level and a range query in one transaction caused a gap lock on the range where the other transaction wanted to insert. The SHOW ENGINE INNODB STATUS deadlock graph, exactly which gap lock was held, and the isolation level change that eliminated it.

⏱ 10 minMySQL 5.6+ / MariaDB 10Mar 19, 2026

Read →

MySQL Connections Post-Mortem

Too Many Connections: Why Raising max_connections Made Our Outage Worse

We hit max_connections=151 and raised it to 500. Within 60 seconds, 500 connections were open and performance was worse than before. The hidden cost of MySQL connections — each holds memory, file descriptors, and thread resources — and why connection pooling with ProxySQL or MySQL Router, not a higher max_connections, is the correct solution.

⏱ 9 minMySQL 5.7+Mar 12, 2026

Read →

MySQL Performance Slow Query Log

The Slow Query Log Saved Our Migration. Here's How to Actually Read It.

We enabled the slow query log during a schema migration and found three queries we didn't know existed consuming 80% of database time. How to enable it without performance impact, what the critical fields mean (Query_time vs Lock_time vs Rows_examined), and how pt-query-digest groups and ranks 10,000 slow query events into an actionable report in under a minute.

⏱ 8 minMySQL 5.6+ / MariaDB 10Mar 5, 2026

Read →

MySQL Memory InnoDB

InnoDB Buffer Pool Was 4GB on a Server with 64GB RAM and We Wondered Why Reads Were Slow

The default InnoDB buffer pool is 128MB. We had set it to 4GB — which seemed generous until we checked how large our working dataset actually was. The queries to measure your actual working set size, how to interpret the buffer pool hit rate, why the hit rate metric alone is misleading, and the rule for sizing the buffer pool correctly.

⏱ 8 minMySQL 5.7+ / MariaDB 10Feb 26, 2026

Read →

MySQL Replication Filters

Replica Filters Were Silently Dropping Rows and We Didn't Notice for Three Weeks

REPLICATE_IGNORE_TABLE was configured on our replica to skip an audit log table. When the application started writing to a new table with a similar naming pattern, the filter matched it and dropped all replication events for that table. The replica silently diverged from primary for three weeks. How we detected the divergence with pt-table-checksum and how we recovered.

⏱ 11 minMySQL 5.6+ / MariaDB 10Feb 19, 2026

Read →

MySQL Binary Log Disk

Binary Logs Filled the Disk Because expire_logs_days Was Set to 30 on a High-Write Server

Our MySQL server generated 15GB of binary log per day. With expire_logs_days = 30, we needed 450GB just for binary log retention — on a 200GB volume. The calculation every DBA should do before setting binary log retention, the PURGE BINARY LOGS command to safely reclaim space immediately, and how to set binlog_expire_logs_seconds correctly for your write rate.

⏱ 7 minMySQL 5.7+ / MariaDB 10Feb 12, 2026

Read →

MySQL Schema Online DDL

ALTER TABLE Took the Table Offline for 4 Hours Until We Used the Right Syntax

Adding a column to a 200-million-row table with the wrong ALTER TABLE syntax acquired a metadata lock that blocked all reads and writes for four hours. The difference between ALGORITHM=COPY, ALGORITHM=INPLACE, and ALGORITHM=INSTANT, which changes each supports, and why pt-online-schema-change or gh-ost is still the right answer for tables you cannot afford to lock.

⏱ 10 minMySQL 5.6+ / MariaDB 10Feb 5, 2026

Read →

Oracle Undo / MVCC Post-Mortem

ORA-01555 Snapshot Too Old: The Error That Came Back Every Tuesday at 2 PM

Every Tuesday at 2:05 PM, the month-end reporting job failed with ORA-01555. Same time every week, for eleven consecutive weeks. By the time the DBA team looked, the database was healthy and the evidence was gone. V$UNDOSTAT revealed the cause: a 22-minute report running in parallel with a batch job generating 40x normal undo volume, against an UNDO_RETENTION set to only 15 minutes.

⏱ 11 minOracle 11g+Mar 18, 2026

Read →

Oracle Locking Post-Mortem

V$SESSION Shows the Blocker as Dead. The Lock Is Very Much Alive.

V$SESSION showed STATUS=KILLED for the blocking session. The application had crashed at the OS level 40 minutes earlier. But PMON was still rolling back 2GB of uncommitted work — and every lock that session held was still active. The query to monitor PMON rollback progress, when to escalate to OS-level kill, and the idle session timeout profile that limits how long a crashed session can block.

⏱ 9 minOracle 11g+Mar 11, 2026

Read →

Oracle Memory Shared Pool

ORA-04031: Shared Pool Exhaustion During the Month-End Report Window

The reporting framework was generating one unique cursor per order ID — 94,000 distinct SQL strings for what was effectively one parameterised query. Each one lived in the shared pool as a separate non-reusable cursor, filling it completely. The V$SQL query that revealed the cursor flood, CURSOR_SHARING as an emergency fix, and why bind variables in application code is the only real solution.

⏱ 10 minOracle 11g+Mar 5, 2026

Read →

Oracle Undo / Memory detailed analysis

Undo Segment Contention: When V$WAITSTAT Shows undo header Waits

During peak hours, AWR Top 5 Timed Events showed undo segment tx slot and undo header waits consuming 18% of DB time. Load was identical to previous weeks. The database had grown past the point where its undo configuration could handle the concurrency. How to read V$WAITSTAT for undo contention, calculate required undo tablespace size from V$UNDOSTAT, and fix it without downtime.

⏱ 10 minOracle 10g+Feb 26, 2026

Read →

Oracle RAC High Availability Post-Mortem

RAC Node Eviction at 2 AM: The Network Split Nobody Saw Coming

Node 2 was forcibly evicted by CRS and rebooted. All 847 sessions dropped. The cause: a 1,800ms interconnect latency spike during a network maintenance window the DBA team had not been told about. CSS heartbeats missed their delivery window and node 1 concluded node 2 was dead. The CRS and alert log evidence, how to detect interconnect problems proactively, and why increasing CSS timeouts is not the fix.

⏱ 12 minOracle RAC 12c+Feb 19, 2026

Read →

Oracle CPU / Performance Post-Mortem

Parallel Query Ate All 64 CPU Cores and Killed Everything Else

One analyst query. One 900-million-row fact table with DEGREE DEFAULT set during creation. Oracle spawned 64 parallel slave processes, monopolised every CPU core on the server, and pushed OLTP query latency from milliseconds to 40 seconds. How to find and kill the coordinator session, setting PARALLEL_MAX_SERVERS, and the Oracle Resource Manager plan that now gives analysts 20% CPU and OLTP 80%.

⏱ 11 minOracle 12c+Feb 12, 2026

Read →