SQL Server
TempDBPost-Mortem · RCA
NEW
TempDB Allocation Contention Stalled Our Entire OLTP Workload — One Configuration Line Fixed It
Twelve stored procedures running concurrently triggered PAGELATCH_EX waits on TempDB allocation pages. CPU was at 6%, memory was fine, no blocking chains — the standard monitoring showed nothing. The fix was adding four TempDB data files. The root cause was a default SQL Server installation we had never questioned.
⏱ 7 minSQL Server 2014+Apr 16, 2026
Read →
SQL Server
Query Performance
Plan Cache
The SET Options Your Application Sets Without Telling You — and How They Silently Break Query Performance
Your DBA runs a query in SSMS in 200ms. The application runs the same SQL in 8 seconds. Same parameters, same server — different plan. The culprit is almost always ARITHABORT OFF on your application driver causing a plan cache split. Full diagnosis path: reading the bitmask in sys.dm_exec_cached_plans, finding split entries, the filtered index silent bypass, and the fix without touching application code.
⏱ 13 minSQL Server 2016+Apr 7, 2026
Read →
SQL Server
Blocking
Post-Mortem
58 Seconds. That's How Long Our Checkout Was Completely Frozen on Black Friday
One SPID. One uncommitted transaction. One developer who opened SSMS on production to 'just quickly check something.' The full post-mortem of a 58-second checkout freeze on the highest-traffic day of the year — the blocking chain that revealed it, why the lock held, and the idle-in-transaction timeout we now enforce on every production database.
⏱ 11 minSQL Server 2014+Mar 29, 2026
Read →
SQL Server
Performance
Parameter Sniffing
Same Stored Procedure. 40ms on Dev. 22 Seconds on Prod. Here's Exactly Why.
The stored procedure was identical on both environments. The index was identical. The execution plan was different. One environment had compiled the plan with a common parameter value — the other had cached a plan compiled for a rare edge-case value that forced a different join strategy. The full diagnosis, when OPTION(RECOMPILE) is the right answer, and when it isn't.
⏱ 11 minSQL Server 2014+Mar 25, 2026
Read →
SQL Server
Concurrency
RCSI
We Enabled Read-Committed Snapshot Isolation on a 2TB Live Database with Four Minutes of Impact
The documentation says enabling RCSI requires exclusive database access. Our database never had zero connections. Here is every step of the migration: how we staged it, how we monitored the version store growth during the switchover, what we tested first, and how four minutes of controlled impact affected 800 concurrent users less than we feared.
⏱ 13 minSQL Server 2016+Mar 22, 2026
Read →
SQL Server
TempDB
I/O
A Single Sort Query Wrote 400GB to TempDB in 12 Minutes and Took Down Reporting
The query had an ORDER BY. The execution plan showed a Sort operator with no memory grant warning in SSMS — because SSMS tested it with a small result set. In production, the sort received 4GB of data, spilled immediately to TempDB, and wrote 400GB of sort runs over 12 minutes. The sys.dm_io_virtual_file_stats query that caught it and the fix.
⏱ 11 minSQL Server 2016+Mar 18, 2026
Read →
SQL Server
CPU
Runbook
SQL Server CPU at 100%: The First Ten Queries to Run When the Alert Fires
Systematic runbook for CPU saturation in SQL Server. In order of likelihood: rogue query consuming CPU in a tight loop, missing index causing repeated table scans, blocking cascade where waiting sessions consume CPU spinning, runaway Agent job, or sudden connection surge. This walks all five paths with the exact DMV queries for each, in the order most likely to find your specific problem first.
⏱ 9 minSQL Server 2014+Mar 15, 2026
Read →
SQL Server
Indexing
Post-Mortem
We Had 47 Indexes on One Table. None of Them Were the Right One.
Years of automated index recommendations from sys.dm_db_missing_index_details had produced a table with 47 indexes and 800ms write latency. The audit process that revealed which indexes had zero reads in 90 days, which were redundant prefixes of wider indexes, and how we got from 47 to 12 — including the one index we had to add that DMVs had never suggested.
⏱ 14 minSQL Server 2014+Mar 11, 2026
Read →
SQL Server
Deadlock
Extended Events
Deadlock Between Two Stored Procedures — Diagnosed Using the Extended Events Deadlock Graph
Two stored procedures, both updating the orders and inventory tables, but in opposite order. Classic deadlock. The Extended Events system_health ring buffer captured the full XML deadlock graph. This walks through reading the graph, identifying the resource nodes, understanding which transaction was chosen as the victim and why, and the single code change that eliminated the deadlock permanently.
⏱ 12 minSQL Server 2012+Mar 5, 2026
Read →
SQL Server
Locking
Lock Escalation
Lock Escalation Was Quietly Converting Our Row Locks to Table Locks on Every Bulk Update
SQL Server automatically escalates from row locks to a single table lock when a transaction acquires more than 5,000 row locks. Our bulk update touched 6,000+ rows, triggered escalation, and blocked every read on the table for the duration. Why this happens, how to detect it in sys.dm_tran_locks, and the three ways to prevent it — including the one that doesn't require changing your query.
⏱ 10 minSQL Server 2008+Feb 28, 2026
Read →
PostgreSQL
Performance
Plan Regression
Why Your PostgreSQL Query Plan Changed Overnight — and How to Pin It Back
Nobody touched the schema. Nobody changed the query. But the planner picked a different execution path and p99 latency went from 18ms to 4.2 seconds overnight. This almost always comes down to one of three root causes: stale statistics, histogram skew on a key column, or a new index reshaping the planner's option set. Three diagnostic paths with exact queries at each step.
⏱ 12 minPostgreSQL 12+Apr 1, 2026
Read →
PostgreSQL
Autovacuum
IOPS
The Autovacuum That Quietly Ate Our Entire IOPS Budget for Six Weeks
Every night between 3 and 4 AM, database IOPS saturated and query latency spiked. By 5 AM it was gone with no error in the logs. Six weeks of investigation revealed autovacuum running with a cost_delay setting tuned for spinning disk on an NVMe SSD — throttled just enough to run for hours while barely making progress, consuming all available I/O the entire time.
⏱ 14 minPostgreSQL 10+Apr 2, 2026
Read →
PostgreSQL
Replication
WAL
The Replication Slot That Filled the Disk While We Watched the Wrong Dashboards
We were monitoring active replication senders — connection count, lag in bytes, lag in seconds. The idle slot that had been accumulating WAL for six days appeared on none of those dashboards. pg_replication_slots told a different story: one slot, zero active connections, 47GB of retained WAL and growing. The monitoring blind spot and how to close it permanently.
⏱ 11 minPostgreSQL 10+Mar 28, 2026
Read →
PostgreSQL
Memory
OOM
work_mem Set Too High: How We Triggered an OOM Kill on Every Reporting Query
Setting work_mem to 1GB for reporting queries felt conservative. Then three simultaneous reports each spawned 12 sort nodes. The per-process memory calculation nobody does before setting this parameter globally — and why the fix was ALTER ROLE reporter SET work_mem = '64MB' rather than touching the global setting that was affecting every session on the server.
⏱ 8 minPostgreSQL 9.4+Mar 21, 2026
Read →
PostgreSQL
Connections
PgBouncer
2,000 Idle Connections Were Consuming an Entire CPU Core Before We Found It
Each idle PostgreSQL connection holds a backend process. At 2,000 connections, Linux process scheduling overhead alone was consuming a full CPU core and adding 30ms of latency to every query. The diagnosis query, why raising max_connections made things worse, and the PgBouncer transaction-mode configuration that reduced effective connections from 2,000 to 40.
⏱ 10 minPostgreSQL 9.6+Mar 14, 2026
Read →
PostgreSQL
WAL Archiving
Disk
Our archive_command Was Silently Returning 0 While Failing — WAL Filled the Disk in Six Days
The archive_command shell wrapper was catching rsync errors and returning exit code 0 to PostgreSQL. PostgreSQL believed archiving was succeeding. WAL segments accumulated in pg_wal for six days until the disk hit 100%. The pg_stat_archiver query that would have caught this on day one, and the monitoring alert that now runs every 5 minutes.
⏱ 9 minPostgreSQL 9.4+Mar 7, 2026
Read →
PostgreSQL
Indexing
Partial Index
The Partial Index That Fixed a Reporting Query Nobody Could Crack for Two Years
A 200-million-row orders table, a filter on status = 'active' — which covered 0.3% of rows. A standard B-tree index on (status, created_at) was useless because the planner estimated it would need to scan too many index pages to reach the active rows. A partial index with WHERE status = 'active' contained only those 600,000 rows and dropped query time from 40 seconds to 180ms.
⏱ 9 minPostgreSQL 9.5+Feb 28, 2026
Read →
PostgreSQL
Indexing
Post-Mortem
The Index That Made Our Queries Slower — and Why the Planner Chose It Anyway
A composite index on (user_id, created_at, status) looked correct — it covered all the WHERE clause columns. But the planner chose it over a smaller partial index covering only active rows because the statistics made it appear cheaper. The full explanation of why the planner was wrong, how to read pg_stats to understand the bad estimate, and the fix.
⏱ 12 minPostgreSQL 10+Feb 21, 2026
Read →
PostgreSQL
Vacuum
Bloat
Table Bloat After a Mass Delete: What PostgreSQL Does with Your Disk Space
We deleted 70% of a 500GB table. Disk usage barely changed. PostgreSQL marks deleted rows as dead tuples but does not return the space to the OS immediately — and a standard VACUUM only makes the space available for reuse within the table. The explanation of why this is correct behaviour, when to use VACUUM FULL, and why pg_repack is almost always the better answer.
⏱ 10 minPostgreSQL 9.5+Feb 14, 2026
Read →
MySQL
Replication
Post-Mortem
Our Read Replica Was 14 Hours Behind — and No Alert Fired
Seconds_Behind_Master read 0 in SHOW SLAVE STATUS. But the replica was 14 hours behind. The metric lied because a stalled SQL thread and an idle I/O thread produce a 0 lag reading even when no replication is happening. The pt-heartbeat approach to accurate lag measurement, the alert threshold that should have fired, and how a single large transaction caused the delay.
⏱ 10 minMySQL 5.7+ / MariaDB 10Mar 26, 2026
Read →
MySQL
Deadlock
Gap Locks
InnoDB Gap Locks and the Deadlock Between Rows That Didn't Exist Yet
Two transactions. Different target rows. Neither was modifying existing data the other needed. Still deadlocked. REPEATABLE READ isolation level and a range query in one transaction caused a gap lock on the range where the other transaction wanted to insert. The SHOW ENGINE INNODB STATUS deadlock graph, exactly which gap lock was held, and the isolation level change that eliminated it.
⏱ 10 minMySQL 5.6+ / MariaDB 10Mar 19, 2026
Read →
MySQL
Connections
Post-Mortem
Too Many Connections: Why Raising max_connections Made Our Outage Worse
We hit max_connections=151 and raised it to 500. Within 60 seconds, 500 connections were open and performance was worse than before. The hidden cost of MySQL connections — each holds memory, file descriptors, and thread resources — and why connection pooling with ProxySQL or MySQL Router, not a higher max_connections, is the correct solution.
⏱ 9 minMySQL 5.7+Mar 12, 2026
Read →
MySQL
Performance
Slow Query Log
The Slow Query Log Saved Our Migration. Here's How to Actually Read It.
We enabled the slow query log during a schema migration and found three queries we didn't know existed consuming 80% of database time. How to enable it without performance impact, what the critical fields mean (Query_time vs Lock_time vs Rows_examined), and how pt-query-digest groups and ranks 10,000 slow query events into an actionable report in under a minute.
⏱ 8 minMySQL 5.6+ / MariaDB 10Mar 5, 2026
Read →
MySQL
Memory
InnoDB
InnoDB Buffer Pool Was 4GB on a Server with 64GB RAM and We Wondered Why Reads Were Slow
The default InnoDB buffer pool is 128MB. We had set it to 4GB — which seemed generous until we checked how large our working dataset actually was. The queries to measure your actual working set size, how to interpret the buffer pool hit rate, why the hit rate metric alone is misleading, and the rule for sizing the buffer pool correctly.
⏱ 8 minMySQL 5.7+ / MariaDB 10Feb 26, 2026
Read →
MySQL
Replication
Filters
Replica Filters Were Silently Dropping Rows and We Didn't Notice for Three Weeks
REPLICATE_IGNORE_TABLE was configured on our replica to skip an audit log table. When the application started writing to a new table with a similar naming pattern, the filter matched it and dropped all replication events for that table. The replica silently diverged from primary for three weeks. How we detected the divergence with pt-table-checksum and how we recovered.
⏱ 11 minMySQL 5.6+ / MariaDB 10Feb 19, 2026
Read →
MySQL
Binary Log
Disk
Binary Logs Filled the Disk Because expire_logs_days Was Set to 30 on a High-Write Server
Our MySQL server generated 15GB of binary log per day. With expire_logs_days = 30, we needed 450GB just for binary log retention — on a 200GB volume. The calculation every DBA should do before setting binary log retention, the PURGE BINARY LOGS command to safely reclaim space immediately, and how to set binlog_expire_logs_seconds correctly for your write rate.
⏱ 7 minMySQL 5.7+ / MariaDB 10Feb 12, 2026
Read →
MySQL
Schema
Online DDL
ALTER TABLE Took the Table Offline for 4 Hours Until We Used the Right Syntax
Adding a column to a 200-million-row table with the wrong ALTER TABLE syntax acquired a metadata lock that blocked all reads and writes for four hours. The difference between ALGORITHM=COPY, ALGORITHM=INPLACE, and ALGORITHM=INSTANT, which changes each supports, and why pt-online-schema-change or gh-ost is still the right answer for tables you cannot afford to lock.
⏱ 10 minMySQL 5.6+ / MariaDB 10Feb 5, 2026
Read →
Oracle
Undo / MVCC
Post-Mortem
ORA-01555 Snapshot Too Old: The Error That Came Back Every Tuesday at 2 PM
Every Tuesday at 2:05 PM, the month-end reporting job failed with ORA-01555. Same time every week, for eleven consecutive weeks. By the time the DBA team looked, the database was healthy and the evidence was gone. V$UNDOSTAT revealed the cause: a 22-minute report running in parallel with a batch job generating 40x normal undo volume, against an UNDO_RETENTION set to only 15 minutes.
⏱ 11 minOracle 11g+Mar 18, 2026
Read →
Oracle
Locking
Post-Mortem
V$SESSION Shows the Blocker as Dead. The Lock Is Very Much Alive.
V$SESSION showed STATUS=KILLED for the blocking session. The application had crashed at the OS level 40 minutes earlier. But PMON was still rolling back 2GB of uncommitted work — and every lock that session held was still active. The query to monitor PMON rollback progress, when to escalate to OS-level kill, and the idle session timeout profile that limits how long a crashed session can block.
⏱ 9 minOracle 11g+Mar 11, 2026
Read →
Oracle
Memory
Shared Pool
ORA-04031: Shared Pool Exhaustion During the Month-End Report Window
The reporting framework was generating one unique cursor per order ID — 94,000 distinct SQL strings for what was effectively one parameterised query. Each one lived in the shared pool as a separate non-reusable cursor, filling it completely. The V$SQL query that revealed the cursor flood, CURSOR_SHARING as an emergency fix, and why bind variables in application code is the only real solution.
⏱ 10 minOracle 11g+Mar 5, 2026
Read →
Oracle
Undo / Memory
detailed analysis
Undo Segment Contention: When V$WAITSTAT Shows undo header Waits
During peak hours, AWR Top 5 Timed Events showed undo segment tx slot and undo header waits consuming 18% of DB time. Load was identical to previous weeks. The database had grown past the point where its undo configuration could handle the concurrency. How to read V$WAITSTAT for undo contention, calculate required undo tablespace size from V$UNDOSTAT, and fix it without downtime.
⏱ 10 minOracle 10g+Feb 26, 2026
Read →
Oracle RAC
High Availability
Post-Mortem
RAC Node Eviction at 2 AM: The Network Split Nobody Saw Coming
Node 2 was forcibly evicted by CRS and rebooted. All 847 sessions dropped. The cause: a 1,800ms interconnect latency spike during a network maintenance window the DBA team had not been told about. CSS heartbeats missed their delivery window and node 1 concluded node 2 was dead. The CRS and alert log evidence, how to detect interconnect problems proactively, and why increasing CSS timeouts is not the fix.
⏱ 12 minOracle RAC 12c+Feb 19, 2026
Read →
Oracle
CPU / Performance
Post-Mortem
Parallel Query Ate All 64 CPU Cores and Killed Everything Else
One analyst query. One 900-million-row fact table with DEGREE DEFAULT set during creation. Oracle spawned 64 parallel slave processes, monopolised every CPU core on the server, and pushed OLTP query latency from milliseconds to 40 seconds. How to find and kill the coordinator session, setting PARALLEL_MAX_SERVERS, and the Oracle Resource Manager plan that now gives analysts 20% CPU and OLTP 80%.
⏱ 11 minOracle 12c+Feb 12, 2026
Read →