The migration script had run successfully in staging. It ran for 4 hours in production before a junior engineer noticed the entire orders table was locked and terminated it. 4 hours of queued reads and writes, every order placement failing with a lock wait timeout. The script used the wrong ALGORITHM — one that required a full table rebuild with an exclusive metadata lock instead of the online path that would have taken 40 seconds.
The Alert
09:15. Application error rate climbs immediately after a deployment step runs ALTER TABLE orders ADD COLUMN notes TEXT. Every insert and select on orders returns lock wait timeout. The ALTER was still running. It had been running for 3 minutes before the first alert fired.
First Hypothesis: Large Table, Slow Operation
The orders table had 200 million rows. An ALTER TABLE on 200M rows takes time. The assumption was that this was expected and would complete. At the 2-hour mark, with no end in sight and orders failing, the assumption changed.
The Discovery
-- Find the ALTER TABLE process: SELECT id, user, host, db, command, time, state, info FROM information_schema.processlist WHERE info LIKE 'ALTER%'; -- State: "copy to tmp table" = ALGORITHM=COPY, full table rebuild with lock -- State: "Waiting for table metadata lock" = stuck waiting on the ALTER -- Count waiting connections: SELECT COUNT(*) AS waiting_count, state FROM information_schema.processlist WHERE state LIKE 'Waiting for table%' GROUP BY state;
State: copy to tmp table. ALGORITHM=COPY. MySQL was copying every row from orders to a new tmp table with the new column, then renaming. For 200 million rows this took hours. During the copy, a shared metadata lock prevents concurrent writes. Every INSERT and UPDATE to orders was queuing behind the ALTER.
Incident Timeline
| Time | Event |
|---|---|
| 09:15 | ALTER TABLE orders ADD COLUMN notes TEXT begins. ALGORITHM=COPY (default). |
| 09:15 | Shared metadata lock acquired. Writes begin queuing. |
| 09:17 | Application timeouts begin. Error rate climbs. |
| 11:15 | Decision to kill the ALTER TABLE. KILL [process_id]. |
| 11:20 | ALTER killed. Rollback begins — another 15 minutes to clean tmp table. |
| 11:35 | Table available. Orders resume. Lock wait queue drains. |
| 12:00 | Same ALTER run with ALGORITHM=INSTANT. Completes in under 1 second. |
Root Cause
MySQL's default ALTER TABLE algorithm on older versions is COPY — which rebuilds the entire table. Adding a nullable column with no default in MySQL 8.0 supports ALGORITHM=INSTANT, which completes in milliseconds by modifying only the table metadata. The script had not specified ALGORITHM, using the slowest path on a 200M row table.
The Fix
-- Test which algorithms are available first: ALTER TABLE orders ADD COLUMN notes TEXT ALGORITHM = INSTANT; -- If ERROR 1846: INSTANT not supported, try INPLACE -- If INPLACE fails, use gh-ost or pt-online-schema-change -- ALGORITHM=INSTANT: metadata only, zero lock, zero data copy (MySQL 8.0) -- ALGORITHM=INPLACE: modifies in-place, allows concurrent DML -- ALGORITHM=COPY: full table copy, blocks writes — avoid on large tables -- Fail fast if the operation would lock: ALTER TABLE orders ADD COLUMN notes TEXT ALGORITHM = INSTANT, LOCK = NONE; -- fails immediately if a lock would be required
Prevention
All schema changes on tables over 1M rows require explicit ALGORITHM=INSTANT, LOCK=NONE or use gh-ost (GitHub's online schema change tool). Any ALTER that cannot run with LOCK=NONE goes through the gh-ost process which uses a ghost table, triggers, and final atomic swap with no application downtime. The migration script checklist now includes algorithm verification on staging with production row counts simulated via EXPLAIN.