The read replica had been processing a single transaction for 14 hours. Seconds_Behind_Master reported 0 throughout. Every application component reading from the replica was getting data that was 14 hours old. The metric everyone was watching was lying.
The Alert
A customer reported that completed orders were not showing in their account history. The application read order history from the replica. The primary showed the orders as committed. The replica did not. Seconds_Behind_Master: 0. No replication errors in the MySQL error log.
First Hypothesis: Replication Broken
If Seconds_Behind_Master is 0 but data is missing, replication had stopped and the slave status was wrong. We checked SHOW SLAVE STATUS fully.
SHOW SLAVE STATUS\G
-- Key fields to check:
-- Slave_IO_Running: Yes (IO thread receiving binlog)
-- Slave_SQL_Running: Yes (SQL thread applying events)
-- Seconds_Behind_Master: 0
-- Last_SQL_Errno: 0
-- Exec_Master_Log_Pos vs Read_Master_Log_PosIO thread: running. SQL thread: running. Seconds_Behind_Master: 0. No errors. But Exec_Master_Log_Pos had not changed in 14 hours. The SQL thread was applying one transaction that had been running since 14:23 the day before.
The Discovery
-- Seconds_Behind_Master measures lag in binlog positions, -- not wall-clock time behind the primary. -- A single long-running transaction shows 0 while still applying. -- The real measure: -- On PRIMARY: SHOW MASTER STATUS; -- Record: File, Position -- On REPLICA: SHOW SLAVE STATUS\G -- Compare: Relay_Master_Log_File + Exec_Master_Log_Pos -- If master Position >> Exec_Master_Log_Pos, replica is behind -- even if Seconds_Behind_Master = 0 -- Check what the SQL thread is currently applying: SELECT * FROM performance_schema.events_statements_current WHERE processlist_user = 'system user'\G
The SQL thread was applying a mass UPDATE — 8 million rows, part of a batch job that had run on the primary at 14:23. On the primary it had committed in 4 minutes. On the replica it was single-threaded and had been running for 14 hours. MySQL 5.7 single-threaded SQL replication serializes all events through one thread.
Incident Timeline
| Time | Event |
|---|---|
| Primary 14:23 | Mass UPDATE: 8 million rows modified in one transaction. Commits in 4 minutes. |
| Replica 14:23 | SQL thread begins applying the transaction. Single-threaded. |
| Replica 14:23-next day | SQL thread still applying. Seconds_Behind_Master: 0. Position frozen. |
| Next day 04:00 | Customer reports stale data. Investigation begins. |
| +30 min | Exec_Master_Log_Pos mismatch identified. 14-hour applying transaction found. |
| +2 hours | Transaction completes on replica. Replica catches up immediately. |
Root Cause
Seconds_Behind_Master measures the timestamp difference between the last event the replica applied and the current master time. While the replica is actively applying a transaction, the metric reports the time of the event being applied — not how far behind the replica is in wall-clock terms. A 14-hour transaction applying on the replica shows Seconds_Behind_Master = 0 because the replica is actively executing.
The Fix
-- On PRIMARY: create a heartbeat table and update it every minute CREATE TABLE replication_heartbeat ( server_id INT PRIMARY KEY, heartbeat TIMESTAMP ); -- Cron on primary every 60 seconds: -- INSERT INTO replication_heartbeat VALUES (@@server_id, NOW()) -- ON DUPLICATE KEY UPDATE heartbeat = NOW(); -- On REPLICA: measure real lag SELECT TIMESTAMPDIFF(SECOND, heartbeat, NOW()) AS real_lag_seconds FROM replication_heartbeat; -- Alert if real_lag_seconds > 60
Prevention
Batch jobs on the primary now run in 50,000-row chunks with a brief sleep between chunks. This keeps individual transactions small enough that the replica's SQL thread can apply each chunk quickly. Seconds_Behind_Master is no longer used as a lag metric — it was removed from the dashboard and replaced with the heartbeat measurement.