Our Read Replica Was 14 Hours Behind — Seconds_Behind_M

The read replica had been processing a single transaction for 14 hours. Seconds_Behind_Master reported 0 throughout. Every application component reading from the replica was getting data that was 14 hours old. The metric everyone was watching was lying.

The Alert

A customer reported that completed orders were not showing in their account history. The application read order history from the replica. The primary showed the orders as committed. The replica did not. Seconds_Behind_Master: 0. No replication errors in the MySQL error log.

First Hypothesis: Replication Broken

If Seconds_Behind_Master is 0 but data is missing, replication had stopped and the slave status was wrong. We checked SHOW SLAVE STATUS fully.

Full slave status — read every field

MySQL

SHOW SLAVE STATUS\G

-- Key fields to check:
-- Slave_IO_Running: Yes   (IO thread receiving binlog)
-- Slave_SQL_Running: Yes  (SQL thread applying events)
-- Seconds_Behind_Master: 0
-- Last_SQL_Errno: 0
-- Exec_Master_Log_Pos vs Read_Master_Log_Pos

IO thread: running. SQL thread: running. Seconds_Behind_Master: 0. No errors. But Exec_Master_Log_Pos had not changed in 14 hours. The SQL thread was applying one transaction that had been running since 14:23 the day before.

The Discovery

The real lag measurement — not Seconds_Behind_Master

MySQL

-- Seconds_Behind_Master measures lag in binlog positions,
-- not wall-clock time behind the primary.
-- A single long-running transaction shows 0 while still applying.
-- The real measure:

-- On PRIMARY:
SHOW MASTER STATUS;
-- Record: File, Position

-- On REPLICA:
SHOW SLAVE STATUS\G
-- Compare: Relay_Master_Log_File + Exec_Master_Log_Pos
-- If master Position >> Exec_Master_Log_Pos, replica is behind
-- even if Seconds_Behind_Master = 0

-- Check what the SQL thread is currently applying:
SELECT * FROM performance_schema.events_statements_current
WHERE  processlist_user = 'system user'\G

The SQL thread was applying a mass UPDATE — 8 million rows, part of a batch job that had run on the primary at 14:23. On the primary it had committed in 4 minutes. On the replica it was single-threaded and had been running for 14 hours. MySQL 5.7 single-threaded SQL replication serializes all events through one thread.

Incident Timeline

Time	Event
Primary 14:23	Mass UPDATE: 8 million rows modified in one transaction. Commits in 4 minutes.
Replica 14:23	SQL thread begins applying the transaction. Single-threaded.
Replica 14:23-next day	SQL thread still applying. Seconds_Behind_Master: 0. Position frozen.
Next day 04:00	Customer reports stale data. Investigation begins.
+30 min	Exec_Master_Log_Pos mismatch identified. 14-hour applying transaction found.
+2 hours	Transaction completes on replica. Replica catches up immediately.

Root Cause

Seconds_Behind_Master measures the timestamp difference between the last event the replica applied and the current master time. While the replica is actively applying a transaction, the metric reports the time of the event being applied — not how far behind the replica is in wall-clock terms. A 14-hour transaction applying on the replica shows Seconds_Behind_Master = 0 because the replica is actively executing.

The Fix

Real-time replication lag measurement using heartbeat

MySQL

-- On PRIMARY: create a heartbeat table and update it every minute
CREATE TABLE replication_heartbeat (
  server_id   INT PRIMARY KEY,
  heartbeat   TIMESTAMP
);

-- Cron on primary every 60 seconds:
-- INSERT INTO replication_heartbeat VALUES (@@server_id, NOW())
-- ON DUPLICATE KEY UPDATE heartbeat = NOW();

-- On REPLICA: measure real lag
SELECT
  TIMESTAMPDIFF(SECOND, heartbeat, NOW()) AS real_lag_seconds
FROM   replication_heartbeat;

-- Alert if real_lag_seconds > 60

Prevention

Batch jobs on the primary now run in 50,000-row chunks with a brief sleep between chunks. This keeps individual transactions small enough that the replica's SQL thread can apply each chunk quickly. Seconds_Behind_Master is no longer used as a lag metric — it was removed from the dashboard and replaced with the heartbeat measurement.

Ananya Menon

MySQL DBA · 9 years production · QueryTuning

Ananya has spent 9 years managing MySQL and MariaDB installations for SaaS companies. She specialises in replication architecture, binary log management, and the failure modes that only appear under production load.

Our Read Replica Was 14 Hours Behind — and No Alert Fired

The Alert

First Hypothesis: Replication Broken

The Discovery

Incident Timeline

Root Cause

The Fix

Prevention

The Same Problem on Other Engines