Eleven Tuesdays in a row, at 14:07, the same report failed with ORA-01555: snapshot too old. Each time, the database looked healthy. Undo was available. The session count was normal. By the time anyone investigated, the evidence had been overwritten by the undo retention mechanism itself. Week twelve, we caught it live.
The Alert
ORA-01555 means Oracle could not reconstruct a consistent read image of a block because the undo data needed had been overwritten. The report ran for 22 minutes. During those 22 minutes, a batch job that ran every Tuesday at 14:00 was generating undo at 40x the normal rate. The report's consistent read view needed undo data from 14:00. The batch job had overwritten it by 14:22.
First Hypothesis: Undo Tablespace Too Small
The first hypothesis was that the undo tablespace was undersized. We checked V$UNDOSTAT to see undo generation and retention.
SELECT begin_time, end_time, undoblks, -- undo blocks generated in this period txncount, -- transaction count maxquerylen, -- longest query in seconds ssolderrcnt, -- ORA-01555 errors in this period nospaceerrcnt, -- ORA-30036 (undo space full) activeblks, -- active undo blocks unexpiredblks -- unexpired (retained) undo blocks FROM v$undostat ORDER BY begin_time DESC FETCH FIRST 24 ROWS ONLY;
The data showed a clear spike: at 14:00 every Tuesday, undoblks jumped from ~200 per 10-minute period to ~8,400. The batch job — a financial reconciliation process — was performing 400,000 DELETE + INSERT pairs. The maxquerylen during the spike was 1,340 seconds — the report. The undo needed for the report's consistent read was being overwritten by the batch job's DELETE operations.
Incident Timeline
| Time | Event |
|---|---|
| 14:00 (every Tue) | Financial reconciliation batch begins. 400K DELETEs + INSERTs. |
| 14:07 | Long-running report starts. Needs consistent view from 14:00. |
| 14:07-14:22 | Batch overwrites undo blocks needed by the report's read view. |
| 14:22-14:29 | Report queries blocks whose undo has been overwritten. |
| 14:29 | ORA-01555: snapshot too old. Report fails. |
| Week 12 | V$UNDOSTAT queried live during the batch. Spike pattern confirmed. |
| Week 12 + 1 hour | UNDO_RETENTION raised from 900s to 3600s |
| Week 13 | Report completes. No ORA-01555. |
Root Cause
UNDO_RETENTION was set to 900 seconds (15 minutes). The report ran for 22 minutes. The batch job generated enough undo volume to overwrite 15+ minutes of undo history in under 8 minutes. The report needed undo data from its start time, which was more than 900 seconds old by the time it encountered the overwritten blocks.
The Fix
-- Raise retention beyond the longest expected query duration: ALTER SYSTEM SET undo_retention = 3600; -- 1 hour -- Enable guaranteed retention on the undo tablespace -- (prevents undo from being overwritten even under space pressure): ALTER TABLESPACE undotbs1 RETENTION GUARANTEE; -- Monitor undo adequacy: SELECT maxquerylen, ssolderrcnt, ROUND(undoblks * 8192 / 1024/1024, 2) AS undo_mb_per_period FROM v$undostat ORDER BY begin_time DESC FETCH FIRST 6 ROWS ONLY;
Prevention
UNDO_RETENTION is now set to 2x the expected maximum query duration. The batch job was rescheduled to 02:00 on Wednesday — outside of business hours, away from long-running reports. V$UNDOSTAT is monitored every 10 minutes for ssolderrcnt increments — any non-zero value fires an alert before the next failure.