Node 2 was forcibly evicted by the Cluster Ready Services stack at 02:14. All 847 sessions on node 2 dropped simultaneously. The cause, confirmed in the CRS alert log, was interconnect heartbeat loss — CSS had missed enough consecutive heartbeats to conclude that node 2 was unresponsive. Node 2 was not unresponsive. The interconnect was.
The Alert
02:14:07. Monitoring: all node 2 connections dropped simultaneously. Application layer: mass reconnection to node 1. CSS alert log: Evicting node 2 (hostname: rac-node2). The database itself was healthy on node 1. Node 2 was being rebooted by CRS as part of the eviction protocol.
First Hypothesis: Node 2 Hardware Failure
Node eviction happens when CSS concludes a node is non-functional. First assumption: node 2 had experienced a hardware fault. We checked the OS-level logs on node 2 after it came back online.
The Discovery
-- CRS alert log location (on each node): -- $GRID_HOME/log/[hostname]/alertclusterware.log -- Look for CSS heartbeat messages around the eviction time: -- grep -A5 -B5 "Evicting" $GRID_HOME/log/rac-node2/alertclusterware.log -- Key indicators in the log: -- "Missed heartbeat from node 2" — repeated -- "Network latency detected on interconnect" -- "Node 2 declared dead" -- Check interconnect performance from AWR: SELECT inst_id, name, value FROM gv$sysstat WHERE name IN ( 'gcs messages sent', 'gcs messages received', 'ges messages sent', 'gc cr block received time' ) ORDER BY inst_id, name;
The CRS alert log showed 1,847ms latency spikes on the interconnect at 02:13:44. CSS's heartbeat timeout was configured at 1,500ms — the default. A single 1,847ms latency spike was enough to miss enough consecutive heartbeats for CSS to declare node 2 dead. The cause of the latency: network maintenance on the interconnect switch — a maintenance window that had been scheduled but not communicated to the DBA team.
Incident Timeline
| Time | Event |
|---|---|
| 02:00 | Network team begins scheduled maintenance on interconnect switch |
| 02:13:44 | Interconnect latency spikes to 1,847ms for 180ms duration |
| 02:13:45 | CSS misses heartbeats from node 2 during spike window |
| 02:14:07 | CSS declares node 2 dead. Eviction begins. |
| 02:14:07 | 847 sessions on node 2 drop. Application reconnects to node 1. |
| 02:14:30 | Node 2 rebooted by CRS eviction protocol. |
| 02:28 | Node 2 rejoins cluster. Traffic redistributed. |
Root Cause
RAC's CSS (Cluster Synchronisation Services) uses a heartbeat mechanism to detect node failures. If CSS misses heartbeats beyond the misscount threshold (default: 600ms × misscount=3 = 1,800ms), it evicts the node. A 1,847ms latency spike on the private interconnect during a network maintenance window exceeded this threshold. CSS correctly followed its protocol — but the threshold was not set for an environment with scheduled maintenance windows that briefly affect interconnect latency.
The Fix
-- Check current CSS timeout settings: -- crsctl get css disktimeout -- crsctl get css misscount -- Increase misscount to tolerate brief latency spikes: -- As grid infrastructure owner (not root, not oracle): -- crsctl set css misscount 30 # 30 × 200ms = 6 seconds tolerance -- This requires cluster restart — plan a maintenance window -- Do NOT simply increase disktimeout without testing: -- Higher values increase the time to detect a real node failure -- Monitor interconnect health proactively: SELECT a.inst_id, b.inst_id AS remote_inst, a.value AS gc_cr_blocks_received, b.value AS gc_cr_block_receive_time_ms FROM gv$sysstat a JOIN gv$sysstat b ON b.name = 'gc cr block receive time' WHERE a.name = 'gc cr blocks received';
Prevention
The network team now notifies the DBA team before any maintenance touching the RAC interconnect switch. A change freeze on the interconnect exists during business-critical windows. Interconnect latency is monitored from both nodes — any latency above 200ms fires an early warning alert before it approaches the CSS eviction threshold.