RAC Node Eviction at 2 AM: The Network Split Nobody Saw Coming

Node 2 was forcibly evicted by the Cluster Ready Services stack at 02:14. All 847 sessions on node 2 dropped simultaneously. The cause, confirmed in the CRS alert log, was interconnect heartbeat loss — CSS had missed enough consecutive heartbeats to conclude that node 2 was unresponsive. Node 2 was not unresponsive. The interconnect was.

The Alert

02:14:07. Monitoring: all node 2 connections dropped simultaneously. Application layer: mass reconnection to node 1. CSS alert log: Evicting node 2 (hostname: rac-node2). The database itself was healthy on node 1. Node 2 was being rebooted by CRS as part of the eviction protocol.

First Hypothesis: Node 2 Hardware Failure

Node eviction happens when CSS concludes a node is non-functional. First assumption: node 2 had experienced a hardware fault. We checked the OS-level logs on node 2 after it came back online.

The Discovery

Read the CRS alert log for the eviction cause

Oracle RAC

-- CRS alert log location (on each node, 11g-style layout):
-- $GRID_HOME/log/[hostname]/alert[hostname].log
-- (12c and later typically use $ORACLE_BASE/diag/crs/[hostname]/crs/trace/alert.log)

-- Look for CSS heartbeat messages around the eviction time:
-- grep -A5 -B5 "Evicting" $GRID_HOME/log/rac-node2/alertrac-node2.log

-- Key indicators in the log:
-- "Missed heartbeat from node 2" — repeated
-- "Network latency detected on interconnect"  
-- "Node 2 declared dead"

-- Check interconnect statistics (gv$sysstat is cumulative since instance startup, not AWR):
SELECT
  inst_id,
  name,
  value
FROM   gv$sysstat
WHERE  name IN (
  'gcs messages sent',
  'gcs messages received',
  'ges messages sent',
  'gc cr block receive time'
)
ORDER BY inst_id, name;

The CRS alert log showed a 1,847 ms latency spike on the interconnect at 02:13:44. CSS's heartbeat timeout was at its default of 1,800 ms, so that single spike lasted long enough for CSS to miss the consecutive heartbeats it needed to declare node 2 dead. The cause of the latency was maintenance on the interconnect switch: a window that had been scheduled but never communicated to the DBA team.

Incident Timeline

Time	Event
02:00	Network team begins scheduled maintenance on interconnect switch
02:13:44	Interconnect latency spikes to 1,847ms for 180ms duration
02:13:45	CSS misses heartbeats from node 2 during spike window
02:14:07	CSS declares node 2 dead. Eviction begins.
02:14:07	847 sessions on node 2 drop. Application reconnects to node 1.
02:14:30	Node 2 rebooted by CRS eviction protocol.
02:28	Node 2 rejoins cluster. Traffic redistributed.
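
The gaps in this timeline are easier to compare once the timestamps are converted to seconds. A minimal shell sketch, assuming all events fall on the same day and treating the 02:28 rejoin as 02:28:00 (the timeline gives no seconds for it); `to_seconds` and `elapsed` are hypothetical helper names:

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    # awk reads each field as a decimal number, so leading zeros are safe
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds elapsed between two same-day timestamps (start, end).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

echo "spike to eviction:  $(elapsed 02:13:44 02:14:07) s"
echo "eviction to rejoin: $(elapsed 02:14:07 02:28:00) s"
```

On these figures, the eviction followed the spike by 23 seconds, and node 2 was out of the cluster for roughly 833 seconds (just under 14 minutes).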

Root Cause

RAC's CSS (Cluster Synchronisation Services) uses a heartbeat mechanism to detect node failures. If CSS misses heartbeats beyond the misscount threshold (default: 600ms × misscount=3 = 1,800ms), it evicts the node. A 1,847ms latency spike on the private interconnect during a network maintenance window exceeded this threshold. CSS correctly followed its protocol — but the threshold was not set for an environment with scheduled maintenance windows that briefly affect interconnect latency.
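
The threshold arithmetic above can be checked in a few lines. This sketch only restates the figures quoted in this write-up (600 ms interval, misscount of 3, 1,847 ms spike) as plain integers; the function names are hypothetical, and nothing here queries the cluster, so confirm the real values with crsctl on your own system:

```shell
#!/bin/sh
# Eviction threshold in ms = heartbeat interval (ms) x misscount.
threshold_ms() {
    echo $(( $1 * $2 ))
}

# How far a spike exceeds the threshold, in ms (negative means headroom).
# Args: spike_ms interval_ms misscount
margin_ms() {
    limit=$(threshold_ms "$2" "$3")
    echo $(( $1 - limit ))
}

echo "threshold: $(threshold_ms 600 3) ms"
echo "margin:    $(margin_ms 1847 600 3) ms"
```

With those inputs the spike overshoots the threshold by only 47 ms, which is why such a brief event was enough to trigger the eviction.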

The Fix

Adjust CSS misscount and implement a change notification process

Oracle RAC

-- Check current CSS timeout settings:
-- crsctl get css disktimeout
-- crsctl get css misscount

-- Increase misscount to tolerate brief latency spikes:
-- Run with Clusterware administrative privileges (Oracle's procedure uses root; verify for your release):
-- crsctl set css misscount 30   # crsctl expresses misscount in seconds; check the current value first
-- This requires cluster restart — plan a maintenance window

-- Do NOT simply increase misscount without testing:
-- Higher values increase the time to detect a real node failure

-- Monitor interconnect health proactively:
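-- Pair the two statistics per instance; receive time is reported in centiseconds
SELECT
  a.inst_id,
  a.value      AS gc_cr_blocks_received,
  b.value      AS gc_cr_block_receive_time_cs,
  ROUND(10 * b.value / NULLIF(a.value, 0), 2) AS avg_cr_receive_ms
FROM   gv$sysstat a
JOIN   gv$sysstat b
       ON  b.inst_id = a.inst_id
       AND b.name    = 'gc cr block receive time'
WHERE  a.name = 'gc cr blocks received'
ORDER BY a.inst_id;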
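
Because gv$sysstat values are cumulative since instance startup, a single reading averages away short spikes; the difference between two samples is more telling. A minimal sketch of that arithmetic, assuming the receive-time statistic is in centiseconds (hence the factor of 10) and using made-up sample values; `avg_cr_receive_ms` is a hypothetical helper, not an Oracle tool:

```shell
#!/bin/sh
# Average CR block receive time (ms) between two samples of one instance.
# Args: blocks_t1 time_cs_t1 blocks_t2 time_cs_t2
avg_cr_receive_ms() {
    awk -v b1="$1" -v t1="$2" -v b2="$3" -v t2="$4" 'BEGIN {
        blocks = b2 - b1
        # No new blocks (or a counter reset) means no meaningful average
        if (blocks <= 0) { print "n/a"; exit }
        printf "%.2f\n", 10 * (t2 - t1) / blocks
    }'
}

# Hypothetical samples taken a minute apart: 500 blocks, 100 cs of receive time
avg_cr_receive_ms 100000 5000 100500 5100
```

Sampling both nodes on a short interval and alerting on the delta, rather than the lifetime average, is what lets a brief interconnect problem show up at all.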

Prevention

The network team now notifies the DBA team before any maintenance touching the RAC interconnect switch. A change freeze on the interconnect exists during business-critical windows. Interconnect latency is monitored from both nodes — any latency above 200ms fires an early warning alert before it approaches the CSS eviction threshold.
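
The early-warning rule can be sketched as a small classifier. This assumes latency has already been measured as an integer number of milliseconds by whatever probe the monitoring system uses; the 200 ms warning level and the 1,800 ms eviction threshold are the figures from this write-up, and `classify_latency` is a hypothetical name:

```shell
#!/bin/sh
# Thresholds in ms, taken from the figures quoted in this write-up.
WARN_MS=200
EVICT_MS=1800

# Classify one latency measurement (integer ms).
classify_latency() {
    if [ "$1" -ge "$EVICT_MS" ]; then
        echo "CRITICAL"   # at or beyond the eviction threshold
    elif [ "$1" -gt "$WARN_MS" ]; then
        echo "WARNING"    # above the early-warning level
    else
        echo "OK"
    fi
}

for ms in 3 250 1847; do
    echo "$ms ms -> $(classify_latency "$ms")"
done
```

Keeping the warning level far below the eviction threshold is the point: the alert should fire while there is still time to react.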

Raj Nair

Oracle DBA · QueryTuning

Raj has spent 10 years managing Oracle OLTP and data warehouse databases across single-instance and RAC environments. He specialises in memory tuning, cluster management, and Oracle internals.
