Saturday, March 7, 2009

Linux, Oracle RAC and the Bonding conundrum

I recently ran into a problem where, after patching a 4-node Linux RAC cluster, 2 of the 4 instances would not start up. I could not start them manually either.

Both querying CRS and trying to restart it hung.

On closer inspection, crsd.log had entries like:

2009-03-05 23:38:28.600: [ CRSRTI][2541399584]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2009-03-05 23:38:29.814: [ COMMCRS][1084229984]clsc_connect: (0xb76f00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_tus1dwhdbssex04_CODSS))

2009-03-05 23:38:29.814: [ CSSCLNT][2541399584]clsssInitNative: connect failed, rc 9

I checked the ocssd log and it had entries like these:

[ CSSD]2009-03-05 23:40:01.566 [1273047392] >ERROR: clssgmSlaveCMSync: reconfig timeout on master 1
[ CSSD]2009-03-05 23:40:01.566 [1273047392] >TRACE: clssgmReconfigThread: completed for reconfig(16), with status(0)
[ CSSD]2009-03-05 23:44:31.610 [2538397376] >ERROR: clssgmStartNMMon: reconfig incarn 16 failed. Retrying.

Searching Metalink turned up no hits, and neither did Google.

The CRS alert log showed that the voting disks were online, and the system showed that the interconnects were up. The cluster was set up to use network interfaces bonded together in Active-Passive mode (mode=1).

I tried the usual methods: deleting the pipes in /var/tmp/.oracle and restarting the systems a couple of times, but that did not fix the problem. Any attempt to restart crsd, whether from /etc/init.d/init.crsd or with crsctl, failed. The commands would simply hang, so the nodes had to be forcibly restarted, either by rebooting or by killing the crs/ocss daemons.

Finally, I checked the active member of the bond interface used for the cluster interconnect and found that on 2 of the nodes the active interface was different from the one on the other 2 nodes. You can identify the active interface by looking at the bond's status file under /proc/net/bonding (bond1 in this case).

[oracodss@tus1dwhdbssex02] /apps/oracle/home/crs10203/log/tus1dwhdbssex02$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth5
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:1c:23:bf:6e:73

Slave Interface: eth5
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:15:17:49:75:33

On the 2 nodes that were up, eth5 was the currently active slave, whereas on the 2 down nodes eth1 was active.
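
To pull just that one field out on each node, something like the following would do. It is only a sketch: the function name active_slave is made up for illustration, and it assumes the status file contains a "Currently Active Slave:" line in the format shown above.

```shell
# Print the currently active slave recorded in a bonding status file.
# Usage: active_slave [status-file]
# The default path /proc/net/bonding/bond1 matches the bond used in this post.
active_slave() {
    # Fields are separated by ": ", so $2 is the interface name.
    awk -F': ' '/^Currently Active Slave:/ { print $2; exit }' \
        "${1:-/proc/net/bonding/bond1}"
}
```

Calling active_slave with no argument on a node reads /proc/net/bonding/bond1 and prints the interface name only.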

So I deleted all the pipes in /var/tmp/.oracle, rebooted the 2 down nodes and changed the active slave on both of them to eth5. The instances came back up immediately.

You can change the active slave at runtime with ifenslave -c, or pin a preferred (primary) slave by hard-coding the interface in the module options used when bringing up the bond interface.

# ifenslave -c bond1 eth5

install bond1 /sbin/modprobe bonding -o bond1 miimon=100 mode=1 primary=eth5
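
A small wrapper along these lines can check the current state before switching. It is a sketch under assumptions: plan_active_slave is a hypothetical name, the status file is assumed to live at /proc/net/bonding/<bond>, and the function only prints the ifenslave command rather than running it, so nothing changes until you run the output yourself as root.

```shell
# Print the ifenslave command needed to make WANT the active slave of BOND,
# or print nothing if WANT is already active.
# Usage: plan_active_slave BOND WANT [status-file]
# The body runs in a subshell so its variables do not leak into the caller.
plan_active_slave() (
    bond=$1
    want=$2
    status=${3:-/proc/net/bonding/$bond}
    current=$(awk -F': ' '/^Currently Active Slave:/ { print $2; exit }' "$status")
    if [ "$current" != "$want" ]; then
        echo "ifenslave -c $bond $want"
    fi
)
```

On a node where eth1 is active, plan_active_slave bond1 eth5 prints the same ifenslave line shown above; on a node where eth5 is already active it prints nothing.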

Active-Passive mode is a common bonding method that provides no load balancing, only failover when a link fails. In a typical HA environment, each interface in the bond is cabled to a different switch, which provides switch-level failover as well. In Active-Passive mode the passive link does not even ARP, so you will not see any MAC address on the switch port for that interface. The two interfaces are completely isolated from each other.

Imagine a scenario in which Node A uses eth1 as its active interface and Node B uses eth5. Even though both nodes are cabled to the same switches and all links are up, Node A will not be able to communicate with Node B, presumably because the two switches have no path between them.
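
A mismatch like this is easy to miss when you look at one node at a time. The following sketch flags it from a list of "node interface" pairs. The function name check_active_slaves and the hostnames in the comment are placeholders, not anything from the cluster described here.

```shell
# Read "node interface" pairs on stdin, print each distinct active
# interface with the nodes using it, and return non-zero if the nodes
# do not all agree on one interface.
#
# Example of feeding it (node1..node4 are placeholder hostnames):
#   for n in node1 node2 node3 node4; do
#       printf '%s %s\n' "$n" "$(ssh "$n" cat /proc/net/bonding/bond1 |
#           awk -F': ' '/^Currently Active Slave:/ { print $2 }')"
#   done | check_active_slaves
check_active_slaves() {
    awk '
        NF >= 2 { nodes[$2] = nodes[$2] " " $1 }
        END {
            n = 0
            for (i in nodes) { n++; print i ":" nodes[i] }
            exit (n > 1)
        }'
}
```

A non-zero exit status means at least two nodes have different active slaves on the interconnect bond, which is the situation described above.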

Here is the funny part: if you set up the cluster interconnect to point to a non-existent interface, I have seen Oracle use the public interface for cluster communications. The instance alert logs will contain entries saying that the cluster interconnect is non-existent/incorrect and that the public interface is being used instead.

However, if you set the cluster interconnect to an interface that exists and is up, but the other nodes cannot be reached over it, Oracle does not try the public interface to check connectivity (at least in the scenario where the voting disks are online/available). Maybe this is a bug? Not surprising at all.

One would expect a failure of the interconnects while the voting disks are online/available to produce clearer error messages than the cryptic ones that fill up the CRS logs. This is another kind of split-brain condition and should probably be well documented.
Maybe Oracle needs to learn from VCS or other HA products which have been on the market longer and are more stable.
