Sunday, March 8, 2009

Temporary Tablespaces on Oracle - going the extra mile

During a recent performance debugging session on a Linux 10.2.0.3 RAC cluster, I noticed that reads and writes to the temp tablespace were capped at 240K per IO operation. Even though direct path reads were happening at 1024K per IO, the IO size to the temp tablespaces did not exceed 240K.

Looking closer, the IO sizes varied between 48K and 240K. A 10046 trace also showed that Oracle was issuing reads and writes at a maximum of 15 blocks and a minimum of 3 blocks per request. The db block size was 16K, so 15 blocks equals a 240K IO and 3 blocks equals 48K.
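If you want to see this for yourself, a sketch along these lines works - the assumptions being that you can run the spilling SQL in a session you control and that you pick up the resulting trace file from user_dump_dest. On 10.2, p3 of the 'direct path read temp'/'direct path write temp' waits should be the block count per IO request.

-------------------------------------------------
sqlplus -s / as sysdba <<'EOF'
-- level 8 includes wait events in the trace
alter session set events '10046 trace name context forever, level 8';
-- ... run the sort / hash-join SQL that spills to temp here ...
alter session set events '10046 trace name context off';
EOF

# p3 on the direct path temp waits = blocks per IO request
grep "direct path .* temp" <trace file from user_dump_dest> | head
-------------------------------------------------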

At first I thought this was a limitation of the raw volumes we were using for the temporary tablespaces, but a dd test against the raw volumes showed that we could go up to 4M per IO without the request getting broken up (4M was the limit set at the volume level).
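The dd test was along these lines - a sketch, with the raw device name a placeholder for the volume backing one of the temp files (raw devices bypass the page cache, so no O_DIRECT flag is needed here):

-------------------------------------------------
# read 1000 x 4M requests straight off the raw volume
dd if=/dev/raw/raw1 of=/dev/null bs=4M count=1000

# watch the request sizes in another window; avgrq-sz is in 512-byte
# sectors, so unsplit 4M requests show up as ~8192
iostat -x 2
-------------------------------------------------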

I opened a TAR with Oracle and it did not get me anywhere. I found it odd that Oracle would issue 15-block IOs - 15 is an odd number and not Oracle's style. By sheer luck, while going through the hidden parameter list I came across two parameters:

_smm_auto_max_io_size  240  Maximum IO size (in KB) used by sort/hash-join in auto mode

_smm_auto_min_io_size   48  Minimum IO size (in KB) used by sort/hash-join in auto mode
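If you want to check the current values on your own system, a query along these lines against the underlying x$ views works (run as SYS - this is just a sketch of the usual hidden-parameter lookup):

-------------------------------------------------
sqlplus -s / as sysdba <<'EOF'
select p.ksppinm  parameter,
       v.ksppstvl value,
       p.ksppdesc description
  from x$ksppi p, x$ksppcv v
 where p.indx = v.indx
   and p.ksppinm like '#_smm#_auto#_%io#_size' escape '#';
EOF
-------------------------------------------------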

I had found the problem - Oracle caps the IO size for hash joins and sorts that spill to disk at 240K, with a minimum of 48K. That fit perfectly with the behavior I had noticed.

Note that the parameter descriptions say "in auto mode" - these settings take effect only if your work area policy is set to auto. The older parameters - _hash_multiblock_io_count/_sort_multiblock_read_count - have no effect under the auto work area policy and probably still control the hash/sort behavior in manual mode.
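A quick way to confirm which mode you are in - if workarea_size_policy is AUTO and pga_aggregate_target is set, the _smm_auto_* limits are the ones that apply:

-------------------------------------------------
sqlplus -s / as sysdba <<'EOF'
show parameter workarea_size_policy
show parameter pga_aggregate_target
EOF
-------------------------------------------------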

To me, 240K and 48K are on the lower end of a suitable IO size. I can understand why Oracle keeps the values this low - IO to a shared temp tablespace is very much random, and bigger IO sizes can end up slowing the operation down. But if your temp tablespaces are well designed, you can easily use bigger IO requests.

With bigger requests you reduce the number of waits, but there is a catch - you end up spending more time on each individual wait. Somewhere in between is a sweet spot where the trade-off works in your favor.
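As a rough illustration with made-up numbers: spilling 10GB to temp at 240K per request is around 43,700 IO requests, while the same spill at 1M per request is around 10,240 - but each of those 1M requests takes longer to service than a 240K one, so the total elapsed time can move either way.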

Our Linux RAC cluster was using a temporary tablespace group of 4 tablespaces and had a well designed IO subsystem. Each temp tablespace had a single temp file (a raw volume) of 62GB, and there were 4 such temp tablespaces. Since this was a 4 node RAC cluster, we configured 4 tablespaces on the assumption that each node would prefer a separate tablespace (node affinity).

Each raw volume was a RAID 1+0 configuration across 4 x 143GB, 15K RPM drives. The array was a Hitachi AMS1000 with 16GB of cache, and the stripe width on the RAID 1+0 volume was 1M. A lot of space was wasted, but we were after performance, so space was not a concern. In total, ~250GB of temp space was laid out across 16 drives (143GB, 15K RPM FC).

To test the performance with different IO sizes, I took one of our "Taste the Thunder" SQLs, which inserts ~460 million rows after doing some large hash joins on tables with a billion-plus rows. This SQL normally takes around 27 minutes to complete. We use parallel DML, and the insert is split across all 4 nodes.
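Changing the settings between runs looks something like this - a sketch for the 1024K/512K case. These are hidden parameters, so test carefully and ideally clear the change with Oracle Support; depending on the version they may not be dynamically modifiable, in which case the instances need a bounce.

-------------------------------------------------
sqlplus -s / as sysdba <<'EOF'
alter system set "_smm_auto_max_io_size" = 1024 scope=spfile sid='*';
alter system set "_smm_auto_min_io_size" = 512  scope=spfile sid='*';
EOF
# bounce the instances so the new values take effect
-------------------------------------------------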

[Screenshot: results with the default settings of 240K/48K]

[Screenshot: results after changing max/min to 1024K/512K]

[Screenshot: results after changing max/min to 1024K/1024K]


The following summarizes the findings:

The best results were when we set the max/min to 1024K/512K - an improvement of 22% over the default.

As expected, as the IO size increases the number of waits drops, but the average wait time goes up. The sweet spot in our case was the max/min setting of 1024K/512K.

With a max/min setting of 1024K/1024K, the reads were very fast, but the write times shot up dramatically - probably the array did not have enough cache to absorb such large writes.

As always, it is best to test and see how it affects performance.

Saturday, March 7, 2009

Linux, Oracle RAC and the Bonding conundrum

I recently ran into a problem where, after patching a 4 node Linux RAC cluster, 2 of the 4 instances would not start up. I could not start the instances manually either.

Both querying CRS and trying to restart it simply hung.

On a closer look, crsd.log had entries like these:

------------------------------------------------------
2009-03-05 23:38:28.600: [ CRSRTI][2541399584]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2009-03-05 23:38:29.814: [ COMMCRS][1084229984]clsc_connect: (0xb76f00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_tus1dwhdbssex04_CODSS))

2009-03-05 23:38:29.814: [ CSSCLNT][2541399584]clsssInitNative: connect failed, rc 9


------------------------------------------------------
I then checked the OCSSD log, which had entries like the following:

-------------------------------------------------------
[ CSSD]2009-03-05 23:40:01.566 [1273047392] >ERROR: clssgmSlaveCMSync: reconfig timeout on master 1
[ CSSD]2009-03-05 23:40:01.566 [1273047392] >TRACE: clssgmReconfigThread: completed for reconfig(16), with status(0)
[ CSSD]2009-03-05 23:44:31.610 [2538397376] >ERROR: clssgmStartNMMon: reconfig incarn 16 failed. Retrying.
----------------------------------------------------------


Searching Metalink turned up no hits, and neither did Google.

The CRS alert log showed that the voting disks were online, and the OS showed that the interconnects were up. The cluster was set up to use network interfaces bonded together in active-passive mode (mode=1).

I tried the usual tricks - deleting the pipes in /var/tmp/.oracle and restarting the systems a couple of times - but that did not fix the problem. Any attempt to restart crsd via /etc/init.d/init.crsd or crsctl failed; the commands would simply hang, and the nodes had to be restarted forcefully - either by rebooting or by killing the crs/ocss daemons.

Finally, I checked the active member of the bond interface used for the cluster interconnect and found that on 2 of the nodes the active interface was different from the other 2. You can identify the active interface by looking at the /proc/net/bonding/bond<interface> file (bond1 in our case).

---------------------
[oracodss@tus1dwhdbssex02] /apps/oracle/home/crs10203/log/tus1dwhdbssex02$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth5
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:1c:23:bf:6e:73

Slave Interface: eth5
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:15:17:49:75:33
-------------------------------------------------

On the 2 nodes that were up, eth5 was the currently active slave, whereas on the 2 down nodes eth1 was the active slave.
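A quick way to spot the mismatch across the cluster (the node names below are placeholders):

-------------------------------------------------
for h in node1 node2 node3 node4; do
    echo -n "$h: "
    ssh $h "grep 'Currently Active Slave' /proc/net/bonding/bond1"
done
-------------------------------------------------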

So I deleted all the pipes in /var/tmp/.oracle, rebooted the 2 down nodes, and changed the active slave on those 2 nodes to eth5. The instances came back up immediately.

You can change the active slave on the fly with ifenslave, or pin it by naming the primary interface when the bonding module for the bond is loaded.

# ifenslave -c bond1 eth5

------modprobe.conf-----------
install bond1 /sbin/modprobe bonding -o bond1 miimon=100 mode=1 primary=eth5
--------------------------------

Active-passive (mode=1) is a common bonding mode in which there is no load balancing, only failover in case a link fails. In a typical HA environment, each interface in the bond is cabled to a different switch, which gives you switch-level failover as well. In active-passive mode the passive link does not even ARP, so you will not see any MAC address on the switch port for that interface. There is complete isolation between interface 1 and interface 2.
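For reference, a bond like this is typically defined along these lines on RHEL (the values are illustrative, not our actual config):

------ /etc/sysconfig/network-scripts/ifcfg-bond1 ------
DEVICE=bond1
IPADDR=192.168.10.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

------ /etc/sysconfig/network-scripts/ifcfg-eth1 (eth5 identical apart from DEVICE) ------
DEVICE=eth1
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
--------------------------------------------------------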

Imagine a scenario in which Node A has eth1 as its active slave and Node B has eth5. Even though the nodes are cabled to the same switches and all the links are up, Node A will not be able to communicate with Node B.
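A simple sanity check from Node A before touching CRS is to ping Node B's interconnect address over the bond (the IP below is a placeholder):

-------------------------------------------------
ping -c 3 -I bond1 192.168.10.2
-------------------------------------------------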

Here is the funny part - if you point the cluster interconnect at a non-existent interface, I have seen Oracle fall back to the public interface for cluster communication. The instance alert logs will note that the cluster interconnect is non-existent/incorrect and that the public interface is being used instead.

However, if the cluster interconnect points to an interface that exists and is up but cannot reach the other nodes, Oracle does not try the public interface at all (even when the voting disks are online and available). Maybe this is a bug? Not that it would surprise me.
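Either way, it is worth confirming what the cluster actually thinks the interconnect is; a couple of quick checks (the CRS home path is a placeholder for your environment):

-------------------------------------------------
# what CRS has registered as the cluster interconnect
$ORA_CRS_HOME/bin/oifcfg getif

# what the running instances are actually using
sqlplus -s / as sysdba <<'EOF'
select * from gv$cluster_interconnects;
EOF
-------------------------------------------------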

One would assume that a failure of the interconnects while the voting disks are online and available should produce clearer error messages than the cryptic entries that fill the CRS logs. This is another kind of split-brain condition and should probably be well documented.
Maybe Oracle needs to learn from VCS or other HA products which have been in the market longer and are more stable.