Sunday, March 8, 2009

Temporary Tablespaces on Oracle - going the extra mile

During a recent performance debug session on an Oracle 10.2.0.3 RAC cluster running on Linux, I noticed that reads/writes to the temp tablespace never exceeded 240K per IO request (IOP). Direct path reads elsewhere were happening at 1024K/IOP, but the IOP size to the temp tablespaces stayed at or below 240K.

I noticed that IO sizes varied between 48K and 240K. A 10046 trace also revealed that Oracle was issuing reads/writes at the max of 15 blocks and min of 3 blocks. The db block size was 16K and thus 15 blocks equals 240K IOP size and 3 blocks equals 48K.
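The block-count arithmetic above can be checked against a trace with a few lines of Python. This is only a sketch: the sample lines are hypothetical, written in the 10.2-style WAIT format where the third wait parameter of a direct path temp wait is `block cnt`. They are not copied from the actual 10046 trace, and the helper name `temp_io_sizes_kb` is my own.

```python
import re
from collections import Counter

DB_BLOCK_SIZE_KB = 16  # db block size stated in the post

# Hypothetical sample lines in the 10.2-style WAIT format (illustration only).
SAMPLE_TRACE = """\
WAIT #3: nam='direct path write temp' ela= 812 file number=201 first dba=1033 block cnt=15 obj#=-1 tim=1
WAIT #3: nam='direct path read temp' ela= 640 file number=201 first dba=1033 block cnt=15 obj#=-1 tim=2
WAIT #3: nam='direct path read temp' ela= 210 file number=201 first dba=2057 block cnt=3 obj#=-1 tim=3
WAIT #3: nam='db file sequential read' ela= 95 file#=7 block#=4411 blocks=1 obj#=5120 tim=4
"""

# Match only temp direct path waits and capture their block count.
WAIT_RE = re.compile(r"nam='(direct path (?:read|write) temp)'.*?block cnt=(\d+)")


def temp_io_sizes_kb(trace_text, block_size_kb=DB_BLOCK_SIZE_KB):
    """Return a Counter mapping temp IO request size (KB) to number of waits."""
    sizes = Counter()
    for line in trace_text.splitlines():
        match = WAIT_RE.search(line)
        if match:
            sizes[int(match.group(2)) * block_size_kb] += 1
    return sizes


sizes = temp_io_sizes_kb(SAMPLE_TRACE)
for kb in sorted(sizes):
    print(f"{kb}K: {sizes[kb]} waits")
```

Run against a real trace file, the smallest and largest keys should line up with the 3 block and 15 block requests described above, assuming the trace uses the same wait parameter names.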

At first, I thought this was a limitation of the raw volumes we were using for the temporary tablespaces, but a test using dd against the raw volumes showed that we could go up to 4M/IOP without the IO request getting broken up (this was the limit set at the volume level).

I opened a TAR with Oracle and it did not get me anywhere. I found it odd that Oracle would use 15 block sized IOPs. 15 is an odd number and not Oracle style. Just by sheer luck, I was going through the hidden parameters list and came across 2 parameters:

_smm_auto_max_io_size    240    Maximum IO size (in KB) used by sort/hash-join in auto mode

_smm_auto_min_io_size    48     Minimum IO size (in KB) used by sort/hash-join in auto mode

I had found the problem - Oracle caps the IO size for hash-joins and sorts that spill to disk. The minimum IO size is 48K and the maximum is 240K. It fit in perfectly with the behavior I noticed.

Note that the parameter description mentions "in auto mode" - so if your work area policy is set to auto, then these settings will take effect. The older settings - _hash_multiblock_io_count/_sort_multiblock_read_count - have no effect when using the auto work area policy and probably still control the hash/sort behaviour in manual mode.

To me, 240K and 48K are on the lower end of a suitable size for an IOP. I can understand why Oracle would keep it at such low values - IO to the temp tablespaces (if shared) is very much random, and with bigger IOP sizes it can slow down your operation. But if your temp tablespaces are well designed, then you could easily use bigger sized IOP requests.

With bigger sized requests, you reduce the number of waits. But there is a catch - you end up spending more time on each wait. There will be a sweet spot where the total wait time (number of waits times average wait) is at its lowest.
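The trade-off can be put into numbers with a rough sketch. The spill volume below is hypothetical and the function names are my own. The only fact used is that total wait time is the number of requests multiplied by the average wait per request, so a larger IOP size wins only while the average wait grows by less than the ratio of the two sizes.

```python
import math


def requests_needed(spill_kb, io_size_kb):
    """Number of IO requests needed to move spill_kb at a fixed request size."""
    return math.ceil(spill_kb / io_size_kb)


def break_even_wait_ratio(small_io_kb, large_io_kb):
    """How many times longer a large request may take, on average, before
    its total wait time exceeds that of the small request size."""
    return large_io_kb / small_io_kb


SPILL_KB = 100 * 1024 * 1024  # hypothetical: 100GB spilled to temp

for size_kb in (48, 240, 512, 1024):
    print(f"{size_kb:>5}K requests: {requests_needed(SPILL_KB, size_kb)}")

ratio = break_even_wait_ratio(240, 1024)
print(f"1024K beats 240K while its average wait is under {ratio:.2f}x the 240K wait")
```

This ignores everything that makes the average wait grow faster than the request size (cache pressure on the array, contention on a shared temp tablespace), which is exactly where the sweet spot comes from in practice.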

Our Linux RAC cluster was using temporary tablespace groups (4 tablespaces) and had a well designed IO subsystem. Each temp tablespace had a single temp file (raw volume) of 62GB and there were 4 such temp tablespaces. Being a 4 node RAC cluster, we had configured 4 tablespaces assuming that each node would prefer a separate tablespace (node affinity).

Each of the raw volumes was a RAID 1+0 configuration using 4*143GB, 15K RPM drives. The array was a Hitachi AMS1000 with 16GB of cache. The stripe width was 1M on the RAID 1+0 volume. A lot of space was wasted, but we were interested in performance and so space was not a concern. So ~250GB of temp space was laid out across 16 drives (143GB, 15K RPM FC).
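To put a number on the wasted space, here is the layout as quick arithmetic. It uses nominal drive sizes and assumes RAID 1+0 leaves half of the raw capacity usable. Real formatted capacity would be somewhat lower.

```python
DRIVE_GB = 143          # nominal drive size from the post
DRIVES_PER_VOLUME = 4   # each raw volume is a 4-drive RAID 1+0 set
VOLUMES = 4             # one raw volume per temp tablespace
TEMPFILE_GB = 62        # one temp file per tablespace

total_drives = DRIVES_PER_VOLUME * VOLUMES
raw_gb = total_drives * DRIVE_GB
usable_gb = raw_gb / 2  # assumption: mirroring halves the usable capacity
temp_gb = VOLUMES * TEMPFILE_GB
utilization = temp_gb / usable_gb

print(f"drives: {total_drives}, raw: {raw_gb}GB, usable after mirroring: {usable_gb:.0f}GB")
print(f"temp space: {temp_gb}GB ({utilization:.0%} of usable capacity)")
```

The 248GB of temp files is the "~250GB" figure, spread over all 16 spindles.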

To test the performance with different IOP sizes, I took one of our "Taste the Thunder" SQL statements, which inserts ~460 million rows after doing some large hash-joins on tables with over a billion rows. Normally this SQL takes around 27 minutes to complete. We use parallel DML and the insert is split across all 4 nodes.

[Figure: results with the default settings of 240K/48K]

[Figure: results after changing max/min to 1024K/512K]

[Figure: results after changing max/min to 1024K/1024K]


To summarize the findings:

The best results were with max/min set to 1024K/512K. We had an improvement of 22% over the default.

As expected, as the IOP size increases, the number of waits drops but the average wait time goes up. The sweet spot in our case was a max/min setting of 1024K/512K.

With a max/min setting of 1024K/1024K, even though the reads were very fast, the write times increased sharply - probably the array did not have enough cache to handle such large writes.

As always, it is best to test and see how it affects performance.

4 comments:

Anonymous said...

I ran into similar problem. Tried increasing both parameters but I still see lot of direct path read temp with p3=1. Good news is I also see few large IO's of 1 MB ie p3=64.

Vishal Desai said...

I tried both parameters and I still see lot of direct path read temp with p3=1 (ie 16k IO's). I also see large IO's with p3=64 but my overall throughput remains same.

Vishal Desai said...

After changing these parameters I still see lot of direct path read with p3=1 (ie 16k reads). I see large direct path reads as well p3=64 but overall throughput is same.

Krishna Manoharan said...

Hi Vishal,

After you changed the parameters, did the number of waits reduce?

Also was the test a hash join or a sort?

Throughput and latency would depend on storage subsystem capabilities.

Thanks
Krishna