Tuesday, December 30, 2008

Host/Lun Queue Depths, Command Tag Queuing from an Oracle perspective

There was an interesting discussion on LinkedIn about how a performance/capacity planning engineer can add value, especially in an economy that is trending downward.

There was a comment that today's infrastructures are well below capacity and under-utilized. This, unfortunately, is the truth in most environments I have seen. Most environments do not do any kind of pro-active performance analysis or capacity planning.

Without any kind of definitive proof, if there is a performance issue, the first comment one can expect to hear is that the hardware needs to be upgraded. And in data warehousing, the storage array is generally first in line.

This is because some of the most common waits (apart from CPU time) in a data warehousing environment are sequential/scattered/direct-path reads, direct path writes and, for logging operations, log file parallel writes.

In my experience, most I/O issues are configuration related rather than an under-performing array. One needs to look at the entire I/O subsystem and optimize it for Oracle's requirements.

A complete storage subsystem consists of all the layers, starting with an Oracle read/write call -

 Oracle read/write call --> Filesystem --> Volume Manager --> Multipathing --> SCSI driver --> HBA --> Array cache on controller --> Processed by array controller --> Lun

One aspect of storage that is often misunderstood is the Queue Depth and how it can impact Async I/O.  

To start with, a quick refresher.

Async I/O - from the Oracle 10g Performance Tuning Guide:

"With synchronous I/O, when an I/O request is submitted to the operating system, the writing process blocks until the write is confirmed as complete. It can then continue processing.

With asynchronous I/O, processing continues while the I/O request is submitted and processed. Use asynchronous I/O when possible to avoid bottlenecks."

From the 10g reference guide -

"Parallel server processes can overlap I/O requests (Async I/O) with CPU processing during table scans."

In the simplest of terms, async I/O is non-blocking I/O. The session submits I/O requests and, once the submission is acknowledged, continues with other activities such as CPU processing instead of blocking until the I/O completes.

Async I/O is enabled by using the ODM interface or setting the FILESYSTEMIO_OPTIONS to SETALL.  SETALL enables both Async and Direct I/O. Direct I/O bypasses the filesystem buffer cache when doing reads/writes.  

I have no experience with using filesystemio_options, as all the instances I have been working with use VRTS ODM. To see whether you are indeed using the ODM interface, a quick and simple check would be to verify the ODM stats in /dev/odm/stats. I would assume that if using filesystemio_options, a truss or strace would reveal aio_read/write calls.
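
For reference, this is roughly the pattern such a trace would correspond to at the POSIX AIO level - a minimal C sketch of a non-blocking read (the file name and sizes are made up for illustration, and this is not how Oracle itself is implemented; compile with -lrt):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    char buf[16384];                                /* one 16K "block" */
    struct aiocb cb;
    int fd = open("/tmp/testfile.dbf", O_RDONLY);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* The process is free to do other work here instead of blocking on the read. */

    while (aio_error(&cb) == EINPROGRESS)
        ;                                           /* real code would aio_suspend(), not spin */

    printf("read %ld bytes asynchronously\n", (long)aio_return(&cb));
    return 0;
}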

Queue Depth - the number of outstanding I/O requests to a device for which a response has not yet been received. There are two parts to queue depth - the host queue depth and the lun queue depth.

  • Host queue depth - Number of transactions waiting to be submitted to a device. 
  • Lun queue depth - Number of transactions submitted to and being processed by the device.

Queue depth = host queue depth + lun queue depth

Asynchronous I/O requests are issued by the processes and are in turn submitted to the lun. Depending on how the lun queue depth has been configured, the requests are split between the host queue and the lun queue.

For example, as shown in the figure below -

We have 4 parallel processes issuing reads to the same lun. The lun queue depth has been set to 2. This means that out of the 4 reads, 2 would be submitted to the lun immediately whereas the other 2 would sit in the host queue, waiting to be moved to the lun queue as soon as it frees up. In order to track the requests back to the requesting process, command tag queuing is employed.





Command Tag Queuing - Tagging an I/O request in the lun queue allows the kernel to associate the specific I/O request with its requestor. This in turn allows the SCSI device to disconnect from the host and process the submitted I/O requests, which allows for better bandwidth utilization on the PCI bus.

Command Tag Queuing can also specify where exactly in the queue the new I/O request should be placed - at the head or the tail of the queue - or that it be executed in a specific order.

In the above figure,  each of the 2 requests submitted to the lun is tagged so that it can be tied back to the original requestor (parallel process).  At the array level, these 2 requests to the lun are sorted/merged (re-ordering) to ensure optimal head movement when submitting to the actual physical devices behind the lun. 

You can see the lun queue depth and host queue depth using iostat.

mkrishna@oradb:> iostat -xnM |more
                     extended device statistics
     r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
     0.1    0.6    0.0    0.0   0.0   0.0     0.0    22.4   0   0  c1t10d0

wait - average number of transactions waiting for service (queue length) or host queue depth.
 
actv - average number of transactions actively being serviced (removed from the queue but not yet completed)  or lun queue depth.

Also sar -d will show you the host queue.

mkrishna@oradb:> sar -d |more

SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 01/02/2009

00:00:00 device %busy avque r+w/s blks/s avwait avserv

00:05:00 sd9 0 0.0 1 16 0.0 23.2

avque - average number of requests outstanding during that time (host queue)

avwait - the average time spent on the avque before it can be moved to the lun queue.

avserv - the average service time in milliseconds for the device.

From what I have observed (at least on Solaris), sar seems to be more accurate than iostat in reporting the host queue.

Depending on the lun queue depth that has been configured, it is very much possible that many I/O requests are simply sitting in the host queue waiting to be moved into the lun queue so that they can be serviced. The wait column in iostat or the avque column in sar -d would give you the exact number of requests in the host queue.

For optimal async I/O, lun queue depths must be set high enough that process I/O requests are not left waiting in the host queue. It makes sense to push the host queue onto the lun queue because the array can act on those requests and sort/merge them (where possible) rather than have them sit in the host queue doing nothing. Bigger lun queue depths mean the array has more requests in the pipeline that it can work on aggressively to optimize head movement. The lun queue depth has a significant impact on throughput.

But set lun queue depths too high and you will start seeing SCSI reset error messages on the system. So you need to strike a balance between too high and too low.

Coming back to the problem definition, traditionally storage vendors and Unix sysadmins recommend setting the lun queue depth to ridiculously low values. This is because storage vendors never disclose the total number of outstanding requests that their controllers can service. They take the worst-case scenario (maximum hosts per port, all submitting requests at the same time) and make a rule that the maximum outstanding requests per lun cannot exceed 8.

This is the basis for sd_max_throttle being set to 8 on many Sun systems, restricting the lun queue depth to 8. The default for sd_max_throttle is 256 (which should be the maximum ever set).
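
For reference, this throttle is a kernel tunable typically set in /etc/system and picked up at the next reboot - the line below is what a vendor would usually ask for (on configurations where the fibre-channel luns are handled by the ssd driver, it is ssd:ssd_max_throttle instead):

set sd:sd_max_throttle=8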

It makes more sense to restrict queue depths at the HBA level rather than at the sd level (keeping sd_max_throttle at its default).

For example, the Emulex lpfc driver can be configured (lpfc.conf) with per-target and per-lun restrictions on the queue depth. You can set both lun-level and target-level queue depths. The right values depend on the array, the raid group configuration and lun breakdown, the number of hosts per port, and so on.

------------CUT--------------

# lun-queue-depth [1 to 128] - The default value lpfc will use to
# limit the number of outstanding commands per FCP LUN. This value
# is global, affecting each LUN recognized by the driver, but may be
# overridden on a per-LUN basis (see below). RAID arrays may want
# to be configured using the per-LUN tunable throttles.
lun-queue-depth=30;

# tgt-queue-depth [0 to 10240] - The default value lpfc will use to
# limit the number of outstanding commands per FCP target. This value
# is global, affecting each target recognized by the driver, but may be
# overridden on a per-target basis (see below). RAID arrays may want
# to be configured using the per-target tunable throttles. A value
# of 0 means don't throttle the target.
tgt-queue-depth=256;

--------------CUT-------------

Every environment is different, and so the optimal queue depths differ. One needs to test, monitor using iostat/sar and see what works best. For our data warehousing environments, I normally set the lun queue depth to 30 and the target queue depth to 256. With these settings, I have not seen many pending requests (5-10 during maximum load) in the host queue. Since data warehousing mostly consists of a smaller number of large I/O requests (unlike OLTP environments), these values (30 and 256) are mildly conservative.
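
As an illustration, the monitoring mentioned above is simply watching wait/actv and avque/avwait at intervals while the load runs (the interval and count here are arbitrary):

mkrishna@oradb:> iostat -xnM 5
mkrishna@oradb:> sar -d 5 12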

The arrays we use are Hitachi Modular storage (AMS1000/9585). Arrays are shared between the data warehouse instances (no more than 3 instances per array) and each instance is assigned storage on multiple arrays. Hitachi Modular arrays are mid-range and not particularly high on specs (8/16 GB cache, 8 ports, 2 controllers (active/passive)).

Sunday, December 28, 2008

Analyzing the impact of the Vxfs filesystem block size on Oracle

I am usually asked what the ideal Vxfs filesystem block size is for an Oracle DB block size of 16K. I always reply - 8K (the maximum on Vxfs).

All along, my reasoning was that with, say, a 1K filesystem block size, a 16K Oracle block read would end up as sixteen 1K I/O requests to the filesystem, and likewise for writes. With a filesystem block size of 8K, you would be reduced from sixteen 1K requests to two 8K requests - or so I thought.

I decided to test to see what exactly was happening and it proved that I was wrong – at least with respect to Vxfs.

Firstly some background about Vxfs –

Vxfs is an extent-based filesystem - meaning it allocates space to files not as individual blocks but as extents. Extents are contiguous sets of filesystem blocks. Extent sizes vary, and the way a file is created greatly influences extent sizing. As a file grows, more extents are added to it.

The interesting part about Vxfs and extents is that I/O is never split across extents: a request for a contiguous set of blocks within an extent is satisfied with a single I/O. If a request spans extents, it results in multiple I/O requests - quite similar to how a db file scattered read is split across Oracle extents. From the Vxfs guide -

"By allocating disk space to files in extents, disk I/O to and from a file can be done in units of multiple blocks. This type of I/O can occur if storage is allocated in units of consecutive blocks. For sequential I/O, multiple block operations are considerably faster than block-at-a-time operations. Almost all disk drives accept I/O operations of multiple blocks."

So coming back to Oracle – some test scenarios

I decided to test and see for myself.

The environment is Solaris 9 on an E4900 with Storage Foundation for Oracle Enterprise Edition. Oracle is 10.2.0.3 using VRTS ODM.

I created 2 tablespaces - one on a 1K filesystem and the other on an 8K filesystem. Each had one datafile of 5 GB.


Identical tables with ~1000 rows were created on both the tablespaces.  Indexes were created on both tables on relevant columns.
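
For completeness, the setup was along these lines (a sketch - the volume names match the fstyp output below, while the mount points and datafile names are illustrative, and the mount steps are omitted):

root@oracle:> mkfs -F vxfs -o bsize=1024 /dev/vx/rdsk/oracledg/test1k
root@oracle:> mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/edwrsdg/test8k

SQL> create tablespace test1k datafile '/test1k/test1k_01.dbf' size 5g;
SQL> create tablespace test8k datafile '/test8k/test8k_01.dbf' size 5g;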

On a 1K Filesystem block size and a 16K DB Block size

First to confirm 1K block size

root@oracle:> fstyp -v /dev/vx/rdsk/oracledg/test1k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:33 2008
logstart 0 logend 0
bsize 1024 size 15728640 dsize 0 ninode 15728640 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I initiated both sequential and scattered reads on the tables.

A vxtrace showed that Oracle was issuing requests of 16K or larger, and they were single I/Os. They were not broken up into smaller I/O requests as one would normally have expected. I could not use truss because I/O requests show up as ioctl calls when using ODM. There was no read I/O smaller than 32 blocks (16K), confirming that I/Os are not split on filesystem block boundaries.

------------------------------------------
1254 START read vol test1k op 0 block 4326176 len 32 <----- 16K Reads
1254 END read vol test1k op 0 block 4326176 len 32 time 0

--------CUT---------

1260 START read vol test1k op 0 block 4326048 len 128 <------ 64K Reads
1260 END read vol test1k op 0 block 4326048 len 128 time 0
1261 START read vol test1k op 0 block 4326176 len 32
1261 END read vol test1k op 0 block 4326176 len 32 time 0
1262 START read vol test1k op 0 block 4325792 len 128
1262 END read vol test1k op 0 block 4325792 len 128 time 0

------------CUT------------------------------

On a 8K Filesystem block size and a 16K DB Block size

To confirm the block size is indeed 8k

root@oracle:> fstyp -v /dev/vx/rdsk/edwrsdg/test8k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:47 2008
logstart 0 logend 0
bsize 8192 size 655360 dsize 0 ninode 655360 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I did the same set of reads as for the 1K filesystem, and the behavior was the same.

------------CUT-----------

1265 START read vol test1k op 0 block 4326048 len 128 <------ 64K reads
1265 END read vol test1k op 0 block 4326048 len 128 time 0
1266 START read vol test1k op 0 block 4326176 len 32 <--------- 16K reads
1266 END read vol test1k op 0 block 4326176 len 32 time 0
1267 START read vol test1k op 0 block 4325888 len 32
1267 END read vol test1k op 0 block 4325888 len 32 time 0

------------CUT----------------

So reads behave exactly as documented. Oracle does reads only in multiples of the DB block size. On either a 1K or an 8K Vxfs filesystem, a 16K read (or a multiple of 16K) is a sequential read of contiguous blocks and hence is satisfied by a single I/O request - as long as the request can be met from a single extent.

So from an I/O perspective, it really does not matter whether you use 1K or 8K.

Now there is another aspect to this - filesystem overhead, fragmentation, extent sizing and space management.

A 1K filesystem block size would reduce space wastage at the cost of having many more blocks to manage (filesystem overhead), whereas an 8K filesystem block size would be ideal for an Oracle instance using a DB block size of 8K or higher.

From a filesystem management perspective, an 8K filesystem block size makes better sense, as Oracle never stores data in units smaller than the DB block size. An 8K block size reduces the number of filesystem blocks and correspondingly the overhead of maintaining them. I do not know if anyone uses a 4K DB block size any more; all I have seen are 8K and higher.

To reduce fragmentation, it is best if the datafile uses a single extent (as will be the case when it is created on a database using VRTS ODM). The extent here refers to Vxfs extents, not tablespace extents. To keep a datafile in a single Vxfs extent, it should never be extended; instead, new datafiles should be added to increase tablespace capacity.

You can find out the extents allocated to a file by running vxstorage_stats - it is an invaluable tool.  Fragmentation status can be identified by running fsadm. Normally when using ODM, fragmentation should be minimal.
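
For example, something along these lines (the paths are illustrative; check the man pages for the exact options in your Storage Foundation release):

root@oracle:> vxstorage_stats -f /test8k/test8k_01.dbf
root@oracle:> fsadm -F vxfs -E /test8k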

Wednesday, December 24, 2008

_realfree_heap_pagesize_hint - Assessing the impact on Solaris and Linux

The _realfree_heap_pagesize_hint parameter in 10g provides a mechanism by which process private memory (PGA) can use bigger memory page sizes and thus reduce TLB/TSB misses. The parameter is set in bytes.
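
For example, asking for 4M pages means setting the hint to 4194304 bytes - in a pfile that is a line like the one below, or the equivalent alter system with the underscore parameter double-quoted (being a hidden parameter, it should only be changed after testing and preferably with Oracle Support's guidance):

_realfree_heap_pagesize_hint = 4194304

SQL> alter system set "_realfree_heap_pagesize_hint" = 4194304 scope=spfile;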

This is especially important for data warehousing, where a session can consume a significant amount of anonymous memory and, in many cases, the workarea is bigger than the SGA.

I wrote about TLB/TSB misses from an Oracle perspective in an earlier blog here.

http://dsstos.blogspot.com/2008/11/assessing-tlbtsb-misses-and-page-faults.html

 
This parameter is designed to work on the Solaris platform only; however, it does work partially on Linux too, and probably behaves the same way on other platforms.

As per this hint,

  1. memory extents within the heap are allocated in _realfree_heap_pagesize_hint sized chunks, and
  2. each chunk is, via a memcntl(2) call, placed on an OS page of _realfree_heap_pagesize_hint size (provided that page size is a valid choice).

For example - an extent of 16MB would be carved up into 4MB chunks, and each 4M chunk would be mapped to an individual 4M OS memory page (if _realfree_heap_pagesize_hint = 4M).
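
On Solaris the mechanism underneath looks roughly like the sketch below - my reconstruction of the pattern shown in the truss output further down, using an anonymous mapping instead of the /dev/zero descriptor Oracle actually maps; memcntl(2) with MC_HAT_ADVISE / MHA_MAPSIZE_VA is the documented way to request a preferred page size:

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>

#define EXTENT (16 * 1024 * 1024)   /* 16M extent        */
#define CHUNK  ( 4 * 1024 * 1024)   /* 4M realfree chunk */

int main(void)
{
    /* Reserve address space for the whole extent - no backing store yet. */
    char *ext = mmap(NULL, EXTENT, PROT_NONE,
                     MAP_PRIVATE | MAP_NORESERVE | MAP_ANON, -1, 0);
    if (ext == MAP_FAILED) { perror("mmap reserve"); return 1; }

    /* Commit the first chunk read/write at the same address. */
    if (mmap(ext, CHUNK, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED | MAP_ANON, -1, 0) == MAP_FAILED) {
        perror("mmap commit"); return 1;
    }

    /* Advise the HAT to back this range with 4M pages if it can. */
    struct memcntl_mha mha;
    mha.mha_cmd      = MHA_MAPSIZE_VA;
    mha.mha_flags    = 0;
    mha.mha_pagesize = CHUNK;          /* must be a page size the platform supports */
    if (memcntl((caddr_t)ext, CHUNK, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0) != 0)
        perror("memcntl MC_HAT_ADVISE");

    return 0;
}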

Solaris:


Solaris supports four page sizes on the UltraSparc IV+ platform (8K-default, 64K, 512K and 4M). The default setting for the _realfree_heap_pagesize_hint is 65536 or 64K.

In order to test this parameter, I did a sort on an un-indexed table with approx 3.8 million rows. The average row length was ~243 bytes and the table was approx 1GB in size. One reason I selected such a big table was to see how memory utilization changed with different page sizes.

_realfree_heap_pagesize_hint at 65536 (Default)

This implies that when a session requests anon memory, Oracle will use 64K pages. However, this did not seem to be true: with a setting of 65536, only 8K pages were used.

I did a truss of the shadow process when doing the sort and this is what I observed.

-----------CUT-------------------

19167/1: 5.5795 mmap(0x00000000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 3080192) = 0xFFFFFFFF7A5F0000
19167/1: 5.5796 mmap(0xFFFFFFFF7A5F0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A5F0000
19167/1: 5.5813 mmap(0xFFFFFFFF7A600000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A600000
19167/1: 5.5829 mmap(0xFFFFFFFF7A610000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A610000
19167/1: 5.5846 mmap(0xFFFFFFFF7A620000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A620000
19167/1: 5.5863 mmap(0xFFFFFFFF7A630000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A630000

------------------CUT-------------------------------------

As you can see, the extent of size 2M was requested with MAP_NORESERVE and then carved into 64K chunks. However, there is no accompanying memcntl(2) request asking the OS to allocate 64K pages for the chunks. This is also confirmed by pmap/trapstat.

trapstat not showing usage of any 64K pages.




pmap output showing anon pages using 8k page size.




Changing the _realfree_heap_pagesize_hint to 512K

Changing the hint to 512K shows that it indeed requests 512K pages from the OS.

------------CUT-------------

19277/1: 14.6646 mmap(0x00000000, 4718592, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 7864320) = 0xFFFFFFFF79780000
19277/1: 14.6647 munmap(0xFFFFFFFF79B80000, 524288) = 0
19277/1: 14.6648 mmap(0xFFFFFFFF79780000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79780000
19277/1: 14.6649 memcntl(0xFFFFFFFF79780000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EC0, 0, 0) = 0
19277/1: 14.6909 mmap(0xFFFFFFFF79800000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79800000
19277/1: 14.6910 memcntl(0xFFFFFFFF79800000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7F80, 0, 0) = 0

---------------CUT-----------------------

As you can see, a memcntl(2) call is issued to request that the OS allocate a 512K page size. This is also corroborated by trapstat and pmap.

trapstat output showing TLB/TSB misses for 512K pages.




pmap output for anon pages showing 512K pages being used.




Changing the _realfree_heap_pagesize_hint to 4M

Changing the hint to 4M also shows that the pagesize being requested is 4M.

Truss output -

-------------------CUT-----------------------

18995/1: 34.0445 mmap(0x00000000, 20971520, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 390070272) = 0xFFFFFFFF53000000
18995/1: 34.0447 munmap(0xFFFFFFFF54000000, 4194304) = 0
18995/1: 34.0448 mmap(0xFFFFFFFF53000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF53000000
18995/1: 34.0449 memcntl(0xFFFFFFFF53000000, 4194304, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EE0, 0, 0) = 0


-----------------CUT-------------------------


Trapstat output confirming usage of 4M pages for anon memory



And finally pmap output.




So we now know that this works as expected, except for the default setting of 64K. So how does this affect performance?

  1. By using bigger page sizes, we can cover more memory with the same number of TLB/TSB entries and so reduce TLB/TSB misses (see the worked numbers after this list).
  2. Bigger settings also reduce the number of mmap requests, which reduces CPU spent in system time. For example, a 4M extent needs 64 mmap calls with the default 64K chunks (as the earlier truss output shows), but only one with a 4M hint.
  3. So memory requests can be satisfied significantly faster when using bigger page sizes.
  4. However, with bigger pages one would expect memory utilization to go up as well. Since the basic unit of allocation is the page size (8K, 512K or 4M), there is potential for memory wastage.
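
To put rough numbers on the first point, covering a 1 GB workarea (about the size of the sort in this test) takes:

  8K pages   : 1 GB / 8 KB   = 131,072 mappings
  512K pages : 1 GB / 512 KB = 2,048 mappings
  4M pages   : 1 GB / 4 MB   = 256 mappings

The fewer mappings the TLB/TSB has to cover, the fewer misses the sort takes.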

In order to check for memory wastage, I checked v$sql_workarea_active along with the session pga/uga memory statistics to see how much memory was consumed with each page size setting. By sizing the PGA and setting _smm_max_size appropriately, I ensured that the sort completed optimally, in memory, without spilling to disk.
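
The checks were essentially queries along these lines (a sketch; substitute the session's sid):

SQL> select operation_type, actual_mem_used/1024/1024 mb_used
       from v$sql_workarea_active;

SQL> select sn.name, st.value/1024/1024 mb
       from v$statname sn, v$sesstat st
      where sn.statistic# = st.statistic#
        and st.sid = &sid
        and sn.name in ('session pga memory', 'session uga memory');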

With the default setting of 64K

Time taken to complete - 30-32 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1102.92 MB
session uga memory - 1102.3 MB

With 512K

Time taken to complete - 24-28 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1103.73 MB
session uga memory - 1102.2 MB

With 4M

Time taken to complete - 24-27 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1112.2 MB
session uga memory - 1103.99 MB

Looking at the above stats, for the same sort operation requiring 1GB of workarea, PGA usage is a fraction higher (~1%) with the bigger page sizes. This may matter for very big sorts or when multiple sessions are running simultaneously - especially with parallel operations - so there is always a chance of ORA-4030 errors if you do not configure the instance appropriately.

Theoretically the timings should improve because of the smaller number of mmap operations and the reduced TLB/TSB misses. All in all, it probably makes sense to use this feature to enable bigger page sizes for data warehousing.

On Linux

On Solaris the _realfree_heap_pagesize_hint works well, since four different page sizes (8K, 64K, 512K and 4M) are supported and can be allocated dynamically. On Linux, however, only two page sizes are supported (4K and 2M). The 2M page size can be allocated only as hugepages, which are used for the SGA; hugepages cannot be used for private process memory.

So on Linux, setting _realfree_heap_pagesize_hint to bigger values only results in _realfree_heap_pagesize_hint sized chunks within the extents; the chunks are not mapped to physical memory pages of the same size. This still reduces the number of mmap requests and is thus better than the default.

With the default setting of 64K

------------CUT-------------

mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0xf1) = 0xb70f1000
mmap2(0xb70f1000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb70f1000
mmap2(0xb7101000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb7101000

-------CUT-----------

As you can see from above, 64K chunks are requested.

Changing to 4M

----------CUT-----------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x36f1) = 0xb2af1000
mmap2(0xb2af1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2af1000
mmap2(0xb2ef1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2ef1000
mmap2(0xb32f1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb32f1000

---------CUT---------
As you can see from the above, with a setting of 4M the chunks are 4M in size; however, there is no request for a 4M page size, as this is not feasible on Linux.

Changing to 8M

I was curious to see how this would play out when changing to 8M.

--------CUT-------------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x5af1) = 0xb02f1000
mmap2(0xb02f1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb02f1000
mmap2(0xb0af1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb0af1000

--------CUT--------------

The chunks are now 8M in size. I noticed the same behavior on Solaris too (minus the memcntl call to request a matching OS page size, since 8M is not a valid page size there).