Tuesday, December 30, 2008

Host/Lun Queue Depths, Command Tag Queuing from an Oracle perspective

There was an interesting discussion on LinkedIn about how a performance/capacity planning engineer can be of value, especially in an economy which is trending downward.

There was a comment that today's infrastructures are well below capacity and under-utilized. This, unfortunately, is the truth in most environments I have seen. Most environments do not even do any kind of pro-active performance analysis/capacity planning.

Without any kind of definitive proof, if there is a performance issue, the first comment one can expect to hear is that the hardware needs to be upgraded. And in datawarehousing, the storage array is generally the first in line.

This is because the most common waits (apart from CPU Time) from a datawarehousing perspective would be sequential/scattered/direct-path reads, direct path writes and, for logging operations, log file parallel writes.

In my experience, most I/O issues are configuration related rather than caused by an under-performing array. One needs to look at the entire I/O subsystem and optimize it for Oracle requirements.

A complete storage subsystem would consist of all the layers starting with an Oracle read/write call -

 Oracle Read/Write call  -->  Filesystem --> Volume Manager --> Multipathing --> scsi driver--> HBA ---> Array Cache on controller --> Processed by Array controller --> Lun

One aspect of storage that is often misunderstood is the Queue Depth and how it can impact Async I/O.  

To start with, to refresh memories, 

Async I/O - From the Oracle 10g performance tuning guide

"With synchronous I/O, when an I/O request is submitted to the operating system, the writing process blocks until the write is confirmed as complete. It can then continue processing.

With asynchronous I/O, processing continues while the I/O request is submitted and processed. Use asynchronous I/O when possible to avoid bottlenecks."

From the 10g reference guide -

"Parallel server processes can overlap I/O requests (Async I/O) with CPU processing during table scans."

In the simplest of terms, Async I/O is non-blocking I/O. The session submits I/O requests and waits only for confirmation that the requests have been received. Once they are acknowledged, it can continue with other activities such as CPU processing.

Async I/O is enabled by using the ODM interface or by setting the FILESYSTEMIO_OPTIONS parameter to SETALL. SETALL enables both Async and Direct I/O. Direct I/O bypasses the filesystem buffer cache when doing reads/writes.

I have no experience with using filesystemio_options as all the instances I have been working with have used VRTS ODM. To see if you are indeed using the ODM interface, a quick and simple check would be to verify the ODM stats in /dev/odm/stats. I would assume that if using filesystemio_options, a truss or strace would reveal aio_read/write calls.

Queue Depth -  It is the number of outstanding I/O requests to a device for which a response has not been received from the device. There are 2 parts to queue depth - host queue depth and lun queue depth.

  • Host queue depth - Number of transactions waiting to be submitted to a device. 
  • Lun queue depth - Number of transactions submitted to and being processed by the device.

Queue depth = host queue depth + lun queue depth

Asynchronous I/O requests are issued by the processes and these in turn are submitted to the lun. Depending on how the lun queue depth has been configured, the requests are split between the host queue and the lun queue.

For example, as shown in the figure below -

We have 4 parallel processes issuing reads to the same lun. The lun queue depth has been set to 2. This means that out of the 4 reads, 2 reads would be submitted to the lun immediately whereas the other 2 would be in the host queue. The 2 requests in the host queue wait to be moved to the lun queue as soon as a slot in the lun queue frees up. In order to track the requests back to the requesting process, command tag queuing is employed.
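The split described above can be sketched in a few lines of Python. This is only a toy model of the bookkeeping, not how any driver actually implements it; the function names (split_requests, complete_one) and the request labels p1..p4 are made up for illustration.

```python
from collections import deque


def split_requests(requests, lun_queue_depth):
    """Split outstanding requests into (lun_queue, host_queue).

    The first `lun_queue_depth` requests go to the lun immediately;
    the rest wait on the host in arrival order.
    """
    lun_queue = list(requests[:lun_queue_depth])
    host_queue = deque(requests[lun_queue_depth:])
    return lun_queue, host_queue


def complete_one(lun_queue, host_queue, tag):
    """Complete the request identified by `tag` and, if anything is
    waiting on the host, move the oldest waiting request to the lun."""
    lun_queue.remove(tag)
    if host_queue:
        lun_queue.append(host_queue.popleft())


# 4 parallel processes, lun queue depth of 2 (the scenario in the text).
lun_queue, host_queue = split_requests(["p1", "p2", "p3", "p4"], lun_queue_depth=2)
print("lun queue:", lun_queue, "host queue:", list(host_queue))

# The lun finishes p1, so p3 moves from the host queue to the lun queue.
complete_one(lun_queue, host_queue, "p1")
print("lun queue:", lun_queue, "host queue:", list(host_queue))
```

At every step the total queue depth is the sum of the two queue lengths, which matches the definition given earlier.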

Command Tag Queuing -  Tagging an I/O request in the lun queue allows the kernel to associate the specific I/O request with the requestor. This in turn allows the SCSI device to disconnect from the host and process the submitted I/O requests. This allows for better bandwidth utilization on the PCI bus.

Command Tag Queuing can also specify where in the queue a new I/O request is to be placed: at the tail, at the head, or to be executed in a specific order.

In the above figure,  each of the 2 requests submitted to the lun is tagged so that it can be tied back to the original requestor (parallel process).  At the array level, these 2 requests to the lun are sorted/merged (re-ordering) to ensure optimal head movement when submitting to the actual physical devices behind the lun. 

You can see the lun queue depth and host queue depth using iostat.

mkrishna@oradb:> iostat -xnM |more
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.1    0.6    0.0    0.0  0.0  0.0    0.0   22.4   0   0 c1t10d0

wait - average number of transactions waiting for service (queue length), i.e. the host queue depth.
actv - average number of transactions actively being serviced (removed from the queue but not yet completed), i.e. the lun queue depth.
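If you want to watch these two columns over time, a small script can pull them out of the iostat output. The following is a minimal sketch, assuming the Solaris iostat -xn column layout shown above (a header line ending in "device", followed by one line per device); parse_iostat_xn is a made-up helper name and the sample text is simply the output from this post.

```python
def parse_iostat_xn(text):
    """Return {device: (wait, actv)} from `iostat -xn` style output.

    wait is read as the host queue depth and actv as the lun queue depth.
    """
    queues = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The header line names the columns and ends with "device".
        if fields[-1] == "device" and "wait" in fields:
            columns = fields
            continue
        # Skip banner lines and anything that does not match the header.
        if columns is None or len(fields) != len(columns):
            continue
        row = dict(zip(columns, fields))
        queues[row["device"]] = (float(row["wait"]), float(row["actv"]))
    return queues


sample = """\
  extended device statistics
  r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
  0.1 0.6 0.0 0.0 0.0 0.0 0.0 22.4 0 0 c1t10d0
"""

for device, (wait, actv) in parse_iostat_xn(sample).items():
    print(device, "host queue:", wait, "lun queue:", actv)
```

Because the columns are looked up by name from the header, the same function should cope with -xn and -xnM output alike, though I have only tried it against the sample above.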

sar -d will also show you the host queue.

mkrishna@oradb:> sar -d |more

SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 01/02/2009

00:00:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv

00:05:00   sd9               0     0.0       1      16     0.0    23.2

avque - average number of requests outstanding during that time (host queue)

avwait - the average time in milliseconds a request spends in the host queue before it is moved to the lun queue.

avserv - the average service time in milliseconds for the device.

From what I have observed (at least on Solaris), sar seems to be more accurate in reporting the host queue than iostat.

Depending on the lun queue depth that has been configured, it is quite possible that many I/O requests are simply sitting in the host queue waiting to be moved into the lun queue so that they can be serviced. The wait column in iostat or the avque column in sar -d gives the average number of requests in the host queue over the sampling interval.

For optimal Async I/O, lun queue depths must be set high enough that process I/O requests are not left waiting in the host queue. It makes sense to push requests from the host queue onto the lun queue because the array can act on them and do a sort/merge (where possible), whereas requests sitting in the host queue accomplish nothing. Bigger lun queue depths mean the array has more requests in the pipeline which it can act upon aggressively to optimize head movement. The lun queue depth has a significant impact on throughput.

But set the lun queue depth too high and you will start seeing SCSI reset error messages on the system. So you need to strike a balance between too high and too low.

Coming back to the problem definition, traditionally storage vendors and Unix sys admins recommend setting the lun queue depth to ridiculously low values. This is because storage vendors never disclose the total number of outstanding requests that can be serviced by their controllers. They take worst-case scenarios (maximum hosts/port, all submitting requests at the same time) and make a rule that the maximum outstanding requests/lun cannot exceed 8.

This is the basis for sd_max_throttle being set to 8 on many Sun systems, which restricts the lun queue depth to 8. The default for sd_max_throttle is 256 (which should be the maximum ever set).
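The worst-case reasoning behind a limit such as 8 can be sketched as simple arithmetic. Since vendors do not publish the real controller limits, every number below (a port limit of 512 outstanding requests, 8 hosts per port, 8 luns per host) is purely hypothetical and chosen only to show how a figure of 8 could fall out; the function name is made up as well.

```python
def worst_case_lun_queue_depth(port_queue_limit, hosts_per_port, luns_per_host):
    """Largest per-lun queue depth that cannot overrun the port even if
    every lun on every host has a full queue at the same moment."""
    total_luns = hosts_per_port * luns_per_host
    return port_queue_limit // total_luns


# Hypothetical figures, not taken from any vendor documentation.
print(worst_case_lun_queue_depth(port_queue_limit=512, hosts_per_port=8, luns_per_host=8))

# With fewer hosts sharing the port, the same budget allows a deeper lun queue.
print(worst_case_lun_queue_depth(port_queue_limit=512, hosts_per_port=2, luns_per_host=8))
```

The second call shows why a blanket worst-case value tends to be far too low for an array that is shared by only a few hosts.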

It makes more sense to restrict the queue depths at the HBA level rather than the sd level (keeping sd_max_throttle at the default).

For example, the Emulex lpfc driver can be configured (lpfc.conf) with per-target and per-lun restrictions on the queue depth. The values depend on the array, raid group configuration and lun breakdown, number of hosts/port etc.


# lun-queue-depth [1 to 128] - The default value lpfc will use to
# limit the number of outstanding commands per FCP LUN. This value
# is global, affecting each LUN recognized by the driver, but may be
# overridden on a per-LUN basis (see below). RAID arrays may want
# to be configured using the per-LUN tunable throttles.

# tgt-queue-depth [0 to 10240] - The default value lpfc will use to
# limit the number of outstanding commands per FCP target. This value
# is global, affecting each target recognized by the driver, but may be
# overridden on a per-target basis (see below). RAID arrays may want
# to be configured using the per-target tunable throttles. A value
# of 0 means don't throttle the target.


Every environment is different and so the optimal queue depths will differ. One needs to test, monitor using iostat/sar and see what works best. For our datawarehousing environments, I normally set the lun queue depth to 30 and the target queue depth to 256. With these settings, I have not seen many pending requests (5-10 during maximum load) in the host queue for our environments. Since datawarehousing mostly consists of a smaller number of large I/O requests compared with OLTP environments, these values (30 and 256) are mildly conservative.

The arrays we use are Hitachi Modular storage (AMS1000/9585).  Arrays are shared between the datawarehouse instances (not more than 3 instances/array) and each instance is assigned storage on multiple arrays.  Hitachi Modular arrays are mid-range and really are not high on the specs (8/16GB Cache, 8 ports, 2 controllers (Active/Passive)).  


Anonymous said...

Thank you for the clear concise explanation. It was most appreciated.

Taral said...

Very interesting, this cleared up many concepts. Bear with me if I ask a basic question, but I am new to this.

Let's say we have one query running. If I understand correctly, it goes to the SAN cache through the LUN, and if the LUN is not properly configured we see I/O waits. How, then, do we map a process (pid) to a LUN? Say we have 4 LUNs but our process is only using 2 and the others are idle, so that capacity is wasted.

And how can all 4 LUNs be utilized in this situation, where only 2 are being used?

SSK said...

Hi Taral,

Thank you for the comment. As long as the data that is being read by your application/process resides on the 4 luns, then the 4 luns will be used. If the data resides on only 2 luns, then only 2 would be used.

So you would need to plan appropriately and place your datafiles on all available luns to spread the IO.