Tuesday, December 30, 2008

Host/Lun Queue Depths, Command Tag Queuing from an oracle perspective

There was an interesting discussion on LinkedIn about how a performance/capacity planning engineer can add value, especially in an economy that is trending downward.

There was a comment that today's infrastructures are well below capacity and under-utilized. Unfortunately, this is true in most environments I have seen. Most environments do not do any kind of proactive performance analysis or capacity planning.

Without any kind of definitive proof, if there is a performance issue, the first comment one can expect to hear is that the hardware needs to be upgraded. And in data warehousing, the storage array is generally first in line.

This is because the most common waits (apart from CPU time) from a data warehousing perspective would be sequential/scattered/direct-path reads, direct path writes and, for logging operations, log file parallel writes.

In my experience, most I/O issues are configuration related rather than an underperforming array. One needs to look at the entire I/O subsystem and optimize it for Oracle's requirements.

A complete storage subsystem would consist of all the layers starting with an oracle read/write call - 

 Oracle Read/Write call --> Filesystem --> Volume Manager --> Multipathing --> SCSI driver --> HBA --> Array Cache on controller --> Processed by Array controller --> Lun

One aspect of storage that is often misunderstood is the Queue Depth and how it can impact Async I/O.  

To start with, to refresh memories, 

Async I/O - From the Oracle 10g performance tuning guide

"With synchronous I/O, when an I/O request is submitted to the operating system, the writing process blocks until the write is confirmed as complete. It can then continue processing.

With asynchronous I/O, processing continues while the I/O request is submitted and processed. Use asynchronous I/O when possible to avoid bottlenecks."

From the 10g reference guide -

"Parallel server processes can overlap I/O requests (Async I/O) with CPU processing during table scans."

In the simplest of terms, Async I/O is non-blocking I/O. The session can submit I/O requests and wait only for confirmation that the request has been received. Once acknowledged, it can continue with other activities such as CPU processing.

Async I/O is enabled by using the ODM interface or setting the FILESYSTEMIO_OPTIONS to SETALL.  SETALL enables both Async and Direct I/O. Direct I/O bypasses the filesystem buffer cache when doing reads/writes.  

I have no experience with using filesystemio_options as all the instances I have been working with use VRTS ODM. To verify that you are indeed using the ODM interface, a quick and simple check would be to look at the ODM stats in /dev/odm/stats. I would assume that if using filesystemio_options, a truss or strace would reveal aio_read/aio_write calls.
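As a quick sketch (the process id below is a placeholder), the checks might look like this -

------------CUT--------------

# ODM interface in use - the counters here should keep increasing while the database is busy.
cat /dev/odm/stats

# If relying on filesystemio_options instead, trace a busy shadow process and
# look for async I/O calls (aio_read/aio_write on Solaris, io_submit on Linux).
truss -c -p <shadow_pid>     # Solaris - summarize system calls
strace -c -p <shadow_pid>    # Linux equivalent

--------------CUT-------------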

Queue Depth -  It is the number of outstanding I/O requests to a device for which a response has not been received from the device. There are 2 parts to queue depth - host queue depth and lun queue depth.

  • Host queue depth - Number of transactions waiting to be submitted to a device. 
  • Lun queue depth - Number of transactions submitted to and being processed by the device.

Queue depth = host queue depth + lun queue depth

Asynchronous I/O requests are issued by the processes and in turn submitted to the lun. Depending on how the lun queue depth has been configured, the requests are split between the host queue and the lun queue.

For example, as shown in the figure below -

We have 4 parallel processes issuing reads to the same lun. The lun queue depth has been set to 2. This means that out of the 4 reads, 2 would be submitted to the lun immediately whereas the other 2 would sit in the host queue, waiting to be moved to the lun queue as soon as it frees up. In order to track the requests back to the requesting process, command tag queuing is employed.





Command Tag Queuing - Tagging an I/O request in the lun queue allows the kernel to associate the specific I/O request with the requestor. This in turn allows the SCSI device to disconnect from the host and process the submitted I/O requests, which allows for better bandwidth utilization on the PCI bus.

Command tag queuing can also specify where exactly in the queue the new I/O request should be placed - at the tail, at the head, or in a specific execution order.

In the above figure, each of the 2 requests submitted to the lun is tagged so that it can be tied back to the original requestor (parallel process). At the array level, these 2 requests are sorted/merged (re-ordered) to ensure optimal head movement when they are submitted to the actual physical devices behind the lun.

You can see the lun queue depth and host queue depth using iostat.

mkrishna@oradb:> iostat -xnM |more
  extended device statistics  
  r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
  0.1 0.6 0.0 0.0 0.0 0.0 0.0 22.4 0 0 c1t10d0

wait - average number of transactions waiting for service (queue length) or host queue depth.
 
actv - average number of transactions actively being serviced (removed from the queue but not yet completed)  or lun queue depth.

Also sar -d will show you the host queue.

mkrishna@oradb:> sar -d |more

SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 01/02/2009

00:00:00 device %busy avque r+w/s blks/s avwait avserv

00:05:00 sd9 0 0.0 1 16 0.0 23.2

avque - average number of requests outstanding during that time (host queue)

avwait - the average time spent on the avque before it can be moved to the lun queue.

avserv - the average service time in milliseconds for the device.

From what I have observed (at least on Solaris), sar seems to be more accurate than iostat in reporting the host queue.

Depending on the lun queue depth that has been configured, it is very much possible that many I/O requests are simply sitting in the host queue waiting to be moved into the lun queue so that they can be serviced. The wait column in iostat or the avque column in sar -d would give you the exact number of requests in the host queue.
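A simple way to watch this during a load test (the device name is just an example) -

------------CUT--------------

# Sample the host queue (wait) and lun queue (actv) columns every 5 seconds;
# -z suppresses idle devices.
iostat -xnz 5 | egrep 'device|c1t10d0'

# sar view of the same host queue (avque/avwait), sampling every 5 seconds, 100 times.
sar -d 5 100

--------------CUT-------------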

For optimal Async I/O, lun queue depths must be set high enough that process I/O requests are not waiting in the host queue. It makes sense to push the host queue onto the lun queue because the array can act on these requests and do a sort/merge (where possible) rather than having them simply sit in the host queue doing nothing. Bigger lun queue depths mean the array has more requests in the pipeline which it can act upon aggressively to optimize head movement. The lun queue depth has a significant impact on throughput.

But set the lun queue depth too high and you will start seeing SCSI reset error messages on the system. So you need to strike a balance between too high and too low.

Coming back to the problem definition, storage vendors and Unix sysadmins traditionally recommend setting the lun queue depth to ridiculously low values. This is because storage vendors never disclose the total number of outstanding requests that can be serviced by their controllers. They take worst case scenarios (maximum hosts per port, all submitting requests at the same time) and make a rule that the maximum outstanding requests per lun cannot exceed 8.

This is the basis for sd_max_throttle being set to 8 on many Sun systems, restricting the lun queue depth to 8. The default for sd_max_throttle is 256 (which should be the maximum ever set).

It makes more sense to restrict queue depths at the HBA level rather than at the sd level (keeping sd_max_throttle at its default).
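If the sd throttle does get set, it lives in /etc/system - an illustrative sketch only (ssd is the driver for FC devices on many Sun hosts, sd for parallel SCSI) -

------------CUT--------------

* Leave the sd/ssd throttle at its default of 256 and do the real
* throttling at the HBA driver (see the lpfc.conf example below).
set sd:sd_max_throttle=256
set ssd:ssd_max_throttle=256

--------------CUT-------------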

For example, the Emulex lpfc driver can be configured (lpfc.conf) with per-target and per-lun restrictions on the queue depth. You can set both lun and target level queue depths. The values depend on the array, the raid group configuration and lun breakdown, the number of hosts per port, etc.

------------CUT--------------

# lun-queue-depth [1 to 128] - The default value lpfc will use to
# limit the number of outstanding commands per FCP LUN. This value
# is global, affecting each LUN recognized by the driver, but may be
# overridden on a per-LUN basis (see below). RAID arrays may want
# to be configured using the per-LUN tunable throttles.
lun-queue-depth=30;

# tgt-queue-depth [0 to 10240] - The default value lpfc will use to
# limit the number of outstanding commands per FCP target. This value
# is global, affecting each target recognized by the driver, but may be
# overridden on a per-target basis (see below). RAID arrays may want
# to be configured using the per-target tunable throttles. A value
# of 0 means don't throttle the target.
tgt-queue-depth=256;

--------------CUT-------------

Every environment is different, and so the optimal queue depths will differ. One needs to test, monitor using iostat/sar and see what works best. For our data warehousing environments, I normally set the lun queue depth to 30 and the target queue depth to 256. With these settings, I have not seen many pending requests (5-10 during maximum load) in the host queue. Since data warehousing mostly consists of a smaller number of large I/O requests, unlike OLTP environments, these values (30 and 256) are mildly conservative.

The arrays we use are Hitachi modular storage (AMS1000/9585). Arrays are shared between the data warehouse instances (not more than 3 instances per array) and each instance is assigned storage on multiple arrays. Hitachi modular arrays are mid-range and are not particularly high on specs (8/16GB cache, 8 ports, 2 controllers (Active/Passive)).

Sunday, December 28, 2008

Analyzing the impact of the Vxfs filesystem block size on Oracle

I am usually asked what the ideal Vxfs filesystem block size is for an Oracle DB block size of 16K. I always reply - 8K (the maximum on Vxfs).

All along, my reasoning was that if using, say, a 1K filesystem block size, then a 16K Oracle block read would end up as sixteen 1K I/O requests to the filesystem, and the same for writes. With a filesystem block size of 8K, you would be reduced from sixteen 1K requests to two 8K requests - or so I thought.

I decided to test to see what exactly was happening and it proved that I was wrong – at least with respect to Vxfs.

Firstly some background about Vxfs –

Vxfs is an extent based filesystem - meaning it allocates space to files not as blocks, but as extents. An extent is a contiguous set of filesystem blocks. Extent sizes vary, and the method of creation of a file greatly influences extent sizing. As a file grows, more extents are added to it.

The interesting part about Vxfs and extents is that I/O is never split across extents - a request for a contiguous set of blocks within an extent is satisfied with a single request. If split across extents, it will result in multiple I/O requests - quite similar to how db file scattered read would split a request across Oracle extents. From the Vxfs guide -

"By allocating disk space to files in extents, disk I/O to and from a file can be done in units of multiple blocks. This type of I/O can occur if storage is allocated in units of consecutive blocks. For sequential I/O, multiple block operations are considerably faster than block-at-a-time operations. Almost all disk drives accept I/O operations of multiple blocks."

So coming back to Oracle – some test scenarios

I decided to test and see for myself.

The environment is Solaris 9 on a E4900 with Storage Foundation for Oracle Enterprise Edition. Oracle is 10.2.0.3 using VRTS ODM. 

I created 2 tablespaces – one on a 1K filesystem and the other on a 8K filesystem. Each had 1 datafile of size 5g. 


Identical tables with ~1000 rows were created on both the tablespaces.  Indexes were created on both tables on relevant columns.

On a 1K Filesystem block size and a 16K DB Block size

First to confirm 1K block size

root@oracle:> fstyp -v /dev/vx/rdsk/oracledg/test1k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:33 2008
logstart 0 logend 0
bsize 1024 size 15728640 dsize 0 ninode 15728640 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I initiated both sequential and scattered reads on the tables.

A vxtrace showed that Oracle was issuing requests of 16K or larger, and they were single I/Os. They were not broken up into smaller I/O requests as one would normally have expected. I could not use truss because I/O requests show up as ioctl calls when using ODM. There was no read I/O smaller than 32 blocks (16K), confirming that I/Os are not split along filesystem block boundaries.

------------------------------------------
1254 START read vol test1k op 0 block 4326176 len 32 <----- 16K Reads
1254 END read vol test1k op 0 block 4326176 len 32 time 0

--------CUT---------

1260 START read vol test1k op 0 block 4326048 len 128 <------ 64K Reads
1260 END read vol test1k op 0 block 4326048 len 128 time 0
1261 START read vol test1k op 0 block 4326176 len 32
1261 END read vol test1k op 0 block 4326176 len 32 time 0
1262 START read vol test1k op 0 block 4325792 len 128
1262 END read vol test1k op 0 block 4325792 len 128 time 0

------------CUT------------------------------

On a 8K Filesystem block size and a 16K DB Block size

To confirm the block size is indeed 8k

root@oracle:> fstyp -v /dev/vx/rdsk/edwrsdg/test8k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:47 2008
logstart 0 logend 0
bsize 8192 size 655360 dsize 0 ninode 655360 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I issued the same set of reads as for the 1K filesystem and the result was the same.

------------CUT-----------

1265 START read vol test1k op 0 block 4326048 len 128 <------ 64K reads
1265 END read vol test1k op 0 block 4326048 len 128 time 0
1266 START read vol test1k op 0 block 4326176 len 32 <--------- 16K reads
1266 END read vol test1k op 0 block 4326176 len 32 time 0
1267 START read vol test1k op 0 block 4325888 len 32
1267 END read vol test1k op 0 block 4325888 len 32 time 0

------------CUT----------------

So the reads behave exactly as documented. Oracle will do reads only in multiples of the DB block size. On either a 1K or an 8K block Vxfs filesystem, a read of 16K (or a multiple of 16K) is a sequential read of contiguous blocks and is hence satisfied with a single I/O request - as long as the request can be met from a single extent.

So from an IO perspective, it really does not matter if using 1K or 8K.

Now there is another aspect to this - filesystem overhead, fragmentation, extent sizing and space management.

A 1K filesystem block size would reduce space wastage at the cost of having to manage many more blocks (filesystem overhead), whereas an 8K filesystem block size would be ideal for an Oracle instance using a DB block size of 8K or higher.

From a filesystem management perspective, using an 8K filesystem block size makes better sense, as Oracle never stores data in units smaller than the DB block size. An 8K filesystem block size reduces the number of blocks and correspondingly the filesystem overhead of maintaining them. I do not know if anyone uses a 4K DB block size any more - all I have seen are 8K and higher.

To reduce fragmentation, it is best if the datafile uses a single extent (as will be the case when it is created on a database using VRTS ODM). The extent here refers to Vxfs extents and not tablespace extents. To keep a datafile in a single Vxfs extent, datafiles should never be extended; new datafiles should always be added to increase tablespace capacity.

You can find the extents allocated to a file by running vxstorage_stats - an invaluable tool. Fragmentation status can be checked by running fsadm. Normally, when using ODM, fragmentation should be minimal.
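A hedged sketch of both checks (the mount point, datafile name and the vxstorage_stats path are placeholders and vary by Storage Foundation release) -

------------CUT--------------

# Extent fragmentation report for a VxFS filesystem (run as root).
fsadm -F vxfs -E /u05/oradata

# Map a datafile to its VxFS extents, volume and underlying luns.
/opt/VRTSdbed/bin/vxstorage_stats -f /u05/oradata/users01.dbf

--------------CUT-------------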

Wednesday, December 24, 2008

_realfree_heap_pagesize_hint - Assessing the impact on Solaris and Linux

The _realfree_heap_pagesize_hint in 10g provides a mechanism by which the process private memory (PGA) can use bigger memory page sizes and thus reduce TLB/TSB misses. This parameter is set in bytes.

This is especially important for data warehousing, where a session can consume a significant amount of anonymous memory and in many cases the workarea is bigger than the SGA.

I wrote about TLB/TSB misses from an oracle perspective in an earlier blog here.

http://dsstos.blogspot.com/2008/11/assessing-tlbtsb-misses-and-page-faults.html

 
This parameter is designed to work on the Solaris platform only; however, it does work partially on Linux too, and probably the same way on other platforms.

As per this hint,

  1. memory extents within the heap would be in _realfree_heap_pagesize_hint chunks.
  2. And these chunks would, via a memcntl(2) call, be placed on _realfree_heap_pagesize_hint sized OS pages (provided the page size is a valid choice).

For example, an extent of 16MB would be carved up into 4MB chunks and each 4M chunk would be mapped to an individual 4M OS memory page (if _realfree_heap_pagesize_hint = 4M).

Solaris:


Solaris supports four page sizes on the UltraSparc IV+ platform (8K default, 64K, 512K and 4M). The default setting for _realfree_heap_pagesize_hint is 65536, or 64K.
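You can confirm what the hardware/kernel combination supports with pagesize -

------------CUT--------------

# List the supported page sizes; on UltraSparc IV+ with Solaris 10 this should
# print 8192, 65536, 524288 and 4194304.
pagesize -a

--------------CUT-------------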

In order to test this parameter, I did a sort on an unindexed table with approx 3.8 million rows. The average row length was ~243 bytes and the table was approx 1GB in size. The reason I selected such a big table was also to see how memory utilization changed when using different page sizes.

_realfree_heap_pagesize_hint at 65536 (Default)

This implies that when a session requests anon memory, Oracle should use 64K pages. However, this did not seem to be true. With a setting of 65536, only 8K pages were used.

I did a truss of the shadow process when doing the sort and this is what I observed.
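Roughly, the trace was captured along these lines (the pid is a placeholder); the relevant output follows below -

------------CUT--------------

# Follow the session's shadow process during the sort and capture only the
# memory mapping calls, with timestamps, to a file.
truss -d -f -t mmap,munmap,memcntl -p <shadow_pid> -o /tmp/realfree_trace.out

--------------CUT-------------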

-----------CUT-------------------

19167/1: 5.5795 mmap(0x00000000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 3080192) = 0xFFFFFFFF7A5F0000
19167/1: 5.5796 mmap(0xFFFFFFFF7A5F0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A5F0000
19167/1: 5.5813 mmap(0xFFFFFFFF7A600000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A600000
19167/1: 5.5829 mmap(0xFFFFFFFF7A610000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A610000
19167/1: 5.5846 mmap(0xFFFFFFFF7A620000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A620000
19167/1: 5.5863 mmap(0xFFFFFFFF7A630000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A630000

------------------CUT-------------------------------------

As you can see, an extent of size 2M was requested with MAP_NORESERVE and then carved into 64K chunks. However, there is no accompanying memcntl(2) call asking the OS to back the chunks with 64K pages. This is also confirmed when using pmap/trapstat.
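The checks themselves are straightforward (the pid is a placeholder; trapstat needs root) -

------------CUT--------------

# Per-mapping page sizes for the shadow process (Pgsz column).
pmap -xs <shadow_pid>

# System-wide TLB/TSB miss breakdown by page size, refreshed every 5 seconds.
trapstat -T 5

--------------CUT-------------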

trapstat not showing usage of any 64K pages.




pmap output showing anon pages using 8k page size.




Changing the _realfree_heap_pagesize_hint to 512K

Changing the hint to 512K shows that Oracle indeed requests 512K pages from the OS.

------------CUT-------------

19277/1: 14.6646 mmap(0x00000000, 4718592, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 7864320) = 0xFFFFFFFF79780000
19277/1: 14.6647 munmap(0xFFFFFFFF79B80000, 524288) = 0
19277/1: 14.6648 mmap(0xFFFFFFFF79780000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79780000
19277/1: 14.6649 memcntl(0xFFFFFFFF79780000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EC0, 0, 0) = 0
19277/1: 14.6909 mmap(0xFFFFFFFF79800000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79800000
19277/1: 14.6910 memcntl(0xFFFFFFFF79800000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7F80, 0, 0) = 0

---------------CUT-----------------------

As you can see, a memcntl(2) call is issued to request that the OS allocate a 512K page size. This is also corroborated by trapstat and pmap.

trapstat output showing TLB/TSB misses for 512K pages.




pmap output for anon pages showing 512K pages being used.




Changing the _realfree_heap_pagesize_hint to 4M

Changing the hint to 4M also shows that the pagesize being requested is 4M.

Truss output -

-------------------CUT-----------------------

18995/1: 34.0445 mmap(0x00000000, 20971520, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 390070272) = 0xFFFFFFFF53000000
18995/1: 34.0447 munmap(0xFFFFFFFF54000000, 4194304) = 0
18995/1: 34.0448 mmap(0xFFFFFFFF53000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF53000000
18995/1: 34.0449 memcntl(0xFFFFFFFF53000000, 4194304, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EE0, 0, 0) = 0


-----------------CUT-------------------------


Trapstat output confirming usage of 4M pages for anon memory



And finally pmap output.




So we now know that this works as expected, except for the default setting of 64K. So how does this affect performance?

  1. By using bigger page sizes, we can store more virtual to physical entries in the TLB/TSB and reduce TLB/TSB misses.
  2. Bigger page sizes also reduce the number of mmap requests, and thus the CPU spent in system time. For example, a 4M extent would require 512 mmap requests if using the default 8K pages, but only 1 mmap request if using a 4M page size.
  3. So memory requests can be serviced significantly faster when using bigger page sizes.
  4. However, with bigger pages one would expect memory utilization to go up as well. Since the basic unit of a memory request is the page size (8K, 512K or 4M), it is possible that there will be memory wastage.

In order to check for memory wastage, I checked v$sql_workarea_active along with the session pga/uga stats to identify how much memory was consumed with each page size setting. By sizing the PGA and using _smm_max_size appropriately, I ensured that the sort completed optimally and in memory without spilling to disk.
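A hedged sketch of the kind of queries used (SID 123 is a placeholder for the test session) -

------------CUT--------------

sqlplus -s / as sysdba <<'EOF'
-- Active workarea for the sort
select operation_type, actual_mem_used/1024/1024 mem_mb
from   v$sql_workarea_active
where  sid = 123;

-- Session PGA/UGA high water marks
select n.name, s.value/1024/1024 mb
from   v$sesstat s, v$statname n
where  s.statistic# = n.statistic#
and    s.sid = 123
and    n.name in ('session pga memory max', 'session uga memory max');
EOF

--------------CUT-------------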

With the default setting of 64K

Time taken to complete - 30-32 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1102.92 MB
session uga memory - 1102.3 MB

With 512K
Time taken to complete - 24-28 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1103.73 MB
session uga memory - 1102.2 MB

With 4M

Time taken to complete - 24-27 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1112.2 MB
session uga memory - 1103.99 MB

Looking at the above stats, for the same sort operation requiring 1GB of workarea, the PGA usage is a fraction higher (~1%) with bigger page sizes. This may matter for very big sorts or when multiple sessions run simultaneously - especially when doing parallel operations - so there is always the chance that you may end up with ORA-4030 errors if you do not configure your instance appropriately.

Theoretically the timings should improve because of the smaller number of mmap operations and the reduced TLB/TSB misses. All in all, it probably makes sense to use this feature to enable bigger page sizes for data warehousing.

On Linux

On Solaris the _realfree_heap_pagesize_hint works well since four page sizes (8K, 64K, 512K and 4M) are supported and can be allocated dynamically. On Linux, however, only two page sizes are supported (4K and 2M). The 2M page size can be allocated only as huge pages, which are used for the SGA; huge pages cannot be used for private process memory.
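On Linux the base page size and the huge page pool can be checked quickly -

------------CUT--------------

# Base page size (4096 on x86) and huge page configuration; huge pages back the SGA only.
getconf PAGESIZE
grep -i huge /proc/meminfo

--------------CUT-------------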

So on Linux, setting _realfree_heap_pagesize_hint to bigger values only results in _realfree_heap_pagesize_hint sized chunks within extents; the chunks are not mapped to physical memory pages of the same size. This still reduces the number of mmap requests and is thus better than the default.

With the default setting of 64K

------------CUT-------------

mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0xf1) = 0xb70f1000
mmap2(0xb70f1000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb70f1000
mmap2(0xb7101000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb7101000

-------CUT-----------

As you can see from above, 64K chunks are requested.

Changing to 4M

----------CUT-----------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x36f1) = 0xb2af1000
mmap2(0xb2af1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2af1000
mmap2(0xb2ef1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2ef1000
mmap2(0xb32f1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb32f1000

---------CUT---------
As you can see from the above, with a setting of 4M the chunks are 4M sized; however, there is no request for a 4M page size as this is not feasible on Linux.

Changing to 8M

I was curious to see how this would play out when changing to 8M.

--------CUT-------------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x5af1) = 0xb02f1000
mmap2(0xb02f1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb02f1000
mmap2(0xb0af1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb0af1000

--------CUT--------------

The chunks are now 8M in size. I noticed the same behavior on Solaris too (minus the memcntl call to request an appropriate OS page size).




Thursday, November 6, 2008

Assessing TLB/TSB Misses and Page Faults


I was trying to explain TLB misses and page faults to my team mates when I realized I was not 100% confident about them myself. I spent some time reading up Solaris Internals on this topic. I also wrote to our Sun contact and got first hand information from Sun. The below is a rather simplified description of TLB/TSB misses and page faults.

Basics

Memory is divided into page sized chunks. Supported page sizes depend on the hardware platform and the operating system. The current UltraSparc platform running Solaris 10 supports 8K, 64K, 512K and 4M pages. The cool threads servers (T2000 and newer versions) running Solaris 10 also support 256M pages (512K is not supported).

The Terminology - TLB, TSB, HPT, HME and TTE

When a process requests memory, only virtual memory is allocated; physical memory is not allocated yet. The first time the process accesses a page within the allocated virtual memory, a page fault occurs. As a result, a physical page (from the free lists) is mapped to the virtual page of the process. This mapping is created by the virtual memory system in software and stored in the HPT (Hash Page Table) in the form of HAT Mapping Entries (HME). A copy of the entry is also inserted into the TLB and the TSB as Translation Table Entries (TTE).

The TLB or Translation Lookaside Buffer is a cache on the CPU of the most recently used virtual to physical memory mappings (TTEs). There are multiple TLBs on the CPU: the iTLB stores entries for text/libraries and the dTLB stores entries for data (heap/stack). The number of entries in either TLB is limited and depends on the CPU. For example, on the UltraSparc IV+ CPU the iTLB can store 512 entries, and there are 2 dTLBs, each of which can store 512 entries.

Since the number of entries in the TLB is limited, there is a bigger cache of the TTEs in physical RAM called the TSB (Translation Storage Buffer). Each process has its own dedicated TSB. The default and maximum size (up to 1MB per user process) that a user process TSB can grow to can be changed in Solaris 10. The TSB grows and shrinks as need be, and each process has 2 TSBs - one for 8K, 64K and 512K pages and the other for 4M pages. The maximum memory that can be allocated to all the user TSBs can also be specified. Finally, an entry in the TSB requires 16 bytes, so it is easy to work out the TSB size needed to hold a specified number of entries.

Page Faults

The CPU first checks the TLB for the TTE and, if it is not found (TLB miss), checks the TSB. If it is not present in the TSB (TSB miss), it then checks the HPT for the HME. If it is not present in the HPT, the result is a page fault.

A minor page fault happens when the HME is not present in the HPT but the contents of the requested page are in physical memory. The mappings need to be re-established in the HPT and the TSB and TLB reloaded with the entries.

A major page fault happens when the HME is not present in the HPT and the contents of the requested page have been paged out to the swap device. The requested page needs to be mapped back into a free page in physical memory and the contents copied from swap into that physical page. The entries are stored in the HPT and the TSB and TLB are reloaded again with the entries.

Swap and Page in/Page out

Each physical memory page has a backing store identified by a file and offset. A page out occurs when the physical page contents are migrated to the backing store, and a page in is the reverse. Anonymous memory (heap and stack) uses swap as the backing store. For file caching, Solaris uses the file on disk itself as the backing store. Swap is a combination of the swap device (on disk) and free physical memory.

Why and when do I need to worry about TLB/TSB misses and Page Faults?

As RAM gets cheaper, it is commonplace to see entry level systems with 16GB of memory or more as a starting point. This is true for both x86 and proprietary Unix systems. With more available physical memory, a DBA configures Oracle with bigger SGA and PGA sizes to take advantage of it. While the above discussion is focused entirely on the Sparc platform, the concept of pages, TLBs and page tables is present on all systems.

If using 8K pages (Solaris) and there is 16GB of memory, then one would require ~2 million mappings to address the entire physical memory. If using 4K pages (Linux), the number of mappings would be ~4 million. For maximum efficiency, the relevant entries must be accessible to the CPU with minimal delay - in the TLB preferably, or at worst in the TSB. However, we know the number of entries the TLB can hold is limited by hardware. The TSB for a single user process (in Solaris 10 only) can be grown to a max of 1MB (65,536 entries), so it is limited too. It would not make sense to search the HPT for every TLB/TSB miss as it costs CPU cycles to search the hash mappings for the required entries. And we must avoid page faults as much as possible.

From an Oracle perspective, if CPU wait is one of your top waits, you have ruled out other issues such as available CPUs, CPU scheduling etc., and you are seeing a significant increase in page faults, then it probably makes sense to look deeper into TLB/TSB misses. As always, it pays to work on the area which can potentially deliver the biggest impact to customer experience. From my experience, the impact of TLB/TSB misses on an Oracle instance can at times be over emphasized (on Solaris platforms), so you are the best judge of whether this requires further analysis.

What do I need to measure?

Okay, so we get the idea that more RAM and bigger memory working sets mean more mappings, and it is not possible to cache all the entries in the TLB/TSB. So it is inevitable that there are going to be TLB/TSB misses and possibly page faults. But how do I put a price on it? How costly is a miss? How much time is spent on servicing these misses? The answer lies in using trapstat to check the percentage of time spent by the CPU servicing TLB/TSB misses. Unfortunately the tool does not give an estimate of the time spent servicing major/minor faults. To identify the number of page faults, one uses vmstat or kstat.

How do I measure and analyze the impact?

Running trapstat -T will show the TLB/TSB misses broken down by page size. Trapstat needs to be run as root. As you can see below, it shows the % time spent in user mode (u) and kernel mode (k), for both TLB and TSB misses, with a page size breakdown.

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
(per-CPU rows for 8k, 64k, 512k and 4m pages, in user (u) and kernel (k) mode)
==========+===============================+===============================+====
      ttl |       619  0.0         0  0.0 |      4137  0.0       300  0.0 | 0.0

The last line gives the overall statistics for all the CPUs. If you are seeing around 20% or more time (%tim) spent on servicing TLB/TSB misses, then it probably makes sense to revisit the page sizing for your instance.

Page faults can be observed through vmstat (minor), vmstat -s (major and minor) and kstat (major and minor). The stats from vmstat -s and kstat (reported per CPU) are cumulative in nature.

mkrishna@OCPD:> vmstat 1
 kthr      memory                 page                         disk           faults        cpu
 r b w   swap      free     re  mf   pi po fr de sr s0 s1 s6 s9    in   sy   cs  us sy id
 0 0 0 41966192 34216352    0 5063   0  0  0  0  0  0  0  0  0    559  794  811   7  1 92
 0 0 0 41925744 34175896    0 4995   0  0  0  0  0  0  0  0  0    517  767  745   7  1 92

mkrishna@OCPD:> vmstat -s
         0 micro (hat) faults
2377024933 minor (as) faults
  16504390 major faults

mkrishna@OCPD:> kstat |egrep 'as_fault|maj_fault'
        as_fault                        142199182
        maj_fault                       984358

A dTSB miss results in a search of the HPT for the relevant HME. If it is not found in the HPT, the result is a page fault. So perhaps a percentage of the time spent on dTSB misses can be assumed to be spent on servicing page faults (minor and major)? I do not know for sure and could not find out from Sun either.

Since there will always be page faults when a virtual memory page is accessed for the first time, we cannot eliminate them completely. By definition, major page faults are bad; minor page faults are better than major page faults but still need to be avoided. Ideally minor faults should be far greater than major faults. In well configured environments I have seen the ratio of major to minor faults be < 0.5%. Major faults can occur when there is a memory shortage and heavy page outs/swap outs. I have also seen a higher number of major faults when there is extensive filesystem data caching or double buffering happening on Oracle databases.

How do I reduce TLB/TSB misses and Page Faults from an Oracle perspective?

Theoretically, to reduce the incidence of TLB/TSB misses and page faults, one would use bigger page sizes to reduce the number of entries required to map a segment and use an optimally sized TSB to prevent TSB misses (the TLB being fixed in size). This assumes that you have configured the instance correctly to fit within the available physical memory. The below would be a practical way to implement it.

  1. Reduce thread migrations (harden affinity to CPUs) - Thread affinity will ensure a thread is executed on the same CPU as before. This improves the chances that the entries for the running thread are already present in the TLB on that CPU. Thread migrations can be seen using mpstat (migr column). Thread affinity is set via the system parameter rechoose_interval. The default value for rechoose_interval is 3. For a data warehouse system, I normally set it to 150.

  2. Oracle shared memory - Oracle uses shared memory (SGA) and private anonymous memory (PGA). On Solaris, Oracle uses ISM for shared memory. ISM, along with other benefits, enables the use of 4M pages and so already uses the biggest possible page size on the UltraSparc IV+ platform running Solaris 10. Also, for processes sharing the same segments, the TSB is shared. So by default, when using ISM for the SGA, Oracle is already well optimized for minimal TLB/TSB misses. For the cool threads platform (Solaris 10), a mix of 256M and 4M page sizes is used for ISM segments, which is even better optimized.

  3. Oracle PGA - For your PGA or private memory, the page size setting is controlled by the parameter _realfree_heap_pagesize_hint (10g). The default value is 64K, so 64K pages should be used; however, it does not seem to be so. I have observed that when set at 64K, only 8K pages are used. Setting it to 512K or 4M does indeed change the page size for PGA usage to 512K or 4M. Setting this parameter results in memory being allocated in _realfree_heap_pagesize_hint sized chunks (64K, 512K, 4M) and so can potentially result in wastage of memory and starve other sessions/applications of physical memory. Setting this to 512K/4M also reduces your page faults considerably.

  4. TSB configuration - Increase the size of the default startup TSB (Solaris 10) to prevent TSB misses. One entry in the TSB requires 16 bytes, so depending on your memory allocation to the SGA and PGA you can set the default TSB size accordingly. Each process can have up to 2 TSBs, with one of them dedicated to servicing 4M page entries. There are several configuration parameters that can be set in /etc/system (see the sketch after this list).
     a. default_tsb_size - The default value is 0 (8KB). An 8KB TSB will hold 512 entries. For Oracle, you have to consider both PGA and SGA usage. Let us assume you have configured 12GB for your SGA (using ISM with 4M pages as default) and 6GB of PGA (using a 4M page size). 12GB of SGA would require 3072 entries or a 48KB TSB. 6GB of PGA would result in a global memory bound of ~700MB (serial operations - 175 pages of 4M each) or ~2100MB (parallel operations - 525 pages of 4M each). So in this case a default_tsb_size of 8K would be too small and would get resized frequently. A default size of 32KB (default_tsb_size = 2), which can then grow accordingly (to a max of 1M), would be preferable. The problem with bigger default sizes is that they consume physical memory, which is however capped by tsb_alloc_hiwater_factor.
     b. tsb_alloc_hiwater_factor - Default is 32. This setting ensures that total TSB usage on the system for user processes does not exceed 1/32 of physical memory. So if you have 32GB of memory, total TSB usage will be capped at 1GB. If you have lots of memory to spare and expect a high number of long lived sessions connecting to the instance, then this can be reduced.
     c. tsb_rss_factor - Default is 384. tsb_rss_factor/512 is the threshold beyond which the TSB is resized; the default setting is 75% (384/512). It probably makes sense to reduce this to 308 so that the TSB gets resized at 60% utilization.
     d. tsb_sectsb_threshold - In Solaris 10, each process can have up to 2 TSBs - one for 8K, 64K and 512K pages and one for 4M pages. This setting controls the number of 4M mappings the process must have before the second TSB for 4M pages is initialized. It varies by CPU; for an UltraSparc IV, the default is 8 pages.

  5. To reduce page faults from user sessions, change _realfree_heap_pagesize_hint from 64K to either 512K or 4M. Also use ODM or direct I/O and avoid filesystem buffering for Oracle data files.

  6. Finally, ensure that the memory requirements of Oracle can be met entirely within physical memory.
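The /etc/system settings discussed in item 4 might look like the sketch below - illustrative values only, taken from the discussion above and to be validated against your own Solaris 10 release and memory footprint.

------------CUT--------------

* Start user TSBs at 32KB instead of 8KB (default_tsb_size=0 means 8KB).
set default_tsb_size = 2
* Cap total user TSB memory at 1/32 of physical memory (the default).
set tsb_alloc_hiwater_factor = 32
* Resize a TSB at ~60% utilization instead of the default 75% (384/512).
set tsb_rss_factor = 308

--------------CUT-------------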

Tuesday, September 30, 2008

Redo configuration for a High Transaction Environment

We resolved a number of LGWR issues and the steps we took are detailed in the presentation.


Redo Configuration for Sustained and Scalable Performance in a Real World Environment
Krishna Manoharan - krishmanoh@gmail.com

1. Background
During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted. Among other things, higher log related waits were noticed when conducting the analysis. A study was performed to reduce/eliminate log related waits.

2. The Environment
  • Oracle 10.2.0.3 Enterprise Edition 64bit on Solaris 9
  • Database size - 4 TB
  • SunFire E4800 with 12 CPUs, 48GB RAM and 2 IO boards
  • Work load - numerous small transactions due to replication and large batch processes through ETL. Operations in logging mode. Commits range from every row change (replication) to commits after million+ row changes (ETL).

3. Work Load Profile (redo) of the instance (Peak Hours)
          Stat                        Value
  Session commit wait requested       72/sec
  Session commit wait performed       72/sec
  Session redo entries generated      13326 entries/sec
  Session redo size                   9.45 MB/sec
  LGWR    redo synch writes           72 sync writes/sec
  LGWR    redo writes                 74 writes/sec
  LGWR    lgwr throughput             9.8 MB/sec

4. Objectives of the study
  • Deliver predicted scalability by reducing/eliminating log related waits, improving on log related statistics, and eliminating LGWR/ARCH as a bottleneck.
  • Performance as measured by improved transaction rates and meeting pre-defined thresholds for waits and stats.
  • Build a standard for an optimal and scalable redo/archive configuration.
  • Performance tuning measure - not capacity planning. Limited to configuration only, no code changes.

5. Symptoms from Oracle - Top log related wait statistics (Peak Hours)
  Event                                     Waits/sec        Existing (avg wait)   Threshold (avg wait)
  latch: redo allocation                    Negligible       Negligible            Eliminate
  latch: redo writing                       Negligible       Negligible            Eliminate
  log buffer space                          3.5 waits/sec    28.6 ms               Eliminate
  log file parallel write                   74 waits/sec     11.8 ms               < 2 ms
  log file sequential read                  10 waits/sec     7.02 ms               < 10 ms
  log file switch completion                0.5 waits/sec    20.82 ms              ?
  log file sync                             72 waits/sec     26.45 ms              < 5 ms
  log file switch (checkpoint incomplete)   0.5 waits/sec    0.38 ms               Eliminate

6. Symptoms from Oracle (contd.) - Top log related instance statistics (Peak Hours)
          Statistic                        Existing                                    Threshold
  LGWR    redo sync response time          0.27 ms/sync write                          < 0.1 ms
  LGWR    redo write response time         0.12 ms/write                               < 0.1 ms/write
  Session redo buffer allocation retries   0.0002 retries/entry, 3.8 retries/second    Eliminate
  Session redo log space requests          0.000081/entry, 1 request/second            Eliminate
  Session redo log space wait time         0.12 ms/wait for space                      < 0.1 ms/wait for space

7. Symptoms from the System - Top system related statistics (Peak Hours)
          Parameter                                        Before                          Threshold
  CPU     Average run queue                                1.2                             No change
  CPU     Average involuntary context switches for LGWR    30%                             Eliminate
  Storage Average redo lun response time                   8 ms (reads) / 12 ms (writes)   10 ms (reads) / 2 ms (writes)
  Storage Average file response time (redo logs)           8 ms (reads) / 12 ms (writes)   10 ms (reads) / 2 ms (writes)
  Storage Average redo volume response time                8 ms (reads) / 12 ms (writes)   10 ms (reads) / 2 ms (writes)

8. Existing Log Configuration (Instance)
  • No _ parameters set.
  • log_buffer - default (seen as 4MB)
  • Redo log groups - 3
  • Size of members - 500M
  • log_archive_max_processes - default (seen as 2)
  • Using VRTS ODM and Vxfs

9. Existing System/Storage Configuration
  • Default scheduling class (TS) and default priorities for Oracle.
  • Thread affinity set to 150.
  • Storage Foundation 4.1 MP2 for Oracle Enterprise Edition.
  • maxphys set to 8M (system and Vxvm).
  • Lun queue depth - 30, with a max of 256/target.
  • All luns - Raid 1 using 72GB, 15K RPM FC drives.
  • Storage - Hitachi AMS1000.
  • Dual simultaneous active (2Gbit) paths to each lun. Load balancing via vxdmp.

10. Existing Physical Log Configuration (Filesystem)
  • Filesystem 1 (vxfs), /u05/redo1: LOG01A.dbf - 500M (primary of group 1), LOG02A.dbf - 500M (primary of group 2), LOG03A.dbf - 500M (primary of group 3)
  • Filesystem 2 (vxfs), /u05/redo2: LOG01B.dbf - 500M (mirror of group 1), LOG02B.dbf - 500M (mirror of group 2), LOG03B.dbf - 500M (mirror of group 3)
  • Both filesystems are carved from a single Raid 1 lun of 66.4GB built on 72GB, 15K RPM FC drives.

11. Log Waits - Schematic
  (Diagram: user sessions write redo into the private redo strands and the shared log buffer in the SGA (wait: log buffer space); LGWR writes the log buffer to the current redo log group (waits: log file parallel write, log file sync, log file switch completion); ARCH reads the previous group (wait: log file sequential read) and writes the archive log files (wait: log archive I/O). The CPU and IO subsystems sit underneath all of these.)

12. Analysis of the symptoms - LGWR related
  Wait - log file parallel write - "Writing redo records to the redo log files from the log buffer. The wait time is the time it takes for the I/Os to complete." High average wait time (11.8 ms).
  Correlating stats:
  • High lun response time for the redo log filesystems.
  • High redo sync time (0.27 ms/sync write).
  • High redo write time (0.12 ms/write).
  • High redo synch writes (72 writes/sec).
  • Higher buffer allocation retries (3.8 retries/sec).
  • High degree of involuntary context switches for the LGWR process.

13. Analysis of the symptoms (contd.) - log file parallel write (contd.)
  With a high degree of commits, it is more important to review the average response time per wait rather than the number of waits. Every commit write wait results in an increment to the log file parallel write event. The physical layout of the redo log filesystems shows a single lun used for all the groups. Since redo log members are relatively small, it is common practice for the storage/system admin to assign a single lun which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR. A slower LGWR also results in poor commit performance (sync writes), as evidenced by the correlating stats, and in higher buffer allocation retries because the LGWR is unable to write the redo entries and flush the buffer to disk fast enough to meet the session requirements. The overall run queue on the system was low; however, involuntary context switching (~30%) indicated that LGWR was being switched off the CPU before it could complete its task. The high lun response time for the redo log filesystems indicated that IO was a bottleneck.

14. Analysis of the symptoms (contd.) - LGWR related
  Wait - log file switch completion - "Waiting for a log switch to complete."
  Wait - log file switch (checkpoint incomplete) - "Waiting for a log switch because the session cannot wrap into the next log. Wrapping cannot be performed because the checkpoint for that log has not completed."
  Large number of waits (0.5 waits/sec) with a high average wait time (20.82 ms).
  Correlating stats:
  • redo log space requests - 1 request/sec
  • redo log space wait time - 0.12 ms/entry

15. Analysis of the symptoms (contd.) - log file switch completion (contd.)
  During a log file switch, redo generation is disabled, so this wait directly impacts session performance. The log members were only 500M in size, causing frequent log switches (every 1 minute), which results in higher waits. The log_buffer is 4M in size, and during a log switch the log buffer is flushed to disk. If there is an IO bottleneck to the redo log files, then flushing 4M of log buffer could result in higher response times. Since the redo log groups were on the same set of filesystems, there could possibly be an IO conflict between the checkpoint and LGWR processes when doing a log switch, as shown by the wait log file switch (checkpoint incomplete). However, a bigger log file can also cause slower log file switches; the impact of increasing the log member size needs to be studied with respect to the event log file switch completion.

16. Analysis of the symptoms (contd.) - ARCH related
  Wait - log file sequential read - "Waiting for the read from this logfile to return. This is used to read redo records from the log file - either for recovery or archival. The wait time is the time it takes to complete the physical I/O (read)."
  High number of waits (10 waits/sec) with a high average wait time (7 ms).
  Correlating stats:
  • High lun response time for the redo log filesystems.
  • Event - log file parallel write (high average wait time - 11.8 ms).

17. Analysis of the symptoms (contd.) - log file sequential read (contd.)
  Small redo log members cause frequent log switches (1/minute). These logs need to be archived, which indirectly impacts the event log file sequential read. Members of the redo groups were located on the same filesystems and share the same physical luns. This results in IO contention because the ARCH process is reading from the previous group while the LGWR is writing to the present group. This in turn impacts LGWR write performance, resulting in increased response times for the events log file parallel write and log file sync. Poor archival performance can also indirectly impact log switches, as reported in the event log file switch (archival incomplete), and thus session performance. For 500M log members, the average response time was on the higher side, again indicating IO contention. Since redo log members are relatively small, it is common practice for the storage/system admin to assign a single lun for all the redo filesystems; the access being sequential, this problem is multiplied in effect - especially if the lun is Raid 5. Increasing log file sizes can also cause this event to report higher wait times.

18. Analysis of the symptoms (contd.) - Session related
  Wait - log buffer space - "Waiting for space in the log buffer because the session is writing data into the log buffer faster than LGWR can write it out."
  High number of log buffer space waits (3.5 waits/sec) with an average response time of 28.6 ms.
  Correlating stats:
  • Event - log file parallel write (high average wait time).
  • redo buffer allocation retries (3.8 retries/sec).

19. Analysis of the symptoms (contd.) - log buffer space (contd.)
  This, along with the high response time for the log file parallel write wait, shows a slow LGWR. The presence of higher redo log buffer allocation retries also correlates with this wait. It can also mean that the default log buffer (4MB) is too small for the rate of redo generation (9.45 MB/sec). During a log switch, LGWR flushes the log_buffer to disk, so the impact of increasing the size of the log_buffer needs to be analyzed with respect to the event log file switch completion.

20. Analysis of the symptoms (contd.) - Session related
  Wait - log file sync - "When a user session commits, the session's redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished posting, it will post the user session. The wait time includes the writing of the log buffer and the post."
  The average wait time was 26.45 ms.
  Correlating stats:
  • Event - log file parallel write (high average wait time - 11.8 ms).
  • High degree of involuntary context switches for both user sessions and LGWR.

21. Analysis of the symptoms (contd.) - log file sync (contd.)
  Every commit write wait/immediate will result in an increment of the wait counter and a redo write (resulting in an increment to the log file parallel write wait counter). Rather than the number of waits, the average wait time is important for this wait event. Under ideal circumstances, the average wait time for a log file sync event should be about the same as the average wait time for the wait log file parallel write. If there is a difference, it probably indicates a CPU bottleneck for the session. Higher wait times can be a result of a slow LGWR as well as a CPU bottleneck (evidenced by high involuntary context switches for session processes).

22. Initial Conclusions
  From the waits and stats, we came to the following conclusions.
  LGWR:
  • The underlying IO subsystem for the redo logs needed to be improved.
  • The redo log members needed to be resized from 500M to a suitable size. Also increase the groups from 3 to 4.
  • Reduce LGWR involuntary switches by addressing OS scheduling issues.

23. Initial Conclusions (contd.)
  ARCH:
  • Separate the redo log groups onto dedicated filesystems to prevent contention between ARCH and LGWR.
  Session:
  • Increase the log buffer from the default to a suitable value, taking into consideration the impact on the event log file switch completion.

24. Final Configuration Details
  After 30 or so runs, we finally arrived at the optimal configuration below.
  Redo filesystem configuration (to address IO issues):
  • Striped filesystems on dedicated Raid 1 luns configured for the redo logs as shown in the next slide.
  • Filesystem is vxfs with 8k block size. Stripe width = 1M.
  LGWR configuration (to address involuntary context switches):
  • The FX scheduling class was set for the LGWR process. The CPU time quantum was increased to 1000 and the priority set to 59.
    # priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid
  • Thread affinity was set to 150 for the entire system; however, we decided it was best to bind LGWR to a specific CPU.
    # psrset -c
    # psrset -b 1

25. New Filesystem Layout and redo group placement
  • Filesystem 1 (vxfs, layout=stripe, stripe width=1M), /u05/redo1: LOG01A.dbf - 1500M (primary of group 1), LOG03A.dbf - 1500M (primary of group 3)
  • Filesystem 2 (vxfs, layout=stripe, stripe width=1M), /u05/redo2: LOG02B.dbf - 1500M (mirror of group 2), LOG04B.dbf - 1500M (mirror of group 4)
  • Filesystem 3 (vxfs, layout=stripe, stripe width=1M), /u05/redo3: LOG02A.dbf - 1500M (primary of group 2), LOG04A.dbf - 1500M (primary of group 4)
  • Filesystem 4 (vxfs, layout=stripe, stripe width=1M), /u05/redo4: LOG01B.dbf - 1500M (mirror of group 1), LOG03B.dbf - 1500M (mirror of group 3)
  • The filesystems are built on two Raid 1 luns of 66.4GB each, on 72GB, 15K RPM FC drives.

26. Final Configuration Details (contd.) - Redo groups
  4 redo groups configured with 2 members each. The log members were placed on the redo filesystems in such a manner as to eliminate LGWR and ARCH IO contention. Each member was 8G in size (8G log members reduce the log switches from 1 switch per minute to 1 switch every 7 minutes). Reducing log switches improves performance, as redo generation is disabled during a log switch. 8G was an ideal size - log archiving completed within 2 minutes whereas log switches happened every 7 minutes. Increasing the log member size resulted in higher wait times for the events log file switch completion and log file sequential read; however, the overall performance gain was well worth it.

27. Final Configuration Details (contd.) - Session
  The log buffer was set to 72M (after several iterations). A 72M log buffer along with 8G log file members resulted in a higher response time for the event log file switch completion; however, we completely eliminated the wait event log buffer space (even when simulating 1.5x load). 72M appears to be an ideal size for a redo generation rate of up to 14MB/sec. The _log_io_size is set to a maximum of 1M irrespective of the log_buffer size once the log_buffer crosses ~6MB. Also, since we had a storage subsystem which was quite capable of handling up to 32M in a single write within acceptable response time, we did not downsize the _log_io_size.

28. Final Configuration Details (contd.) - Session (contd.)
  Improving the LGWR write performance, however, resulted in redo allocation latch contention. To reduce it, we increased the parallelism for the shared redo buffer from the default of 2 to 12:
    _log_parallelism_max = 12   # Default is 2. Max - limited to CPU count
    _log_parallelism = 4        # Default is 1
  By enabling log parallelism, the shared log buffer is split into _log_parallelism_max sections, each assigned a redo allocation latch. As per Oracle documentation, the redo allocation latch for the shared log buffer is randomly assigned to the requestor and then allocated round-robin. We did notice that this was not an optimal way of assignment.

29. Final Configuration Details (contd.) - ARCH
  The ARCH process reads OS sized blocks as set by the _log_archive_buffer_size parameter. The default and maximum value on Solaris with Oracle 10g is 2048 OS blocks (which equates to 1MB reads). So the archive log filesystem was also created as a striped filesystem with a 1MB stripe width. Performance improved as the redo log filesystems and the archive filesystems were both striped filesystems with a 1MB stripe width. Average ARCH throughput was around 150MB/sec. However, we did notice that the ARCH process reads from the primary group member only; it does not read simultaneously from both members. We did not change log_archive_max_processes from the default (2).

30. Final Results
  Peak work load showed an improvement of 7x. The least improvement was 4x. At 1.5x load, the scalability was near linear.

31. The results - Work Load Profile (redo)
          Stat                     Before               After
  Session commit wait requested    72/sec               520/sec
  Session commit wait performed    72/sec               520/sec
  Session redo entries generated   13326 entries/sec    14677 entries/sec
  Session redo size                9.45 MB/sec          10.1 MB/sec
  LGWR    redo synch writes        72 sync writes/sec   520 sync writes/sec
  LGWR    redo writes              74 writes/sec        845 writes/sec
  LGWR    lgwr throughput          9.8 MB/sec           10.75 MB/sec

32. The results - Waits
  Event                                     Before (waits/sec, avg wait)   After (waits/sec, avg wait)   Threshold (avg wait)
  latch: redo allocation                    Negligible, Negligible         0.002 waits/sec, 0.9 ms       Eliminate
  latch: redo writing                       Negligible, Negligible         0 waits/sec, 0 ms             Eliminate
  log buffer space                          3.5 waits/sec, 28.6 ms         0 waits/sec, 0 ms             Eliminate
  log file parallel write                   74 waits/sec, 11.8 ms          845 waits/sec, 0.55 ms        < 2 ms
  log file sequential read                  10 waits/sec, 7.02 ms          10.5 waits/sec, 16.62 ms      ?
  log file switch completion                0.5 waits/sec, 20.82 ms        0.02 waits/sec, 31.5 ms       ?
  log file sync                             72 waits/sec, 26.45 ms         519 waits/sec, 2.13 ms        < 5 ms
  log file switch (checkpoint incomplete)   0.5 waits/sec, 0.38 ms         0 waits/sec, 0 ms             Eliminate

33. The results - Stats
  (Stats table not captured in this transcript.)

34. The results - System
          Parameter                                        Before                          After                             Threshold
  CPU     Average run queue                                1.2                             1.2                               No change
  CPU     Average involuntary context switches for LGWR    30%                             < 0.1%                            Eliminate
  Storage Average redo lun response time                   8 ms (reads) / 12 ms (writes)   16 ms (reads) / < 1 ms (writes)   10 ms (reads) / 2 ms (writes)
  Storage Average file response time (redo logs)           8 ms (reads) / 12 ms (writes)   16 ms (reads) / < 1 ms (writes)   10 ms (reads) / 2 ms (writes)
  Storage Average redo volume response time                8 ms (reads) / 12 ms (writes)   16 ms (reads) / < 1 ms (writes)   10 ms (reads) / 2 ms (writes)

35. Final Thoughts
  In order of biggest impact to performance (in descending order):
  1. IO subsystem (50%)
  2. Redo group layout and sizing of log file members (20%)
  3. CPU scheduling (15%)
  4. Log buffer (10%)
  5. Log parallelism (5%)
  The LGWR process in 10g is incredibly efficient, requiring minimal tuning. However, it would have been ideal if there were a dedicated LGWR for each shared strand - one can only imagine the performance gain with multiple LGWRs, each servicing distinct log buffers.
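For reference, a hedged consolidation of the settings described in the "Final Configuration Details" slides above - the hidden parameters and scheduling changes are specific to this workload and should be tested before use; pids and processor set ids below are placeholders.

------------CUT--------------

# Instance settings (spfile); hidden parameters require the double quotes.
sqlplus -s / as sysdba <<'EOF'
alter system set log_buffer = 75497472 scope=spfile;               -- 72M
alter system set "_log_parallelism_max" = 12 scope=spfile;
alter system set "_log_parallelism" = 4 scope=spfile;
EOF

# LGWR scheduling - FX class, priority 59, time quantum of 1000 (as on slide 24).
priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid <lgwr_pid>

# Bind LGWR to a dedicated processor set.
psrset -c                  # creates a processor set and prints its id
psrset -b 1 <lgwr_pid>     # bind LGWR to processor set 1

--------------CUT-------------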

Thursday, August 7, 2008

Oracle Database Capacity Planning

I do capacity planning for the EBI infrastructure at the place I work and created a presentation to share the methodology with my team.

Read this document on Scribd: Oracle Database - Capacity Planningv3

Oracle Database – Capacity Planning
Krishna Manoharan
krishmanoh@gmail.com

Introduction – Capacity Planning

Capacity planning is essential to deliver a predetermined, optimal and consistent user experience throughout the lifecycle of a solution. Capacity planning refers to identifying the changes (from a capacity perspective only) that need to be made to the environment to maintain this predefined user experience over the lifecycle of the solution. In the simplest of terms, these changes can refer to adding more CPU/memory/storage/network capability, along with suitable configuration changes to the application (grid, version upgrades, 64-bit vs 32-bit etc.) as and when they are identified to be required.

Capacity Planning or Performance Tuning?

Capacity planning is pro-active whereas performance tuning is mostly reactive. Capacity planning is anticipating demand ahead of time and recommending suitable changes to the environment. Capacity planning (unlike performance tuning) is not an exact science, in the sense that it requires some guesswork based on prior history and experience with the environment. Performance tuning, I would say, is trying to get the best out of the existing infrastructure – for example, rewriting SQL, creating an index etc. When a user complains of poor application performance, it is important to establish whether it is a capacity constraint resulting in a sub-optimal user experience or a code/application issue. Capacity planning can help identify performance issues early on.

Capacity Planning Model

The model is a loop: collect stats and create a profile; establish thresholds and define an action plan for threshold violations (a one-time activity); monitor, updating the profile with stats on a regular basis; establish the pattern and behaviour and predict possible threshold violations. When a threshold is violated, determine whether it is a performance (application) issue or a capacity constraint, and resolve or change based on the action plan until the thresholds are back under control. New requirements feed back into stats collection.

Collecting stats & Profiling

The first step is to identify suitable stats and capture them (assuming the application is in steady state). Statistics come from the application as well as from the infrastructure (CPU/memory/storage etc.). Then, one needs to start profiling the application. Profiling the environment will help in:
• Understanding the needs of the environment
• Correlating statistics from the application with the infrastructure
• Charting and predicting growth using the previously established thresholds, and
• As a result – proper capacity planning to meet the growth.

Profiling an application (contd.)

A profile is basically a snapshot of the application. It enables you to see how it is performing, with key statistics and changes over a period of time. Profiling can in turn help identify performance issues and bottlenecks as an additional benefit during the process of capturing statistics. Once profiling is done, the next step is to establish thresholds.

Thresholds

Thresholds indicate your comfort level – for example, redo cannot exceed 50GB/day, beyond which I need to revisit my redo configuration. Thresholds need to be defined and set for the key statistics in the profile. You also identify the course of action to be followed if a threshold is violated. Reviewing the key statistics in the profile on a daily/weekly basis will allow you to plan in advance what changes need to be made.
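To make the redo-per-day example concrete, a minimal sketch of such a threshold check could look like the following. The 50GB figure is just the illustrative threshold from above, it assumes the database runs in archivelog mode, and the sysdba connection method is an assumption; adjust to your environment.

#!/bin/sh
# Sketch: check redo archived in the last 24 hours against a daily threshold.
THRESHOLD_GB=50
sqlplus -s "/ as sysdba" <<EOF
set pagesize 0 feedback off
select case
         when sum(blocks * block_size) / 1024 / 1024 / 1024 > $THRESHOLD_GB
         then 'THRESHOLD VIOLATED: '
         else 'OK: '
       end || round(sum(blocks * block_size) / 1024 / 1024 / 1024, 1) || ' GB of redo in last 24 hrs'
from   v\$archived_log
where  completion_time > sysdate - 1
and    dest_id = 1;   -- adjust if you archive to more than one destination
EOF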
Oracle perspective

How can I do capacity planning on a pro-active basis for my Oracle instance?

Oracle – Capacity planning

The answer lies in reviewing, collating and corroborating Oracle statistics with statistics from the other subsystems such as the OS, storage and network over a period of time. The key is to know which statistics to look at, how to interpret the numbers and how to establish thresholds. It is essential to know when to drill down into session-level stats and when to stick to the top level, as otherwise the stats will become overwhelming. Not to forget: capacity planning is pro-active whereas performance tuning is mostly reactive.

Oracle Stats and Wait events

From an Oracle perspective, both stats and wait events need to be captured on an ongoing basis. One would capture stats at the instance level and, if required, at the session level. To begin with, one can start with instance-level stats collected every 24 hrs. The finer the interval, the more accurate the results; however, it can get very cumbersome. It is best not to use the dba_hist views/AWR, but rather collect the stats from the v$ views. The v$ views are mostly incremental views and contain data from instance startup time.

Oracle Stats and Wait events – contd.

Stats can be collected for:
• Work load
• User related (transactions, logons, parses etc.)
• Redo activity
• Undo activity
• Temp activity
• Tablespace and object space usage
• PGA usage
• SGA usage
• Parallel operations
• IO operations
• File stats and temp stats

Wait events help mostly in performance tuning and in identifying steady-state behaviour. For wait events, capture the top 10 waits (including CPU time) ordered by time waited, along with the average wait time, total waits and wait class. Filter out idle and parallel (PX*) waits.

Infrastructure Statistics

From an infrastructure perspective, to begin with the following stats can be collected:
• CPU – utilization, run queue, context switches (voluntary and involuntary), interrupts, system calls, thread migrations
• Storage – number of IOPS/second, queue depth, size of IOPS, response time (lun level, volume and file), throughput
• Filesystem – usage, response time and growth
• Memory – physical memory consumed, swap in/out, page faults
• Network – throughput and details from netstat -s and kstat

It is important to note that OS stats are generally not event driven and are time sampled, so they need to be correlated with application stats to make sense.

Basic Oracle Instance profile

These stats allow us to create a simple and basic profile of the instance which can be used for daily reporting (a sample for a datawarehouse instance is shown below). It is important to note that even though many magnitudes of statistics are collected every day, the profile should present only enough information to enable a decision on whether further investigation is warranted.
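A minimal sketch of the kind of daily collection job this implies is shown below, run from cron once a day. The output directory and file naming are illustrative, and it assumes sqlplus access with OS authentication.

#!/bin/sh
# Sketch: snapshot instance-level statistics and wait events into dated flat files
# for later trending.
DAY=`date +%Y%m%d`
OUTDIR=/capplan/stats
sqlplus -s "/ as sysdba" <<EOF
set pagesize 0 linesize 200 feedback off trimspool on colsep '|'
spool $OUTDIR/sysstat_$DAY.txt
select name, value from v\$sysstat;
spool off
spool $OUTDIR/system_event_$DAY.txt
select event, total_waits, total_timeouts, time_waited, average_wait, wait_class
from   v\$system_event
where  wait_class <> 'Idle';
spool off
EOF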
A Simple profile for a Datawarehouse instance

Event | Class | -90 days | -30 days | -7 days | Today | Threshold
(average wait in centiseconds)
read by other session | User I/O | 2 | 2.1 | 3 | 2 | 2
db file sequential read | User I/O | 0.6 | 0.5 | 0.5 | 0.6 | 0.5
db file scattered read | User I/O | 0.8 | 0.76 | 0.81 | 0.8 | < 1
direct path read temp | User I/O | 1.56 | 1.8 | 1.6 | 1.6 | < 1.5
log file sync | Commit | Not Present | Not Present | 0.5 | 1 | 0
log file parallel write | System I/O | Not Present | Not Present | 0.12 | 0.2 | 0
direct path write temp | User I/O | 1.73 | 1.5 | 1.32 | 1.7 | < 1
db file parallel write | System I/O | 0.16 | 0.12 | 0.15 | 0.15 | < 0.1
control file parallel write | System I/O | Not Present | Not Present | 0.98 | 1 | 0
os thread startup | Concurrency | 18 | 14 | 18 | 18 | < 12

Oracle – Capacity planning (contd.)

To summarize:
• Profile the environment.
• Collect and collate an initial set of statistics when the environment is in steady state and user response time is deemed satisfactory – Oracle, OS, storage, network.
• Define and establish thresholds – Oracle, OS, storage and network. As before, user response time should be deemed satisfactory.
• Repeat statistics collection over a defined period of time – maybe monthly or quarterly.
• Establish a pattern of change – certain statistics increase over a period of time, whereas others decrease.
• Based on the pattern of change, plan on adding additional capacity.
• At any point during this time, bottlenecks can be identified and resolved accordingly.

Oracle Stats and Waits – v$ views

View | Comments
v$sysstat | Most Oracle statistics
v$sys_time_model | CPU/wait time statistics
v$pgastat | PGA statistics
v$sgainfo | SGA statistics
v$filestat | File IO statistics
v$tempstat | Temp file statistics
dba_free_space | Tablespace space usage
v$sesstat | Session statistics
v$system_event | Wait statistics
v$session_event | Session wait statistics
v$segstat | Segment statistics

Oracle Stats Detail (can be collected on a daily basis)

• Workload (v$sysstat): db block changes, DB time, CPU used by this session
• Redo (v$sysstat): redo buffer allocation retries, redo log space requests, redo log space wait time, redo blocks written, redo entries, redo size, redo writes, background checkpoints completed, redo synch writes, redo synch time, redo write time, redo wastage
• User related (v$sysstat): opened cursors cumulative, parse count (failures), parse count (hard), parse count (total), parse time cpu, execute count, logons cumulative, user commits, user rollbacks
• PGA (v$pgastat): aggregate PGA target parameter, aggregate PGA auto target, maximum PGA allocated, global memory bound, total PGA used for auto workareas, over allocation count, cache hit percentage
• PGA (v$sysstat): sorts (disk), sorts (memory), sorts (rows), workarea executions - multipass, workarea executions - onepass, workarea executions - optimal, workarea memory allocated
• SGA (v$sgainfo): Buffer Cache Size, Shared Pool Size, Large Pool Size, Maximum SGA Size, Free SGA Memory Available
• SGA (v$sysstat): prefetched blocks aged out before use
• Undo (v$sysstat): consistent gets, undo change vector size, consistent changes, DBWR undo block writes, transaction rollbacks
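Since the v$ views are cumulative from instance startup, the daily numbers behind a profile like the one above are really deltas between two snapshots. A minimal sketch of that step, assuming snapshot files written in the "name|value" layout of the collection sketch earlier (file names and paths are illustrative, and deltas are only valid if the instance was not restarted in between):

#!/bin/sh
# Sketch: diff yesterday's and today's v$sysstat snapshots to get per-day values.
YESTERDAY=/capplan/stats/sysstat_20080803.txt
TODAY=/capplan/stats/sysstat_20080804.txt
# Print: statistic name, yesterday, today, delta. The spooled NAME column is fixed
# width, so $1 matches exactly between the two files.
nawk -F'|' '
  NR == FNR { prev[$1] = $2; next }
  ($1 in prev) { printf "%s %15d %15d %15d\n", $1, prev[$1], $2, $2 - prev[$1] }
' "$YESTERDAY" "$TODAY"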
Oracle Stats – Sample (can be collected on a daily basis)

• Parallel (v$sysstat): DDL statements parallelized, DFO trees parallelized, DML statements parallelized, Parallel operations downgraded 1 to 25 pct, Parallel operations downgraded 25 to 50 pct, Parallel operations downgraded 50 to 75 pct, Parallel operations downgraded 75 to 99 pct, Parallel operations downgraded to serial, Parallel operations not downgraded, queries parallelized
• IO Related (v$sysstat): physical read total bytes, physical read total IO requests, physical reads direct, physical reads direct temporary tablespace, physical read total multi block requests, physical write total bytes, physical write total IO requests, physical write total multi block requests, physical writes direct, physical writes direct temporary tablespace, user I/O wait time
• Enqueue (v$sysstat): enqueue timeouts, enqueue waits, enqueue deadlocks, enqueue requests, enqueue conversions, enqueue releases
• Table and Index (v$sysstat): table scans (short tables), table scans (long tables), table scans (rowid ranges), table scans (direct read), table fetch by rowid, table fetch continued row, index fast full scans (full), index fast full scans (rowid ranges), index fast full scans (direct read), index fetch by key

Oracle Stats Detail – Sample (can be collected on a daily basis)

DATE | EVENT | TOTAL_WAITS | TOTAL_TIMEOUTS | TIME_WAITED | AVERAGE_WAIT | WAIT_CLASS
01-Aug-08 | db file scattered read | 1838346 | 0 | 6033377 | 0.8 | User I/O
02-Aug-08 | db file scattered read | 1906533 | 0 | 6034577 | 0.75 | User I/O
03-Aug-08 | db file scattered read | 1754866 | 0 | 5965344 | 0.9 | User I/O
04-Aug-08 | db file scattered read | 2356571 | 0 | 6154334 | 0.23 | User I/O

Oracle Stats Detail (can be collected on a daily basis)

Used Space in MB
Date | Tablespace1 | Tablespace2 | Tablespace3 | Tablespace n
01-Aug-08 | 25000 | 31000 | 14000 | 13210
02-Aug-08 | 25120 | 32001 | 14990 | 13210
03-Aug-08 | 25220 | 32150 | 15010 | 13210
04-Aug-08 | 25989 | 33000 | 15201 | 13210

Avl Space in MB
Date | Tablespace1 | Tablespace2 | Tablespace3 | Tablespace n
01-Aug-08 | 10000 | 4000 | 21000 | 21790
02-Aug-08 | 9880 | 2999 | 20010 | 21790
03-Aug-08 | 9780 | 2850 | 19990 | 21790
04-Aug-08 | 9011 | 2000 | 19799 | 21790

Oracle Stats Detail (can be collected on a daily basis)

Datafile | IOPS/Day | Avg Response Time/Day | Max Response Time/Day
/DW/dat01/file1.dbf | 35000 | 10ms | 24ms
/DW/dat01/file2.dbf | 120000 | 12ms | 15ms
/DW/dat01/file3.dbf | 68461 | 15ms | 18ms
/DW/dat01/file4.dbf | 58799 | 8ms | 10ms
.. | | |
/dev/vx/rdsk/dwdg/dwtmp0 | 130000 | 30ms | 68ms

Infrastructure Statistics (can be collected on a daily basis)

Infrastructure Stat | Tools
cpu user time | sar, vmstat, mpstat
cpu sys time | sar, vmstat, mpstat
context switches (inv and vol) | mpstat
system calls | sar, vmstat, mpstat
Filesystem usage | df
Thread migrations | mpstat
Interrupts | vmstat, mpstat
Run queue | vmstat, sar, w
Network stats | netstat and kstat
Memory stats | vmstat
File IO stats (complements Oracle) | odmstat
Volume stats | vxstat
Lun stats | vxdmpadm, swat and iostat
Queue depth, throughput, response time | vxstat, odmstat, swat, iostat
Storage allocated | vxdg
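To make the daily OS-side collection concrete, here is a minimal sketch using some of the tools from the table above. The intervals, sample counts and output directory are illustrative; run it from cron on each host.

#!/bin/sh
# Sketch: capture a day's worth of basic OS statistics into dated flat files.
DAY=`date +%Y%m%d`
OUTDIR=/capplan/os
mkdir -p $OUTDIR
vmstat 5 12     > $OUTDIR/vmstat_$DAY.txt   # run queue, memory, faults, cpu
mpstat 5 12     > $OUTDIR/mpstat_$DAY.txt   # per-cpu icsw, migrations, syscalls
iostat -xn 5 12 > $OUTDIR/iostat_$DAY.txt   # per-device response time and throughput
df -k           > $OUTDIR/df_$DAY.txt       # filesystem usage
netstat -s      > $OUTDIR/netstat_$DAY.txt  # network protocol statistics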

Friday, May 9, 2008

Simple Userland tools on Unix to help analyze application impact as a non-root user - CPU subsystem

I recently gave a presentation to my team on how to analyze the impact of the application on the CPU subsystem.

Read this document on Scribd: Analyze application impact - CPU Subsystem

Analyzing Application impact using simple userland tools – CPU subsystem
Krishna Manoharan
krishmanoh@gmail.com

Introduction

Every application impacts the host operating system and connected subsystems in a unique way. In order to profile an application and understand its impact on the environment, there are a number of userland tools provided within the OS. Many of these tools do not require super-user privileges, thus giving ordinary users such as DBAs or application developers the ability to see and gauge the impact of the application on the system.

Subsystems in an environment

One needs to analyze the impact of an application on all the major subsystems in an environment:
• CPU
• Memory
• Storage
• Network

CPU Subsystem – Metrics

The key metrics to look at when measuring CPU utilization are:
• Run queue – the number of running threads or threads waiting to run on a CPU.
• Utilization – CPU utilization.
• Context switches (voluntary and involuntary) – a context switch is when a kernel thread is suspended from execution on the CPU, its current state captured, and another thread is scheduled.
• Thread migrations – kernel threads which re-awake after a context switch and run on a different CPU than their previous run.
• Interrupt generation – interrupts are signals from hardware devices such as network, disk etc. or from software which force the CPU to process them immediately.
• System calls – system calls allow an application to request the kernel to perform privileged operations such as IO, creating new threads etc.
• Translation Lookaside Buffer (TLB) misses – the TLB is a table maintained in the CPU which contains a subset of mappings of virtual memory addresses to actual physical addresses. When a CPU needs to convert a virtual memory address to a physical address, it searches the TLB first; if the mapping is not found, it is a miss. TLB misses are expensive.

CPU Subsystem – Tools

To measure the impact of an application on the CPU subsystem, the tools available are:
• sar – gathers and reports on a variety of statistics. Make sure it is set to 5 minute intervals rather than the default.
• vmstat – reports CPU related statistics.
• mpstat – reports per-CPU statistics.
• prstat – reports process related statistics.
• truss – used in debugging.
• ps – reports process related statistics.
• trapstat – reports TLB misses.
Of all these, only trapstat needs to be run as root.

CPU Subsystem – Run Queue

There are many schools of thought regarding the size of the run queue. Normally one would say that a run queue of 2 or more shows a CPU bottleneck. On an SMP system, each CPU has its own run queue, so for an 8 CPU system, would a run queue of 16 be okay? It all depends on what the user perception is. You may have a run queue of 100 and still be okay with it if you are meeting your SLA. I personally have seen Tibco BusinessWorks running on Solaris and racking up a sustained run queue of 80+ on a 4 CPU V440 while remaining reasonably responsive. Similarly, for an Oracle 10g RAC installation running on a Dell 2950 with 8 CPUs, a sustained run queue of 12+ would leave the system pretty much unresponsive and unusable. I would say that a sustained run queue of 10 or more, coupled with a lot of involuntary context switches, means you definitely have a CPU bottleneck. All this assumes that your average CPU utilization is well within limits (average 30-50%). The important thing to note about the run queue when analyzing application impact is that none of the tools discussed above will tell you whether the run queue is due to the application or some other process. In order to find out what exactly is causing it, you would need to use prstat.
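To see whether a high run queue is sustained rather than a momentary spike, a minimal sketch like this can be used (the 5 second interval and 5 minute window are arbitrary choices for illustration):

#!/bin/sh
# Sketch: sample the run queue (the "r" column of vmstat) and print the average.
# 60 samples at 5 second intervals; the first few lines (headers and the since-boot
# summary row) are skipped.
vmstat 5 61 | awk 'NR > 3 { sum += $1; n++ } END { if (n) printf "samples=%d avg runq=%.1f\n", n, sum/n }'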
CPU Subsystem – Run Queue (contd.)

The run queue can be obtained using vmstat, w or sar.

 kthr      memory              page              disk          faults          cpu
 r b w    swap      free     re  mf pi po fr de sr s9 s3 s3 s3   in    sy    cs  us sy id
 0 13 0 84634792 97274944   103 573 96 26 24  0  0  0  0  0  0 4377 31853 14630  23  3 74

The run queue can also be obtained using the sar command, which gives a historic perspective.

mkrishna@tus1dwhdbspex01:> sar -q
SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u    06/04/2008
00:00:00  runq-sz %runocc swpq-sz %swpocc
00:05:00      1.0       2     0.0       0

The w command also gives the load average for the last 1, 5 and 15 minutes. The load average is the run queue.

mkrishna@tus1dwhdbspex01:> w
 8:30pm  up 24 day(s), 22:12,  4 users,  load average: 7.71, 8.87, 8.86
User     tty         login@    idle    JCPU    PCPU   what
root     console     10May08   25days                 format

CPU Subsystem – Utilization

This is a measure of how busy the CPU is. The CPU can be busy either processing high complexity threads (compression, encryption, math), resulting in high CPU utilization, or processing medium to low complexity threads, showing lower utilization. You can have a high run queue and low CPU utilization.

Overall CPU utilization can be obtained from vmstat and sar. Process-wise CPU utilization can be obtained from prstat and ps. Individual CPU utilization can be obtained from mpstat.

CPU Subsystem – Utilization (contd.)

Overall CPU utilization – measure using vmstat and sar.

vmstat – In an ideal scenario, % system time will be less than 5%. % user time will be the real measure of CPU utilization. % wait is to be ignored completely.

 kthr      memory              page              disk          faults          cpu
 r b w    swap      free     re  mf pi po fr de sr s9 s3 s3 s3   in    sy    cs  us sy id
 0 13 0 84634792 97274944   103 573 96 26 24  0  0  0  0  0  0 4377 31853 14630  23  3 74
 0 19 0 77838000 93835392     0  21  0  0  0  0  0  0  0  0  0 5997 31834  9154  27  2 72

sar – gives historical reporting (more on how to get sar reports for previous days later).

mkrishna@tus1dwhdbspex01:> sar -u
SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u    06/04/2008
00:00:00    %usr    %sys    %wio   %idle
00:05:00      31       3      38      29

CPU Subsystem – Utilization (contd.)

Process CPU utilization

prstat – the Solaris equivalent of top. It gives a considerable amount of detailed information. In order to see individual process CPU utilization and overall user CPU utilization, run prstat -a 1. prstat can also give a breakup of the run queue and which thread/process is currently executing.

mkrishna@tus1dwhdbspex01:> prstat -a 1
   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP
 29192 oraodsup   20G   11G cpu522   10    0   0:12:59 3.1% oracle/1
 NPROC USERNAME  SIZE   RSS MEMORY       TIME  CPU
   185 oraodsup 3579G 1969G    77%   57:08:28  26%

The STATE column gives the currently running processes and the CPU column gives the % CPU consumed by that process. The output at the bottom gives the overall CPU consumed by a specific user account.

CPU Subsystem – Utilization (contd.)

Process CPU utilization over time

The ps command will give the overall CPU time consumed by a process. The TIME column shows the amount of CPU execution time. This gives an idea of which processes have consumed the most CPU time since system boot.

mkrishna@tus1dwhdbspex01:> ps -e -o time,args | more
      TIME COMMAND
      0:12 sched
      4:38 /etc/init
      0:00 pageout
1-23:03:57 fsflush
      0:00 -sh
      0:10 /usr/lib/inet/in.mpathd -a
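Since the TIME column comes in several formats (MM:SS, HH:MM:SS, d-HH:MM:SS), ranking processes by accumulated CPU time takes a little massaging. A minimal sketch, using nawk to normalise the times to seconds (the field handling is an assumption about the default ps output layout):

#!/bin/sh
# Sketch: rank processes by accumulated CPU time since they started.
ps -e -o time,pid,args | nawk '
  NR == 1 { next }                                  # skip the header line
  {
    t = $1; days = 0
    if (t ~ /-/) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, p, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
    secs += days * 86400
    printf "%10d %s\n", secs, $0                    # seconds, then the original line
  }' | sort -rn | head -10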
CPU Subsystem – Utilization (contd.)

Individual CPU utilization

The mpstat command gives the individual CPU utilization on an SMP system. Checking CPU utilization using mpstat allows you to see whether all the CPUs are being used and whether SMP is really happening.

mkrishna@tus1dwhdbspex01:> mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   36   0  926   313  201  475   14   34   81    0  1125   28   2  30  40
  1   16   0  914    18    1  480   13   34   71    0   929   21   2  34  44

CPU Subsystem – Context switching and Thread Migrations

A context switch (CS) is when a kernel thread is suspended from execution on the CPU, its current state captured, and another thread is scheduled. Thread migrations happen when a thread that was running on a particular CPU is context switched out and, when it is ready to run again, is scheduled on a different CPU. Thread migrations are not efficient because the new CPU's cache is not warm with the thread's data.

Context switches happen as a result of the following:
• Time quantum expired for the running thread – causes an involuntary CS
• A higher priority thread needs to run – causes an involuntary CS
• An interrupt is being serviced (not a 100% context switch) – causes an involuntary CS
• The thread has completed its activity or issued a blocking system call – causes a voluntary CS

An involuntary CS means that a thread did not get an opportunity to complete its work. A system can have 100,000 context switches and still meet SLAs if most of the CS are voluntary in nature. A high degree of involuntary context switches coupled with thread migrations usually means there is a CPU related bottleneck.

CPU Subsystem – Context switching and Thread Migrations (contd.)

The tools to measure context switches and thread migrations are vmstat, mpstat, prstat and sar.

vmstat
mkrishna@tus1dwhdbspex01:> vmstat 1
 kthr      memory              page              disk          faults          cpu
 r b w    swap      free     re  mf pi po fr de sr s9 s3 s3 s3   in    sy    cs  us sy id
 0 13 0 84568400 97239720   102 573 95 26 24  0  0  0  0  0  0 4398 31925 14622  23  3 74
 0 18 0 76750792 91878616     6 146  0 63 55  0  0  0  0  0  0 6712 45464 11107  33  3 64

mpstat
mkrishna@tus1dwhdbspex01:> mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   36   0  925   313  201  474   15   34   81    0  1125   28   2  30  40

CPU Subsystem – Context switching and Thread Migrations (contd.)

prstat – use the -m switch, which shows microstate statistics per process. It allows you to drill down into which process is being context switched.

mkrishna@tus1dwhdbspex01:> prstat -n 1000 -m 1
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  5221 oraodsup 100 0.0 - 0.0 0 11 2 0 oracle/1
  5400 oraodsup 29 2.4 69 - 235 28 1K 0 oracle/1

sar -w – shows the history of context switches.

mkrishna@inftpd1:> sar -w
SunOS inftpd1 5.10 Generic_118833-36 sun4u    06/06/2008
00:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s
00:05:00    0.00     0.0    0.00     0.0    1420

CPU Subsystem – Interrupts

Interrupts are mechanisms by which a hardware device (HBA, network adaptor) or an application (signals) can get CPU attention. Interrupts are directed to specific CPUs. When an interrupt is generated, the currently running thread is pinned and the interrupt serviced; the pinned thread is then resumed. Too many interrupts can hurt application performance.
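For a single process of interest (LGWR, for example), the VCX and ICX columns of prstat -m can be watched directly. A minimal sketch, where the pgrep pattern and the DW instance name are purely illustrative:

#!/bin/sh
# Sketch: watch voluntary (VCX) vs involuntary (ICX) context switches for one process.
PID=`pgrep -f ora_lgwr_DW | head -1`   # adjust the pattern to your instance name
prstat -m -p $PID 5 12                 # 12 samples, 5 seconds apart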
CPU Subsystem – Interrupts (contd.)

Interrupts can be monitored using vmstat and mpstat.

vmstat
mkrishna@tus1dwhdbspex01:> vmstat 1
 kthr      memory              page              disk          faults          cpu
 r b w    swap      free     re  mf pi po fr de sr s9 s3 s3 s3   in    sy    cs  us sy id
 0 13 0 84568400 97239720   102 573 95 26 24  0  0  0  0  0  0 4398 31925 14622  23  3 74
 0 18 0 76750792 91878616     6 146  0 63 55  0  0  0  0  0  0 6712 45464 11107  33  3 64

mpstat
mkrishna@tus1dwhdbspex01:> mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   36   0  925   313  201  474   15   34   81    0  1125   28   2  30  40

CPU Subsystem – System Calls

System calls are APIs which allow a process to request privileged operations such as IO, memory allocation or new thread creation. It is important to identify the time spent servicing system calls. Too many system calls (such as those generated by timed_statistics in Oracle) can swamp the system.

CPU Subsystem – System Calls (contd.)

System calls can be monitored using vmstat, mpstat, prstat and sar (historical trending).

vmstat
mkrishna@tus1dwhdbspex01:> vmstat 1
 kthr      memory              page              disk          faults          cpu
 r b w    swap      free     re  mf pi po fr de sr s9 s3 s3 s3   in    sy    cs  us sy id
 0 13 0 84568400 97239720   102 573 95 26 24  0  0  0  0  0  0 4398 31925 14622  23  3 74
 0 18 0 76750792 91878616     6 146  0 63 55  0  0  0  0  0  0 6712 45464 11107  33  3 64

mpstat
mkrishna@tus1dwhdbspex01:> mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   36   0  925   313  201  474   15   34   81    0  1125   28   2  30  40

sar -c
mkrishna@inftpd1:> sar -c | more
SunOS inftpd1 5.10 Generic_118833-36 sun4u    06/06/2008
00:00:00 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
00:05:00    1400     103      21    0.07    0.07   56915   41169

CPU Subsystem – System Calls (contd.)

The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs. truss is very useful to trace system calls and the time spent on a particular call. In the example below, out of 126 seconds of elapsed time, around 50 seconds were spent reading data from the cache folder.

root@ustu-infapppdin21:> truss -c -p 23050
syscall               seconds   calls  errors
read                   40.740  639326
write                   1.313   29562
time                     .079    4286
lseek                   9.807  608657
fcntl                    .000      35
lwp_park                 .197    5170       1
lwp_unpark               .142    5168
poll                     .046     731
yield                    .001      29
                     --------  ------  ------
sys totals:            52.328 1292964       1
usr time:              25.691
elapsed:              126.210

Historic sar data

How do I read historical sar data? By default, sar data is saved in /var/adm/sa for 7 days. The BI systems have been customized to save data for 30 days. To read sar data from previous days:

sar -f /var/adm/sa/sa<01-29> -q|u|w|etc

So much data – how do I correlate?

If you have kept up with me so far, you will begin to wonder how to take all this data in and make it meaningful. It depends on what you wish to accomplish:
• Short term, reactive – check on a possible CPU bottleneck right away.
• Long term – do trending and see how usage has changed over a period of time.

For the long term, sar data is invaluable, but it is not complete. sar captures run queue, utilization, system calls and context switches; however, it does not capture interrupts, icsw or thread migrations, so custom scripts need to be written to capture this information. Long term, it is also difficult to capture specific process/user related information unless process accounting is installed. A better option for the above would be tools such as Sitescope.
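A minimal sketch of such a custom script is shown below: a cron job that appends timestamped mpstat samples (which carry icsw, migr and intr) to a daily file for long term trending. The output path and intervals are illustrative.

#!/bin/sh
# Sketch: keep a long-term history of per-CPU icsw, thread migrations and interrupts,
# which sar does not record.
OUT=/var/tmp/mpstat_history_`date +%Y%m%d`.txt
echo "#### `date '+%Y-%m-%d %H:%M:%S'`" >> $OUT
# 3 samples, 10 seconds apart; the first sample is the since-boot average.
mpstat 10 3 >> $OUT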
Short Term Reactive approach

1. Suspect a CPU bottleneck? The application confirms it (AWR/ASH) or performance is poor.
2. Check using w (load average), vmstat and sar (run queue and utilization); confirm that system related processes are not the cause.
3. Check using prstat -m -a 1 for processes using up CPU. Check cpu, cs, icsw and system calls for the specific process(es).
4. Identify the process, either from the application or through ps.
5. Debug the process, e.g. do a truss -c -p <pid> and see where time is being spent by the process and on which system call. From AWR/ASH reports, identify and analyze the SQL.
6. Further action (application/SQL analysis, add more CPUs, break up/stagger the load etc.).

Long Term Pro-Active approach

1. Set up sar to run every 5 minutes.
2. Either set up custom scripts to capture other data such as icsw, interrupts etc., or install tools such as BMC Patrol/Sitescope/OpenView.
3. Prepare trending reports for CPU run queue, utilization, cs, icsw, thread migrations and interrupts over a period of time (say 3 months). This gives an overall picture of CPU related metrics.
4. Further action.

From an oracle perspective

All this is fine, but how do I know Oracle sees a CPU bottleneck? The answer is in the AWR/ASH reports:
• AWR/ASH reports will show CPU waits in the top 5 waits.
• The table dba_hist_sysmetric_summary (Oracle 10g) – the "Current OS Load" metric will show the run queue as seen by Oracle for every AWR snapshot.
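A minimal sketch of pulling that metric for recent snapshots, assuming AWR is available (it is a licensed feature) and sqlplus access as sysdba:

#!/bin/sh
# Sketch: run queue as seen by Oracle ("Current OS Load") for the last 2 days of AWR snapshots.
sqlplus -s "/ as sysdba" <<'EOF'
set pagesize 100 linesize 120
col begin_time format a20
select snap_id,
       to_char(begin_time, 'DD-MON-YY HH24:MI') begin_time,
       round(average, 2) avg_os_load,
       round(maxval, 2)  max_os_load
from   dba_hist_sysmetric_summary
where  metric_name = 'Current OS Load'
and    begin_time > sysdate - 2
order  by snap_id;
EOF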