Friday, October 5, 2007

Basic Performance Tuning settings for the Solaris OS for Datawarehouse Loads

Below are some settings I typically change or enable on the Solaris OS for a DW environment. I generally change the default settings for the OS, network, VxVM/VxFS, Emulex HBAs and storage connectivity. All these parameters need to be tested thoroughly in a non-prod environment before being implemented.

/etc/system (both VxVM and OS settings)

Set the File Descriptors - The defaults are way too low and need to be bumped up. For Solaris 10, increasing the rlim_fd_cur should suffice.
  • rlim_fd_max
  • rlim_fd_cur
set rlim_fd_max=8192
set rlim_fd_cur=4096


Default:
Solaris 9
    • rlim_fd_max = 1024
    • rlim_fd_cur = 64
Solaris 10
    • rlim_fd_max = 65,536
    • rlim_fd_cur = 256
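After the reboot, you can confirm what a process actually gets. A minimal check from a login shell - ulimit -n shows the soft limit for that shell, and on Solaris 10 prctl shows the matching resource control:

ulimit -n
# resource-control view of the same limit (Solaris 10)
prctl -n process.max-file-descriptor $$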
To improve basic disk performance
  • maxphys - The maximum size of a physical I/O request. If a driver sees a request larger than this, it breaks the request into maxphys-sized chunks. File systems can and do impose their own limits. This value should be at least as large as the corresponding settings in the file system and volume manager layers (vol_maxio, vol_maxspecialio, max_direct_iosz etc.). maxphys is set in bytes.
The below sets it to 8MB.

set maxphys=8388608

Default:
Solaris 9 and 10
maxphys = 131072 (128K)
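To confirm the new value took effect after the reboot, the live kernel variable can be read with mdb (needs root). A quick check:

# print maxphys from the running kernel, in decimal
echo "maxphys/D" | mdb -k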

Virtual Memory values - The values below are best explained by the book Solaris Internals. They tie into the VM subsystem and play a vital role during heavy memory operations.
  • maxpgio - Maximum number of page I/O requests that can be queued by the paging system. This number is divided by 4 to get the actual maximum used by the paging system. It is used to throttle the number of requests as well as to control process swapping. maxpgio is expressed in I/Os.
The funny part is that, as per the Sun docs, the range for maxpgio is 1 to 1024, but it can be set as high as 65536.

set maxpgio=65536
  • slowscan - Minimum number of pages per second that the system looks at when attempting to reclaim memory. Folks set either slowscan or fastscan. I prefer to set slowscan.
set slowscan=500

  • tune_t_fsflushr - Specifies the number of seconds between fsflush invocations.
set tune_t_fsflushr=5
  • autoup - Along with tune_t_fsflushr, autoup controls the amount of memory examined for dirty pages in each invocation and the frequency of file system sync operations.
set autoup=300

On systems with more than 16GB memory, to reduce the impact of fsflush on the system, it is best to set autoup to higher values.

Default:
Solaris 9 and 10
maxpgio = 40
slowscan = The smaller of 1/20th of physical memory in pages and 100.
tune_t_fsflushr = 5
autoup = 30
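Whether the paging changes are helping can be watched with vmstat - the sr column is the page scan rate and po is page-outs; a consistently high sr usually points to genuine memory pressure rather than a tuning problem. For example:

# sample every 5 seconds; watch the sr (scan rate) and po (page-out) columns
vmstat 5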

CPU Affinity and Context switches -
  • rechoose_interval - This setting tries to keep a thread running on the same CPU it last ran on. The idea is that the CPU cache is still warm with that thread's instructions and data, which improves efficiency. The rechoose_interval variable tells the kernel which CPU to select when a thread needs to be placed: if the thread has not run within rechoose_interval ticks it may be moved to another CPU, otherwise it will continue to wait on the CPU it has been running on. A higher value of rechoose_interval "firms up" the soft affinity. The downside is that if this value is too high, processes can spread out sluggishly when a single process forks a lot of children.
The below sets it to 150, which is a fairly good value for Datawarehouse systems. However, you need to test and see if it reduces your LWP/thread migrations.

set rechoose_interval=150

Default: 3
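mpstat reports cross-CPU thread migrations in its migr column, so comparing that before and after the change is a simple way to see whether the firmer affinity is actually reducing migrations:

# migr = thread migrations to another processor, per second per CPU
mpstat 5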

VxVM System kernel parameters -
  • vol_maxio - I/Os larger than this size are broken up in the Veritas VxVM layer. Physical I/Os are broken up based on the disk's capabilities and are unaffected by this logical I/O size setting.
The below sets it to 8MB (16384 512-byte sectors), which is the same as maxphys.

set vxio:vol_maxio=16384

Default: 512 sectors. Remember that 512 sectors = 256KB.
  • vol_maxioctl - The size of the largest ioctl request that VxVM will handle; anything bigger is broken down. ODM uses ioctl, so it makes sense to make this bigger than the biggest request (reads/writes) that can be issued from Oracle. The below sets it to 128K, which is the max.
set vxio:vol_maxioctl=131072

Default: 32 KB
  • vol_maxspecialio - The size of the largest I/O handled by an ioctl call as issued by the application (such as Oracle when using ODM). The ioctl itself may be small, but it can request a large I/O operation.
The below sets it to 8MB (16384 sectors), which is the same as maxphys and vol_maxio.

set vxio:vol_maxspecialio=16384

Default: 512 sectors

  • vol_default_iodelay - Count in clock ticks that utilities will pause between issuing I/Os if they have been directed to throttle down but have not been given a specific delay time. Utilities such as mirror resynchronization and RAID-5 rebuilds use this value.
set vxio:vol_default_iodelay=10

Default: 50 ticks
  • voliomem_chunk_size - The granularity of memory chunks used by VxVM when allocating or releasing system memory. A larger granularity reduces CPU overhead due to memory allocation by allowing VxVM to retain hold of a larger amount of memory.
The below sets it to 128K which is the maximum.

set vxio:voliomem_chunk_size=131072

Default: 64KB
  • voliomem_maxpool_sz - The maximum memory requested from the system by VxVM for internal purposes. This tunable has a direct impact on the performance of VxVM as it prevents one I/O operation from using all the memory in the system.
The below sets it to 128M which is the max.

set vxio:voliomem_maxpool_sz=134217728

Default: 5% of memory up to a maximum of 128MB.
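Assuming the vxio module is loaded, the effective values can be read back after the reboot with mdb; on some releases the symbols may need to be scoped with the module name (vxio`vol_maxio). A sketch:

# read the live vxio tunables from the running kernel, in decimal
echo "vol_maxio/D" | mdb -k
echo "voliomem_maxpool_sz/D" | mdb -k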

Shared Memory settings -

I will cover these in a later discussion.

VxFS Settings -

When using direct I/O (with ODM or the forcedirectio mount option), it makes no difference how you set the prefetch or write-back policies for volumes containing Oracle data files, as all I/O bypasses the file system buffer cache. However, you can set read-ahead and write-back for other volumes - application, Oracle binaries etc.

The below are the default settings for a concat file system (VxFS version 4.1):

read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 8
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 162688000
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0

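The current settings for any mounted VxFS file system (like the defaults listed above) can be displayed by pointing vxtunefs at its mount point; /app01 below is just a placeholder:

vxtunefs /app01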
You could set much higher values than the defaults and see if it improves performance for non-Oracle data file volumes. Normally, you would set the read-ahead and write-back in proportion to how your RAID group is set up on the array.

For example:

For a RAID-5 RG, say 6D+1, the read-ahead would be 6 * (I/O size of the array) and the write-back would be the same. You could do multiples of these values too. If using an HDS array (9585 or AMS1000), you would use 64K as the I/O size, so the read-ahead and write-back can be 384K or multiples of 384K; see the sketch below. read_nstream and write_nstream would not be required in such cases.
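Taking the 6D+1 / 384K example above, the tunables can be changed on a mounted file system with vxtunefs and made persistent across remounts via /etc/vx/tunefstab. A sketch, with the /app01 mount point and the appdg/appvol volume as placeholders:

vxtunefs -o read_pref_io=393216 /app01
vxtunefs -o write_pref_io=393216 /app01

And the matching /etc/vx/tunefstab entry so the values survive a remount:

/dev/vx/dsk/appdg/appvol read_pref_io=393216,write_pref_io=393216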

For data files, it is always better if the application (such as Oracle) handles the read-ahead and write-back. Oracle has its own buffer cache and is intimately aware of what data is required to handle user requests.

HBA Settings -

The below settings are for an Emulex HBA. Similar configs exist for QLogic too.
  • lun-queue-depth
  • tgt-queue-depth
Many if not all storage vendors would encourage you to set lun-queue-depth to 8 and tgt-queue-depth to 0. This setting favors the storage vendor, not the customer. The reasoning is that controllers are limited in the number of outstanding requests they can sustain, and this information is not published. However, it is highly improbable that you would ever reach those limits, and setting 8 and 0 cripples a system. Also, do not set sd_max_throttle at all - this is again vendor-propagated. With today's arrays you should not see SCSI retry errors at all.

lun-queue-depth=30;
tgt-queue-depth=256;
  • num-iocbs
  • num-bufs
  • discovery-threads
Do not go with the defaults. All these values need to be bumped up significantly to get good performance. The man page for lpfc is good reading material. You can expect some questions from your switch vendor; however, insist on the values you set.

num-iocbs=4096;
num-bufs=2048;
discovery-threads=32;
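On Solaris these Emulex parameters normally go into the lpfc driver configuration file, typically /kernel/drv/lpfc.conf (check the lpfc man page for your driver release). A sketch of the combined block:

# /kernel/drv/lpfc.conf - queue depths and resource counts discussed above
lun-queue-depth=30;
tgt-queue-depth=256;
num-iocbs=4096;
num-bufs=2048;
discovery-threads=32;

The driver only reads the file at load time, so a reconfiguration reboot (or an unload/reload of the driver) is typically needed for the new values to take effect.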

Network Settings -

I have observed that a Solaris 10 system running mpath can easily send/receive 60MB/sec on a gigabit link. This is of course also dependent on how you push the traffic down the pipe. Normally I set the below parameters using the nddconfig script provided by SUNWjass; the equivalent ndd commands are sketched after the list.
  • tcp_maxpsz_multiplier - Increasing this parameter means fewer copy operations, with more data copied per operation.
tcp_maxpsz_multiplier=10
  • tcp_wscale_always - To ensure window scaling is available at least on the receiving side.
tcp_wscale_always=1
  • tcp_cwnd_max - The maximum size to which the congestion window can be opened. This plays a vital role on gigabit links and can give you incredible results.
tcp_cwnd_max=33554432
  • tcp_max_buf - The maximum buffer size in bytes. It limits how large the send and receive buffers can be set by an application using setsockopt(3XNET). This plays a vital role on gigabit links and can give you incredible results.
tcp_max_buf=67108864
  • tcp_xmit_hiwat - This parameter influences a heuristic that determines the size of the initial send window. The actual value will be rounded up to the next multiple of the MSS, e.g. 8760 = 6 * 1460. On Solaris 10, the default is 1MB and so would not need to be changed.
tcp_xmit_hiwat=1048576
  • tcp_recv_hiwat - This parameter determines the maximum size of the initial TCP receive buffer. The specified value will be rounded up to the next multiple of the MSS. On Solaris 10, the default is 1MB.
tcp_recv_hiwat=1048576
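A minimal sketch of applying the above by hand with ndd (the nddconfig script from SUNWjass does essentially the same thing at boot; plain ndd settings do not survive a reboot on their own):

ndd -set /dev/tcp tcp_maxpsz_multiplier 10
ndd -set /dev/tcp tcp_wscale_always 1
ndd -set /dev/tcp tcp_cwnd_max 33554432
ndd -set /dev/tcp tcp_max_buf 67108864
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576

Any of them can be read back with ndd -get /dev/tcp followed by the parameter name.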