Saturday, November 17, 2007

NAS devices for Oracle - Netapp versus HP Enterprise File Services Clustered Gateway

Continuing my earlier topic about Oracle and NAS, the ideal NAS device for Oracle would, from a tactical perspective, support the following.
  • Supported by Oracle
  • Symmetric active controllers so that load-balancing, linear scaling and HA are possible.
  • 10G ethernet capabilities with Jumbo frames.
  • Supports both Unix and Linux clients.
  • Support Direct and Async I/O.
  • Ability to map luns as an initiator (leveraging existing SAN infrastructure).
  • Direct I/O to luns bypassing the NAS cache as an option. So the NAS basically functions like a gateway device doing protocol conversion only.
  • Snapshots at the NAS level rather than the SAN level. This allows backups to use different, better-controlled storage.
  • Ability to carve volumes from the luns in any manner requested by the end-user (Raid 0, Raid 10, Raid 5 etc.), with end-user specified stripe widths.
  • Set the pre-fetch and write-back policies for the NFS filesystems on the NAS head.
  • Fine Tune the I/O aspects of both NAS and SAN components
  • Last but not least - easy to support and maintain.
It goes without saying that any NAS should have a decent I/O scheduler sitting in front of the luns which aggressively sorts/merges I/O to reduce disk head movements.

While Netapp cannot do much of the above, there is a new player in the market - the HP Enterprise File Services Clustered Gateway.

The HP solution for Unix is built on the Suse Linux Platform using the Polyserve Cluster Filesystem. It is built to serve as a NAS head utilizing HP branded Storage Arrays (I assume they would support other Arrays - as long as drivers exist for Suse Linux). I personally feel this is a brilliant idea - marrying a cluster filesystem/volume manager with an OS and making a NAS head which can scale horizontally. The OS gives manageability + SAN capabilities out of the box and the cluster filesystem/volume manager gives horizontal scalability and custom volume management. While Linux is not the best performer when it comes to Networking (compared to Netapp or Solaris), with all the other goodness thrown in, it is a mighty force indeed.

It meets most of the requirements listed above; however, I have not yet had the opportunity to test it in a live environment. I think this would be a great product to watch out for.

Thursday, November 15, 2007

NAS and SAN - How oracle could shape the future of storage technologies

Oracle is today's leading database, deployed in a great many shops, and it is in turn one of the prime drivers for storage in a company. Without the need for databases - and highly available databases in particular - I doubt the SAN technologies we see today would look the way they do. With RAC on 11g, Oracle is crossing new boundaries and again redefining storage technologies.

I had been working for some time in putting together a simple, cost-effective and high performing RAC solution. Though the basic concept of RAC is good, getting it all to work together in a simple efficient manner is not an easy task. There are many components and for me, the SAN was the biggest pain area.

Having a strong background in Storage Technologies, I quickly came to the conclusion that in order to make it easy to deploy and maintain, one would need to replace the SAN and its associated components (cluster filesystem, cluster volume manager, clustering and such) with a simpler technology. Here comes Oracle with ASM and CRS (adapted from Digital/Compaq clusterware) to make things simpler. While the goal of ASM is admirable, it is far from a perfect solution. For instance, taking backups with RMAN is a pain. One would prefer snapshots (either SAN-based or Flashsnap), but snapshots do not work well because ASM exposes no visible filesystem.

The way ASM stripes files can become a bottleneck when you have a large number of disks in a single disk group. So here we go again, having to plan disk groups, lun sizing and so on. For an Enterprise customer with a significant amount of data, ASM is simply another bottleneck. The future lies with simpler technologies that are easier to support.

Luckily Oracle comes to the rescue with support for NFS. NFS has been supported since 9i (with some really large-scale implementations); however, 10g RAC is a lot friendlier to implement and support. Until now, NFS was used only by dot-coms or companies that could not afford SANs. NFS used up CPU (system time), could not be reliably load-balanced and was simply slower than a SAN.

With the advent of 10G Ethernet, TCP Offload Engines (TOE) and multiple symmetric active NAS heads, it looks to me that NAS is finally on par with, or even better than, the SAN.

  • Oracle supports NFS out of the box as a file-system (either cluster or otherwise).
  • NAS is cheap, easy to deploy and maintain. (since it uses existing Ethernet infrastructure)
  • NFS is a cluster filesystem with a proven track record and inbuilt support within the OS.
  • NAS devices give the same kind of cloning/snapshot technology as the SAN for easy backups.
  • NFS v3 is really fast when configured correctly with well defined tunables. No need to search for hidden parameters etc to increase your payload.
  • 10G Ethernet, jumbo frames and TOE cards give blazing performance with low system overhead.
  • One can easily bond/multipath multiple cards at the host for superior performance (transparent to the application).
  • Now, NAS heads offer multiple active controllers with transparent failover.
  • NFS v3 supports direct i/o and oracle supports async i/o on NFS.
  • No fancy volume managers or filesystems required to support NFS. Open Standard.
  • It is platform independent and with proper planning well scalable.
  • Easy for system operators and database operators to understand and use.
Now Oracle has gone a step further in 11g, embedding the NFS client directly into the Oracle kernel. It has also incorporated load balancing across multiple paths to the NAS head.
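To give a flavour of what "configured correctly" means for the kernel NFS client, the /etc/fstab line below is a sketch of the mount options commonly recommended for Oracle datafiles over NFS v3 on Linux. The server name and paths are made up, and the exact options should come from Oracle's and the NAS vendor's documentation for your platform:

# hypothetical NAS export for Oracle datafiles (Linux client, NFS v3 over TCP)
nas01:/vol/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600,actimeo=0  0 0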

While the same amount of attention needs to go into designing a NAS solution for Oracle as a SAN (Raid configuration, lun layout, cache, controllers, failover etc), it is at least a lot simpler to deploy than a SAN and its associated baggage.

I believe that if Oracle 11g with NFS is a success, then we are going to see considerable movement from SAN to NAS. The existing SAN infrastructure would simply become a backend for NAS heads, but even so we can considerably reduce the cost and complexity of the infrastructure.

Thursday, November 8, 2007

Basic Performance Tuning for the Linux OS for Datawarehouse Loads (contd.) - Virtual Memory

My earlier article talked about tuning the CPU subsystem in Linux to meet Datawarehouse requirements. This is a continuation of that article, covering the Virtual Memory subsystem.

VM Subsystem


This is by far one of the most important subsystems on Linux. Unfortunately, with every release of the kernel you get a dozen new tunables and a couple of old ones gone, with little or no documentation whatsoever. The aim is to use the VM only for memory management and never rely on the filesystem cache. Oracle should handle the buffering (Buffer Cache) and use Direct + Async I/O for all disk activity. We should never see swapping as shown in vmstat -s.

The path for the tunables is /proc/sys/vm

swappiness - Controls how aggressively the kernel swaps out anonymous memory in favour of keeping the page cache.

For a database such as Oracle using ODM, you do not want to use the page cache at all. ODM does async + direct I/O and Oracle does its own buffering internally, so you can very well avoid an unnecessary page cache. Not only does it use up memory, it costs kernel time to write it out when other processes need the memory.

The default is 60 which is on the higher side. I would suggest setting it to a much lower value.

overcommit_memory – Make sure it is set to 0

min_free_kbytes – The amount of free memory that needs to be reserved. I suggest keeping it at 512-1024MB minimum for a 32GB system.

lower_zone_protection – This is to prevent Linux from dying with an OOM (out of memory) error. It is a good idea to set it to around 200MB. Documentation is quite vague (not surprising) as to whether it is in pages or in MB. On a 32GB system, 200 would seem to be an optimal value.

page-cluster – The number of pages written to disk in one go in the event of swapping (which we never want to see). The value is a power of two, so the default of 3 means 8 pages. Since modern disk subsystems can easily handle large requests, I would suggest raising it from the default of 3 (8 pages) to around 8 (256 pages). Each page on an x86_64 system is 4K in size.

dirty_ratio and max_queue_depth - These need more cause-and-effect study.
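Putting the VM tunables above together, a minimal /etc/sysctl.conf sketch for a 32GB node might look like the following. The numbers are illustrative starting points only and need to be validated against your own workload:

# illustrative VM settings for a 32GB Oracle node
vm.swappiness = 10            # keep the page cache small; Oracle buffers internally
vm.overcommit_memory = 0      # heuristic overcommit, as noted above
vm.min_free_kbytes = 524288   # reserve roughly 512MB of free memory
vm.lower_zone_protection = 200
vm.page-cluster = 8           # 256 pages per swap-out, if swapping ever happens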

TLB and Huge Pages

Huge Pages on Linux are somewhat similar to the functionality offered by (D)ISM and MPSS on Solaris. Huge Pages cannot be paged out to disk, which improves efficiency and reduces TLB misses.

hugetlb_shm_group – Add the gid of the oracle group (dba or oinstall) to this file so that the oracle user can use huge pages.

nr_hugepages – Number of huge pages to allocate. The default size of a hugepage is 2MB on x86_64.

nr_hugepages = shmmax / (2*1024*1024)

So for 18GB of shmmax, you would set nr_hugepages to 9216.
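As a sketch, assuming the 18GB shmmax above and that the dba group has gid 501 (substitute your own gid), the entries and a quick check would be:

# /etc/sysctl.conf - huge pages for an 18GB SGA (illustrative)
vm.nr_hugepages = 9216
vm.hugetlb_shm_group = 501   # gid of the dba/oinstall group - example value

# verify after the pages have been allocated
grep Huge /proc/meminfo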

The important thing to note is that huge pages do not get paged to disk, so if configured incorrectly you will run into problems. You can also see ORA-4030 errors if you are running very large queries and have a runaway PGA.

So to summarize,

Physical Memory > shmall > shmmax (with nr_hugepages sized to cover shmmax)

Basic Performance Tuning for the Linux OS for Datawarehouse Loads - Kernel and CPU

Apart from basic steps such as disabling daemons which provide no value in a server environment, there are a number of parameters which need to be changed from their defaults to make Linux RAC perform satisfactorily in a Datawarehouse environment.

Datawarehousing demands a high-performing I/O subsystem and well-tuned Virtual Memory and CPU subsystems.

This document is geared more towards using Linux in a RAC environment, as that is currently about the only way Linux can compete with big-iron systems - scaling horizontally rather than vertically. This document also does not give actual values for every parameter; I believe values need to be tested thoroughly before deploying in any environment, and they differ depending on the customer's load and requirements.

While I am no kernel hacker, much of this information is from a tactical perspective and from real-life scenarios.

I am referring only to the 64-bit 2.6.x kernels (RHAS 4.0), which are a significant improvement over earlier versions. They handle load relatively well, but scalability and support are still not good enough compared to Solaris. The kernel referred to in this document is specifically 2.6.9-42.

In Enterprise Deployments, compiling the kernel from source is not an option (due to lack of support), so this write-up is about settings that can be changed without requiring a recompile. Also Enterprise customers use Veritas Volume Manager/File system for storage management.

I am assuming that the systems being used are 64-bit with 8 CPUs and 32GB of memory – something like a Dell 2900. Looking at anything smaller than that is probably not worth the time and effort. An average of 4 nodes for a 3TB Datawarehouse with around 600-1000 users would be a good start. It is a lot better to build a RAC with several big nodes than with a lot of smaller nodes. Of course, my first choice would be a Solaris RAC with E2900 nodes – extremely fast and a great OS with excellent support.

Ideally, an x86_64 node would have 2 dedicated HBAs for primary storage, 2 interconnects for RAC and 2 interconnects for CFS. It goes without saying that the NICs should be from the same vendor - meaning Intel or Broadcom throughout. There should be a minimum of 2 active paths to a lun.

Backups for the database would happen via the SAN using Shadow Image or Veritas Flashsnap. I personally favor Flashsnap since it is lot more flexible and cost-effective. It is best to dedicate one node for backups/restores/management alone. Since a Dell 2900 costs around 10K, I think it is a good investment.

As to the model and make of the systems, any system that can sustain 8 CPU's with at least 8GB/sec system bus (CPU+Memory) and 4GB/sec I/O bandwidth (Network + Storage) should be a good start.

1. Basics -

The main areas that need to be looked at for any warehouse are the I/O subsystem, Virtual Memory, Scheduler and CPU subsystem and finally the Network Subsystem.

In Linux, these would be the fs, kernel, vm and net areas. The defaults are not meant for a Datawarehouse Work Load and need to be changed.

Most if not all tunables are located under /proc and /sys. The primary method of changing them is sysctl, though a simple echo also works. Permanent changes require entries in /etc/sysctl.conf.
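For example, a tunable can be changed on the fly or made persistent as shown below (using vm.swappiness purely as an illustration):

# one-off change, lost at the next reboot
echo 10 > /proc/sys/vm/swappiness
sysctl -w vm.swappiness=10

# persistent change - add the line to /etc/sysctl.conf and reload
# vm.swappiness = 10
sysctl -p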

2. Disabling Daemons

The daemons that would need to be disabled are

apmd, atd, arptables_if, autofs, cpuspeed, cups*, gpm, haldaemon, hpoj, irqbalance, isdn, kudzu, netfs, nfslock, pcmcia, portmap, rawdevices, rpc*, smartd, xfs

The default run-level should be 3 (no X-Windows).

Disable unwanted local terminals in inittab.

And it goes without saying – no SELinux.
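A sketch of how this clean-up might be scripted on a RHEL 4 class system (the service names are a subset of the list above; adjust for the cups* and rpc* services present on your build):

# disable unwanted services and stop them immediately
for svc in apmd atd arptables_if autofs cpuspeed gpm haldaemon hpoj irqbalance isdn kudzu netfs nfslock pcmcia portmap rawdevices smartd xfs; do
    chkconfig $svc off
    service $svc stop
done

# default run-level 3 (no X-Windows) - in /etc/inittab:
# id:3:initdefault: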

3. Kernel - CPU/Shared Memory/Interrupts/Scheduler etc

Path - /proc/sys/kernel, /proc/irq

Interrupt handling - When running a RAC system on Linux, you are going to see a ton of interrupts being generated. It is best if they are handled by a dedicated CPU (or CPUs).

First, identify the interrupts - cat /proc/interrupts

[mkrishna@viveka] /proc$ more interrupts
     CPU0 CPU1 CPU2 CPU3
0: 35170931 35189602 35199982 30298138 IO-APIC-edge timer
1: 1 1 0 1 IO-APIC-edge i8042
8: 117 141 118 104 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
12: 9 4 21 1 IO-APIC-edge i8042
14: 2 3 631163 2 IO-APIC-edge ide0
50: 496 14071456 0 0 PCI-MSI eth4
58: 160 0 3927374 0 PCI-MSI eth2
66: 46 0 0 70296944 PCI-MSI eth5
74: 32732541 0 0 0 PCI-MSI eth1
169: 802517 577157 818282 11375 IO-APIC-level lpfc,
177: 22 469 3303630 21 IO-APIC-level
193: 765410 610839 742104 3170 IO-APIC-level lpfc,
217: 235360 285364 2045052 1314 IO-APIC-level
233: 14633072 0 0 0 PCI-MSI eth0

Then cd to /proc/irq, where you will see a directory for each interrupt.

[mkrishna@viveka] /proc/irq$ ls
0 1 10 11 12 13 14 15 169 177 185 193 2 217 233 3 4 5 50 58 6 66 7 74 8 9 prof_cpu_mask

cd into the relevant directory and write the CPU mask to smp_affinity. The mask is a hex bitmap with one bit per CPU; for example, mask 04 binds IRQ 169 (one of the lpfc HBAs above) to CPU2:

echo 04 > 169/smp_affinity
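Extending the same idea, the busy NIC interrupts from the listing above could each be pinned to their own CPU. This is only a sketch - the IRQ numbers are the ones from that listing and will differ on your system:

# bind eth1 (IRQ 74) to CPU0, eth2 (IRQ 58) to CPU1, eth4 (IRQ 50) to CPU2
echo 01 > /proc/irq/74/smp_affinity
echo 02 > /proc/irq/58/smp_affinity
echo 04 > /proc/irq/50/smp_affinity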

CPU Affinity and Scheduler - Soft and Hard

Normally a DW consists of very long-running single-threaded processes, which are more efficient when context switches are reduced (less CPU ping-pong). However, the new scheduler in 2.6.x is supposed to be self-tuning, so once the kernel is built you cannot change it. If you do have the option of building your own kernel, you can change some parameters –

http://josh.trancesoftware.com/linux/linux_cpu_scheduler.pdf

What is surprising is that there do not seem to be any statistics available on the scheduler either.

In Kernel 2.6.23 and higher -

sched_compat_yield - I do not have any information on this.

Shared Memory, Semaphores and Message Queue settings -

There are a couple of formulas out there which allow you to set the shared memory, semaphore and message queue settings based on the number of connections, the size of physical memory and the number of instances.

The various parameters I generally set are

Shared Memory - The defaults need to be changed.

shmall - This file shows the system wide limit on the total number of pages of shared memory. The default value is 2097152 pages. The default page size is 4096 bytes.

This equates to 8GB of shared memory which is surprisingly high for a default configuration.

shmmax - This file can be used to set the limit on the maximum size of the shared memory segment that can be created. This value defaults to 33554432 bytes (32MB). You should set this to the size of the oracle SGA + x% where x depends on other applications and the need to change the SGA. The intent is to avoid Shared Memory fragmentation.

Under no circumstances should it be equal to the physical memory on the system; I would suggest that shmmax not exceed 75% of physical memory.

shmmni - This file specifies the system-wide maximum number of IPC shared memory segments that can be created. The default value is 4096. I would suggest reducing it to around 200; I would hate to see 4096 segments on my systems.

Physical Memory > shmall >= shmmax

For example -

Physical memory = 32GB

shmall = 20GB (Maximum shared memory that can be allocated)

shmmax = 18GB (Single biggest segment possible)

The point to keep in mind is that for a large Datawarehouse you would have a PGA of around 10GB, and the PGA does not use shared memory (though it can be forced to in 10g). The PGA has a tendency to run away and consume all physical memory, so you always want to keep a good buffer between shmall and physical memory.
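Translating that example into sysctl entries (remember that shmall is counted in pages while shmmax is in bytes; the figures below correspond to the 32GB/20GB/18GB example and are illustrative only):

# /etc/sysctl.conf - shared memory for the example 32GB node
kernel.shmmax = 19327352832   # 18GB in bytes - largest single segment
kernel.shmall = 5242880       # 20GB expressed in 4K pages
kernel.shmmni = 200           # far fewer segments than the 4096 default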

Semaphores -

The semaphore limits below are all set together through the /proc/sys/kernel/sem variable.

The sem file contains 4 numbers defining limits for System V IPC semaphores. These fields are, in order:

* SEMMSL - the maximum number of semaphores per semaphore set.

* SEMMNS - a system-wide limit on the number of semaphores in all semaphore sets.

* SEMOPM - the maximum number of operations that may be specified in a semop(2) call.

* SEMMNI - a system-wide limit on the maximum number of semaphore identifiers.

The default values are "250 32000 32 128".
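As an illustration, all four values go on a single line. The figures below are the ones commonly quoted for Oracle installs (in SEMMSL, SEMMNS, SEMOPM, SEMMNI order) - verify them against the installation guide for your release:

# /etc/sysctl.conf
kernel.sem = 250 32000 100 128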

Message Queues -

msgmax - The msgmax tunable specifies the maximum allowable size of any single message in a System V IPC message queue, in bytes. msgmax must be no larger than msgmnb (the size of a queue). The default is 8192 bytes.

msgmnb - The msgmnb tunable specifies the maximum allowable total combined size of all messages queued in a single given System V IPC message queue at any one time, in bytes. The default is 16384 bytes.

msgmnb must always be greater than msgmax.

msgmni - The msgmni tunable specifies the maximum number of system-wide System V IPC message queue identifiers (one per queue). The default is 16.
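Again as a sketch, the corresponding entries would look like the following; the values are purely illustrative and should be derived from the application's requirements:

# /etc/sysctl.conf
kernel.msgmax = 8192     # largest single message
kernel.msgmnb = 65536    # total bytes per queue - must exceed msgmax
kernel.msgmni = 1024     # number of queue identifiers system-wide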

The next article will cover the Linux Virtual Memory subsystem.

Friday, October 5, 2007

Basic Performance Tuning settings for the Solaris OS for Datawarehouse Loads

Below are some settings I typically change/enable on the Solaris OS for a DW environment. I generally change the default settings for the OS, Network, VxVM/VxFS, Emulex HBAs and storage connectivity. All these parameters need to be tested thoroughly in a non-prod environment before being implemented.

/etc/system (Both Vxvm and OS settings)

Set the File Descriptors - The defaults are way too low and need to be bumped up. For Solaris 10, increasing the rlim_fd_cur should suffice.
  • rlim_fd_max
  • rlim_fd_cur
set rlim_fd_max=8192
set rlim_fd_cur=4096


Default:
Solaris 9
    • rlim_fd_max = 1024
    • rlim_fd_cur = 64
Solaris 10
    • rlim_fd_max = 65,536
    • rlim_fd_cur = 256
To improve basic disk performance
  • maxphys is the maximum size of a physical I/O request. If a driver sees a request larger than this, it breaks the request into maxphys-sized chunks. File systems can and do impose their own limits. This value should be higher than all the related settings in the filesystem/volume manager layers (vol_maxio, vol_maxspecialio etc). maxphys is set in bytes.
The below sets it to 8MB.

set maxphys=8388608

Default:
Solaris 9 and 10
maxphys = 131072 (128K)
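A quick way to confirm that the value took effect after a reboot is to read the kernel variable with mdb (a sketch - the variable name is the same one set in /etc/system):

# print maxphys in decimal from the running kernel
echo "maxphys/D" | mdb -k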

Virtual Memory values - The values below are best explained in the book Solaris Internals. They tie into the VM and play a vital role during heavy memory operations.
  • maxpgio - Maximum number of page I/O requests that can be queued by the paging system. This number is divided by 4 to get the actual maximum used by the paging system. It is used to throttle the number of requests as well as to control process swapping. maxpgio is in I/O's.
The funny part is that as per Sun Docs, the Range for maxpgio is from 1 to 1024, but it can be set as high as 65536.

set maxpgio=65536
  • slowscan - Minimum number of pages per second that the system looks at when attempting to reclaim memory. Folks set either slowscan or fastscan. I prefer to set slowscan.
set slowscan=500

  • tune_t_fsflushr - Specifies the number of seconds between fsflush invocations.
set tune_t_fsflushr=5
  • autoup - Along with tune_t_fsflushr, autoup controls the amount of memory examined for dirty pages in each invocation and the frequency of file system sync operations.
set autoup= 300

On systems with more than 16GB memory, to reduce the impact of fsflush on the system, it is best to set autoup to higher values.

Default:
Solaris 9 and 10
maxpgio = 40
slowscan = The smaller of 1/20th of physical memory in pages and 100.
tune_t_fsflushr = 5
autoup = 30

CPU Affinity and Context switches -
  • rechoose_interval - This setting tries to run a thread on the same CPU it ran on before. The understanding is that the CPU cache is still warm, holding the instructions and data for the thread, which improves efficiency. The rechoose_interval variable instructs the kernel on which CPU to select to run a thread if a choice needs to be made. If a process hasn't run within rechoose_interval ticks it will be moved to another CPU; otherwise it will continue to wait on the CPU it has been running on. A higher value of rechoose_interval "firms up" the soft affinity. The downside is that you can end up with sluggish spreading of processes when a single process forks a lot of children, if the value is too high.
The below sets it to 150, which is a fairly good value for Datawarehouse systems. However, you need to test and see whether it reduces your LWP thread migrations.

set rechoose_interval=150

Default: 3

VxVM System kernel parameters -
  • vol_maxio - I/Os larger than this are broken up in the Veritas VxVM layer. Physical I/Os are broken up based on the disk's capabilities and are unaffected by this logical I/O size setting.
The below sets it to 8MB which is the same as the maxphys.

set vxio:vol_maxio=16384

Default: 512 sectors. Remember that 512 sectors = 256KB
  • vol_maxioctl - The size of the largest ioctl request that VxVM will handle; anything bigger is broken down. ODM uses ioctl, so it makes sense to make this bigger than the biggest read/write request that Oracle can issue. The below sets it to 128K, which is the maximum.
set vxio:vol_maxioctl=131072

Default: 32 KB
  • vol_maxspecialio - The size of the largest I/O handled by an ioctl call issued by the application (such as Oracle when using ODM). The ioctl itself may be small, but it can request a large I/O operation.
The below sets it to 8MB which is the same as the maxphys and vol_maxio.

set vxio:vol_maxspecialio=16384

Default: 512 sectors

  • vol_default_iodelay - Count in clock ticks that utilities will pause between issuing I/Os if they have been directed to throttle down but haven't been given a specific delay time. Utilities that resynchronize mirrors or rebuild RAID-5 columns use this value.
set vxio:vol_default_iodelay=10

Default: 50 ticks
  • voliomem_chunk_size - The granularity of memory chunks used by VxVM when allocating or releasing system memory. A larger granularity reduces CPU overhead due to memory allocation by allowing VxVM to retain hold of a larger amount of memory.
The below sets it to 128K which is the maximum.

set vxio:voliomem_chunk_size=131072

Default: 64KB
  • voliomem_maxpool_sz - The maximum memory requested from the system by VxVM for internal purposes. This tunable has a direct impact on the performance of VxVM as it prevents one I/O operation from using all the memory in the system.
The below sets it to 128M which is the max.

set vxio:voliomem_maxpool_sz=134217728

Default: 5% of memory up to a maximum of 128MB.

Shared Memory settings -

I will cover these in a later discussion.

Vxfs Settings -

When using direct I/O (with ODM or the forcedirectio mount option), it does not matter how you set the prefetch or write-back policies for volumes containing Oracle data files, as all I/O bypasses the filesystem buffer cache. However, you can still set read-ahead and write-back for other volumes - application, Oracle binaries etc.

The below are the default settings for a concat filesystem (vxfs version 4.1)

read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 8
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 162688000
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0

You could set much higher values than the defaults and see if it helps improve performance for non-Oracle data file volumes. Normally, you would set the read-ahead and write-back in proportion to how your Raid Group is set up on the Array.

For example -

For a Raid 5 RG, say 6D+1, the read-ahead would be 6 × (I/O size of the Array) and the write-back the same. You could use multiples of these values too. If using an HDS Array (9585 or AMS1000), you would use 64K as the I/O size, so the read-ahead and write-back can be 384K or multiples of 384K, as in the sketch below. The read-nstream and write-nstream settings would not be required in such cases.
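For instance, applying the 384K figure to a filesystem that does not hold Oracle data files might look like the sketch below (/apps is a made-up mount point; consult the vxtunefs(1M) man page and your array documentation for the exact values):

# set read-ahead/write-back to 384K (393216 bytes)
vxtunefs -o read_pref_io=393216 /apps
vxtunefs -o write_pref_io=393216 /apps

# to make the values persistent, add an entry to /etc/vx/tunefstab, e.g.
# /dev/vx/dsk/appdg/appvol read_pref_io=393216,write_pref_io=393216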

For data files, it is always better if the application (such as Oracle) handles the read-ahead and write-back. Oracle has its own buffer cache and is intimately aware of what data is required to satisfy user requests.

HBA Settings -

The settings below are for an Emulex HBA. Similar configs exist for QLogic too.
  • lun-queue-depth
  • tgt-queue-depth
Many if not all storage vendors will encourage you to set lun-queue-depth to 8 and tgt-queue-depth to 0. This setting favors the storage vendor, not the customer. The reasoning is that controllers are limited in the number of outstanding requests they can sustain, and this information is not published. However, it is highly improbable that you would ever reach those limits, and setting 8 and 0 cripples a system. Also, do not set sd_max_throttle at all - this is again vendor-propagated advice. With today's arrays you should not see SCSI retry errors at all.

lun-queue-depth=30;
tgt-queue-depth=256;
  • num-iocbs
  • num-bufs
  • discovery-threads
Do not go with the defaults; all these values need to be bumped up significantly to get good performance. The man page for lpfc is good reading material. You can expect some questions from your switch vendor; however, insist on the values you set.

num-iocbs=4096;
num-bufs=2048;
discovery-threads=32;
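These entries typically live in the Emulex driver configuration file - /kernel/drv/lpfc.conf on Solaris (verify the path and parameter names against your driver version) - and a reconfigure reboot is usually needed for them to take effect:

# /kernel/drv/lpfc.conf (excerpt)
lun-queue-depth=30;
tgt-queue-depth=256;
num-iocbs=4096;
num-bufs=2048;
discovery-threads=32;

# then a reconfigure reboot
touch /reconfigure
init 6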

Network Settings -

I have observed that a Solaris 10 system running mpath can easily send/receive 60MB/sec on a gigabit link. This of course also depends on how you push the traffic down the pipe. Normally I set the parameters below using the nddconfig script provided by SUNWjass.
  • tcp_maxpsz_multiplier - Increasing this parameter means we do fewer copy operations, with more data copied per operation.
tcp_maxpsz_multiplier=10
  • tcp_wscale_always - To ensure window scaling is available, at least on the receiving side.
tcp_wscale_always=1
  • tcp_cwnd_max - This parameter sets the maximum size to which the congestion window can be opened. This plays a vital role on gigabit links and can give you incredible results.
tcp_cwnd_max=33554432
  • tcp_max_buf - The maximum buffer size in bytes. It controls how large the send and receive buffers are set to by an application using setsockopt(3XNET). This plays a vital role in gigabit links and can give you incredible results.
tcp_max_buf=67108864
  • tcp_xmit_hiwat - This parameter influences a heuristic which determines the size of the initial send window. The actual value will be rounded up to the next multiple of the MSS, e.g. 8760 = 6 * 1460. On Solaris 10 the default is 1MB, so it would not need to be changed.
tcp_xmit_hiwat=1048576
  • tcp_recv_hiwat - This parameter determines the maximum size of the initial TCP receive buffer. The specified value will be rounded up to the next multiple of the MSS. On Solaris 10, the default is 1MB.
tcp_recv_hiwat=1048576
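Outside of the SUNWjass nddconfig script, the same values can be applied by hand with ndd - a sketch is below. Settings made this way do not survive a reboot unless they are also placed in an rc script or the nddconfig configuration:

ndd -set /dev/tcp tcp_maxpsz_multiplier 10
ndd -set /dev/tcp tcp_wscale_always 1
ndd -set /dev/tcp tcp_cwnd_max 33554432
ndd -set /dev/tcp tcp_max_buf 67108864
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576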

Tuesday, September 18, 2007

Raid 5 Versus Raid 10 for Datawarehouse

Since these days there are considerable efforts to reduce costs and meet Wall Street demands, Raid 5 has increased in popularity. On the surface, Raid 5 offers redundancy and reasonable performance, depending on the profile of the application. There is an increasing push to use Raid 5 for Warehousing too.

However as someone famous said

"Money shows us the price of everything but never the real cost" - so befitting Raid 5.

Cost is taken over the lifetime of the solution and could be equated to TCO. Cost is expecting a report to run in 5 minutes but having it take 30 minutes. Cost is the time lost waiting for those extra 25 minutes. Unfortunately, today's Wall Street gurus are short-sighted: cheap today = money saved today = 10 times that money lost over 3 years. Penny wise and pound foolish!

Raid 5 would be ideal in a scenario with small writes (all activity to the array being more or less random) and small reads. Small meaning I/Os sized approximately between 4K and 32K. The number of IOPS would depend on the controller's capability. The theory is that the Array caches the writes and then de-stages them to the disks. Similarly, small reads (if perceived to be sequential) are pre-fetched by the controllers, and since the reads are small there is no big penalty for slower reads.

The above profile can be classified as a Classic OLTP Database.

Take the case of DW - typical load patterns are
  1. Large Random Reads - I/Os vary in size from 128K to 1M, at anywhere between 500 and 10,000 IOPS. It is fairly common to see traffic ranging from 350MB/sec to 600MB/sec during periods of heavy activity.
  2. Large Random writes - Same as above.
The general impression is that DW consists of large sequential I/O, and from an Oracle perspective that is true.

I/O from an Oracle perspective -

Oracle --> Segments --> Extents --> DB Blocks --> Filesystem Blocks --> Volume --> Luns

While the access profile from segment to extent to db block may be sequential, at the filesystem, volume and lun level it may not be. If Oracle has 500 data files, these data files are spread across 10 volumes on 50 luns, and Oracle is doing I/O to 300 of those files, then from a volume/lun perspective the I/O would mostly be random.

It is up to the various I/O schedulers (Volume, HBA, Array) to merge/sort the I/O requests to minimize head movement. All this has to happen within a short period of time, otherwise you would see I/O starvation. To sum it up, one can say that access from an Array perspective would be mostly random.

So how does Raid 5 handle Random Reads/Writes?

Raid 5 on a Storage Array is typically denoted as a 4D+1 or 6D+1 Raid Group etc., meaning 4 or 6 data drives and 1 parity drive. Data is written in stripes (stripe width on a disk ranging from 64K to 384K or higher), with parity generated for every stripe and written to one of the drives. Arrays do not give you the option of setting big stripe widths (>1MB) due to the cost of parity calculation.

Parity generation is a CPU-intensive task and can be quite taxing when activity on the array is high. The bigger the stripe width, the more time is spent calculating parity.

During a Random Read, prefetch is not going to work as data access is random. There are going to be cache misses, and since data is striped it can be retrieved only in stripe-width units. So if the database requests 2MB of records, the array has to divide that 2MB by the stripe width and perform that many operations to retrieve the data.

2MB of data = 32 reads (64K stripe width, worst case).

Worse still are Random Writes - Oracle issues a write to a block and the Array has to retrieve the entire stripe holding the data -

E.g. -

  • Raid Group = 4D+1
  • Size of Lun = 10G.
  • Stripe Width = 64K (on a single disk)
  • Number of Stripes = 10G / (64K × 4) = ~40,000
If the data is in stripe 10001, the array has to retrieve the entire stripe, make the change, calculate the new parity and then update the parity for that stripe.

Under normal circumstances, the Array has sufficient cache to hold these writes and then destage them to the disk. The array will receive the change in cache and issue a write complete to the host.

However cache is limited and when the cache fills up, the host volumes will suddenly show a huge spike in write timings (>1000ms).

The ideal storage for DW is Raid 1 on the Array and Raid 0 using these luns on the system. In my experience, Modular Arrays such as the Hitachi 9585 or AMS1000 perform ideally for DW requirements - provided they are configured correctly. In tests which I performed, Raid 1+0 simply ran circles around Raid 5, especially when it came to Large Random Reads.

Oracle suggestions of using SAME (Stripe and Mirror) are very good for Datawarehousing.


Veritas Volume Manager does a better job than an array in handling writes/reads to striped volumes. The idea is to let Volume Manager handle the pre-fetch and the write staging.

So how does an Array handle Raid 1 requests?

Random Reads - No prefetch, as access is random in nature; however, the array reads simultaneously from both drives in chunks of 500K or as per the Array specs. The Volume Manager does any prefetch required.

Random Writes - Uses cache to destage. No parity calculation, so writes are much faster. Writes are expensive because we are writing to 2 disks, but not as expensive as writing to 5 or more disks as in Raid 5.

Raid 0 using a 1MB stripe width on the system will help with large reads. Writes will be efficient if they occur in multiples of 1MB chunks.

All in all, Raid 1+0 is a winner. It is best to use Modular arrays as these are cheaper and can be dedicated to Warehousing.

Monday, September 17, 2007

Today - Sept 17, 2007, I finally took the step to start my own blog.