Saturday, November 17, 2007

NAS devices for Oracle - Netapp versus HP Enterprise File Services Clustered Gateway

Continuing on my earlier topic about Oracle and NAS, the ideal NAS device for Oracle would need to support the following (from a tactical perspective).
  • Supported by Oracle
  • Symmetric active controllers so that load-balancing, linear scaling and HA are possible.
  • 10 Gigabit Ethernet capabilities with jumbo frames.
  • Supports both Unix and Linux clients.
  • Support for Direct and Async I/O (see the sketch after this list).
  • Ability to map luns as an initiator (leverage existing SAN infrastructure).
  • Direct I/O to luns bypassing the NAS cache as an option, so that the NAS basically functions as a gateway device doing protocol conversion only.
  • Snapshots at the NAS level and not at the SAN level. This gives the ability to use different storage for backups and keeps them better controlled.
  • Ability to carve volumes from the luns in any manner requested by the end-user (RAID 0, RAID 10, RAID 5, etc.), with end-user-specified stripe widths.
  • Set the pre-fetch and write-back policies for the NFS filesystems on the NAS head.
  • Fine-tune the I/O aspects of both NAS and SAN components.
  • Last but not least - easy to support and maintain.
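
As a quick sketch of the Oracle side of the direct/async I/O point above: on filesystems (NFS included), the FILESYSTEMIO_OPTIONS initialization parameter asks Oracle to use direct and/or asynchronous I/O where the platform supports it. This is purely an illustration - check the supported values for your platform and release in the Oracle documentation:

filesystemio_options = SETALL     # init.ora/spfile - request both async and direct I/O on filesystems
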
It goes without saying that any NAS should have a decent I/O scheduler sitting in front of the luns which aggressively sorts/merges I/O to reduce disk head movements.

While Netapp cannot do much of the above, there is a new player in the market - the HP Enterprise File Services Clustered Gateway.

The HP solution for Unix is built on the Suse Linux platform using the PolyServe cluster filesystem. It is built to serve as a NAS head utilizing HP-branded storage arrays (I assume they would support other arrays, as long as drivers exist for Suse Linux). I personally feel this is a brilliant idea - marrying a cluster filesystem/volume manager with an OS and making a NAS head that can scale horizontally. The OS gives manageability plus SAN capabilities out of the box, and the cluster filesystem/volume manager gives horizontal scalability and custom volume management. While Linux is not the best performer when it comes to networking (compared to Netapp or Solaris), with all the other goodness thrown in, it is a mighty force indeed.

It meets most of the requirements listed above; however, I have not yet had the opportunity to test it in a live environment. I think this is a great product to watch out for.

Thursday, November 15, 2007

NAS and SAN - How Oracle could shape the future of storage technologies

Oracle is today's leading database, deployed in a great many shops, and Oracle in turn is one of the prime drivers for storage in a company. Without the need for databases, and highly available databases in particular, I doubt that the SAN technologies we see today would look the way they do. With RAC on 11g, Oracle is crossing new boundaries and again re-defining storage technologies.

I had been working for some time on putting together a simple, cost-effective and high-performing RAC solution. Though the basic concept of RAC is good, getting it all to work together in a simple, efficient manner is not an easy task. There are many components, and for me the SAN was the biggest pain area.

Having a strong background in storage technologies, I quickly came to the conclusion that in order to make it easy to deploy and maintain, one would need to replace the SAN and its associated pieces (cluster filesystem, cluster volume manager, clustering and such) with a simpler technology. Here comes Oracle with ASM and CRS (adapted from Digital/Compaq clusterware) to make things simpler. While the goal of ASM is admirable, it is far from a perfect solution. For instance, taking backups is a pain using RMAN. One would prefer snapshots (either SAN-based or Flashsnap), but snapshots would not work since ASM exposes no visible filesystem.

The way ASM stripes files can become a bottleneck when you have a large number of disks in a single disk group. So here we go again, having to plan the disk groups, lun sizing and so on. For an Enterprise customer with a significant amount of data, ASM is simply another bottleneck. The future is towards simpler and easier technologies to support.

Luckily Oracle comes to the rescue with support for NFS. NFS has been supported since 9i (with some really large-scale implementations), however 10g RAC is a lot more implement/support friendly. Until now, NFS was used only by dot-coms or companies who could not afford SANs. NFS used up CPU (system time), could not be easily load-balanced in a reliable way and was simply slower than a SAN.

With the advent of 10 Gigabit Ethernet, TCP Offload Engines (TOE) and multiple symmetric active NAS heads, it looks to me that NAS is finally on par with, or even better than, the SAN.

  • Oracle supports NFS out of the box as a file-system (either cluster or otherwise).
  • NAS is cheap, easy to deploy and maintain. (since it uses existing Ethernet infrastructure)
  • NFS is a cluster filesystem with a proven track record and inbuilt support within the OS.
  • NAS devices give the same kind of cloning/snapshot technology as the SAN for easy backups.
  • NFS v3 is really fast when configured correctly with well-defined tunables - no need to hunt for hidden parameters to increase your payload (an example mount entry follows this list).
  • 10 Gigabit Ethernet, jumbo frames and TOE cards give blazing performance with low system overhead.
  • One can easily bond/multipath multiple cards at the host for superior performance (transparent to the application).
  • Now, NAS heads offer multiple active controllers with transparent failover.
  • NFS v3 supports direct I/O, and Oracle supports async I/O on NFS.
  • No fancy volume managers or filesystems required to support NFS. Open Standard.
  • It is platform independent and with proper planning well scalable.
  • Easy for system operators and database operators to understand and use.
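
To make the "configured correctly" point above concrete, here is the sort of /etc/fstab entry commonly used for Oracle datafiles over NFS v3 on Linux. Treat it as an illustrative sketch - the hostname and export path are made up, and the exact options should be checked against the Oracle and NAS vendor documentation for your platform:

nashead01:/vol/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600,actimeo=0  0 0
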
Oracle has gone a step further in 11g, embedding the NFS client directly into the Oracle kernel (the Direct NFS client). It has also incorporated load balancing across multiple paths to the NAS head.

While the same amount of attention needs to go into designing a NAS solution for Oracle as a SAN (RAID configuration, lun layout, cache, controllers, failover etc.), it is at least a lot simpler to deploy than a SAN and its associated baggage.

I believe that if Oracle 11g with NFS is a success, then we are going to see considerable movement from SAN to NAS. Obviously the existing SAN infrastructure would simply become a backend for NAS heads, but we can still considerably reduce the cost and complexity of the infrastructure.

Thursday, November 8, 2007

Basic Performance Tuning for the Linux OS for Datawarehouse Loads (contd.) - Virtual Memory

My earlier article talked about tuning the CPU subsystem in Linux to meet Datawarehouse requirements. This is a continuation of that article, covering the Virtual Memory subsystem.

VM Subsystem


This is by far one of the most important subsystems on Linux. Unfortunately, with every release of the kernel you get a dozen new tunables and a couple of old ones gone, with little or no documentation whatsoever. The aim is to use the VM only for memory management and never rely on the filesystem cache. Oracle should handle the buffering (Buffer Cache) and use Direct + Async I/O for all disk activity. We should never see swapping as shown in vmstat -s.

The path for the tunables is /proc/sys/vm

swappiness - Controls how willing the kernel is to swap out application memory in order to keep the page cache; higher values favour the page cache, lower values favour keeping application pages in memory.

For a database such as Oracle using ODM, you do not want to use the page cache at all. ODM does Async + Direct I/O and Oracle does its own buffering internally anyway, so you can very well avoid unnecessary page cache. Not only does the page cache use up memory, but a high swappiness also means the kernel spends time swapping application pages out to disk when other processes require the memory.

The default is 60, which is on the higher side. I would suggest setting it to a much lower value.

overcommit_memory – Make sure it is set to 0

min_free_kbytes – The amount of free memory that needs to be reserved, specified in kilobytes. I suggest keeping it at a minimum of 512-1024MB (524288-1048576) for a 32GB system.

lower_zone_protection – This helps prevent Linux from dying with an OOM (Out Of Memory) error. It is a good idea to set it to around 200MB. The documentation is quite vague (not surprising) as to whether it is in pages or in MB. On a 32GB system, 200 would seem to be an optimal value.

page-cluster – The number of pages written to disk in one go in the event of swapping (which we never want to see); the value is a power of two, so 3 means 8 pages. Since modern disk subsystems can easily handle large requests, I would suggest raising it from the default of 3 (8 pages) to around 8 (256 pages). Each page on an x86_64 system is 4K in size.

dirty_ratio and max_queue_depth - These need more cause-and-effect study.
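
Putting the VM tunables above into /etc/sysctl.conf form for a 32GB node, with values that are illustrative starting points in line with the suggestions above rather than recommendations to apply blindly:

vm.swappiness = 10                # default is 60; keep the kernel from favouring the page cache
vm.min_free_kbytes = 524288       # roughly 512MB reserved (the value is in KB)
vm.lower_zone_protection = 200    # see the note on units above
vm.page-cluster = 8               # 2^8 = 256 pages per swap-out request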

TLB and Huge Pages

Huge Pages on Linux are a little similar to the functionality offered by (D)ISM and MPSS on Solaris. Huge pages cannot be paged out to disk; they improve efficiency and reduce TLB misses.

hugetlb_shm_group – Set this to the GID of the Oracle group (dba or oinstall) so that the Oracle user can use hugepage-backed shared memory.

nr_hugepages – Number of huge pages to allocate. The default size of a hugepage is 2MB on x86_64.

nr_hugepages = shmmax / (2 * 1024 * 1024)

So for 18GB of shmmax, you would set nr_hugepages to 9216.

The important thing to note is that hugepages do not get paged to disk, so if configured incorrectly you will run into problems. You can also see ORA-4030 errors if you are running very large queries and have a runaway PGA.

So to summarize,

Physical Memory > shmall > shmmax (= nr_hugepages * 2MB)
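
As a worked sketch of the hugepage sizing above for an 18GB shmmax (the dba group GID of 501 is an assumption for illustration):

vm.nr_hugepages = 9216            # 18GB / 2MB hugepages = 9216
vm.hugetlb_shm_group = 501        # GID of the dba/oinstall group on this system (assumed 501)
grep Huge /proc/meminfo           # verify the allocation after applying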

Basic Performance Tuning for the Linux OS for Datawarehouse Loads - Kernel and CPU

Apart from the basic stuff such as disabling daemons which do not provide any value in a server environment, there are a number of parameters which need to be changed from the defaults for Linux RAC to perform satisfactorily in a Datawarehouse environment.

Datawarehousing demands a high-performing I/O subsystem and well-tuned Virtual Memory and CPU subsystems.

This document is geared more towards using Linux in a RAC environment, as that is currently about the only way Linux can compete with the Big-Iron systems - scale horizontally rather than vertically. This document also does not give any actual values; I believe values for the parameters need to be tested thoroughly before deploying in any environment, and they differ depending on the load and the requirements of the customer.

While I am no kernel hacker, much of this information is from a tactical perspective and from real-life scenarios.

I am referring only to the 64-bit 2.6.x kernels (RHAS 4.0), which are a significant improvement over earlier versions. They handle load relatively well, but scalability and support are still not good enough compared to Solaris. The kernel referred to in this document is specifically 2.6.9-42.

In Enterprise Deployments, compiling the kernel from source is not an option (due to lack of support), so this write-up is about settings that can be changed without requiring a recompile. Also Enterprise customers use Veritas Volume Manager/File system for storage management.

I am assuming that the systems being used are 64-bit, with 8 CPUs and 32GB of memory – something like a Dell 2900. Looking at anything smaller than that is probably not worth the time/effort. An average of 4 nodes for a 3TB Datawarehouse with around 600-1000 users would be a good start. It is a lot better to build a RAC with several big nodes rather than a lot of smaller nodes. Of course, my first choice would be a Solaris RAC with E2900 nodes – extremely fast and a great OS with excellent support.

Ideally, an x86_64 node would have 2 dedicated HBAs for primary storage, 2 interconnects for RAC and 2 interconnects for the CFS. It goes without saying that the NICs should be from the same vendor - meaning Intel or Broadcom throughout. There should be a minimum of 2 active paths to a lun.

Backups for the database would happen via the SAN using Shadow Image or Veritas Flashsnap. I personally favor Flashsnap since it is a lot more flexible and cost-effective. It is best to dedicate one node for backups/restores/management alone. Since a Dell 2900 costs around 10K, I think it is a good investment.

As to the model and make of the systems, any system that can sustain 8 CPUs with at least 8GB/sec of system bus bandwidth (CPU + Memory) and 4GB/sec of I/O bandwidth (Network + Storage) should be a good start.

1. Basics -

The main areas that need to be looked at for any warehouse are the I/O subsystem, Virtual Memory, Scheduler and CPU subsystem and finally the Network Subsystem.

In Linux, these would be the fs, kernel, vm and net areas. The defaults are not meant for a Datawarehouse workload and need to be changed.

Most if not all tunables are located under /proc and /sys. The primary method of changing the parameters is sysctl, though a simple echo will also do. Permanent changes require entries in the /etc/sysctl.conf file.
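
For instance, a tunable can be changed on the fly with either echo or sysctl, and made permanent via /etc/sysctl.conf (kernel.shmmni is used here purely as an illustration):

echo 4096 > /proc/sys/kernel/shmmni               # on the fly, via /proc
sysctl -w kernel.shmmni=4096                      # on the fly, via sysctl
echo "kernel.shmmni = 4096" >> /etc/sysctl.conf   # make it permanent
sysctl -p                                         # re-read /etc/sysctl.conf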

2. Disabling Daemons

The daemons that would need to be disabled are

apmd, atd, arptables_if, autofs, cpuspeed, cups*, gpm, haldaemon, hpoj, irqbalance, isdn, kudzu, netfs, nfslock, pcmcia, portmap, rawdevices, rpc*, smartd, xfs

The default run-level should be 3 (no X-Windows).

Disable unwanted local terminals in inittab.

And it goes without saying – no SELinux.
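
A minimal sketch of how this is typically done on a RHEL 4 class system; the service names are a subset of the list above, so check what is actually installed before disabling anything:

for svc in apmd atd autofs cpuspeed cups gpm haldaemon irqbalance isdn kudzu netfs nfslock pcmcia portmap rawdevices smartd xfs; do
    chkconfig $svc off            # do not start at boot
done

# run-level 3 (no X-Windows): the initdefault line in /etc/inittab should read
# id:3:initdefault: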

3. Kernel - CPU/Shared Memory/Interrupts/Scheduler etc

Path - /proc/sys/kernel, /proc/irq

Interrupt handling - When running a RAC system on Linux, you are going to see a ton of interrupts being generated. It is best if they are handled by dedicated CPU(s).

The first step is to identify the interrupts - cat /proc/interrupts

[mkrishna@viveka] /proc$ more interrupts
CPU0 CPU1 CPU2 CPU3
0: 35170931 35189602 35199982 30298138 IO-APIC-edge timer
1: 1 1 0 1 IO-APIC-edge i8042
8: 117 141 118 104 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
12: 9 4 21 1 IO-APIC-edge i8042
14: 2 3 631163 2 IO-APIC-edge ide0
50: 496 14071456 0 0 PCI-MSI eth4
58: 160 0 3927374 0 PCI-MSI eth2
66: 46 0 0 70296944 PCI-MSI eth5
74: 32732541 0 0 0 PCI-MSI eth1
169: 802517 577157 818282 11375 IO-APIC-level lpfc,
177: 22 469 3303630 21 IO-APIC-level
193: 765410 610839 742104 3170 IO-APIC-level lpfc,
217: 235360 285364 2045052 1314 IO-APIC-level
233: 14633072 0 0 0 PCI-MSI eth0

Then cd to /proc/irq and you will see all the interrupts.

[mkrishna@viveka] /proc/irq$ ls
0 1 10 11 12 13 14 15 169 177 185 193 2 217 233 3 4 5 50 58 6 66 7 74 8 9 prof_cpu_mask

cd into the directories and change smp_affinity to the desired CPU mask.

echo 04 > 169/smp_affinity
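
smp_affinity is a hexadecimal CPU bitmask, so the 04 above (binary 100) pins IRQ 169 (one of the lpfc HBA interrupts in the listing) to CPU2. As a sketch using the IRQ numbers from the sample output, the Ethernet interrupts could each be pinned to their own CPU in the same way:

echo 02 > /proc/irq/74/smp_affinity       # eth1 (IRQ 74) -> CPU1
echo 01 > /proc/irq/233/smp_affinity      # eth0 (IRQ 233) -> CPU0

Note that the irqbalance daemon (disabled in the daemons section above) would otherwise spread interrupts around and override this kind of manual placement.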

CPU Affinity and Scheduler - Soft and Hard

Normally a DW consists of very long-running, single-threaded processes, which are more efficient if context switches are reduced (less CPU ping-pong). However, the new scheduler in 2.6.x is supposed to be auto-tuning, so once the kernel is built you cannot change it. If you do have the option of building your own kernel, you can change some parameters –

http://josh.trancesoftware.com/linux/linux_cpu_scheduler.pdf

What is surprising is that there do not seem to be any stats available on the scheduler either.

In Kernel 2.6.23 and higher -

sched_compat_yield - I do not have any information on this.

Shared Memory, Semaphores and Message Queue settings -

There are a couple of formulas out there which allow you to set the shared memory, semaphore and message queue settings based on the number of connections, the size of physical memory and the number of instances.

The various parameters I generally set are

Shared Memory - The defaults need to be changed.

shmall - This file shows the system wide limit on the total number of pages of shared memory. The default value is 2097152 pages. The default page size is 4096 bytes.

This equates to 8GB of shared memory which is surprisingly high for a default configuration.

shmmax - This file can be used to set the limit on the maximum size of the shared memory segment that can be created. This value defaults to 33554432 bytes (32MB). You should set it to the size of the Oracle SGA + x%, where x depends on other applications and the need to resize the SGA. The intent is to avoid shared memory fragmentation.

Under no circumstances should it be equal to the physical memory on the system. I would suggest that shmmax not exceed 75% of physical memory.

shmmni - This file specifies the system-wide maximum number of IPC shared memory segments that can be created. The default value is 4096. I would suggest reducing it to around 200; I would hate to see 4096 segments on my systems.

Physical Memory > shmall >= shmmax (keeping in mind that shmall is expressed in pages while shmmax is in bytes)

For example -

Physical memory = 32GB

shmall = 20GB (Maximum shared memory that can be allocated)

shmmax = 18GB (Single biggest segment possible)

The point to keep in mind is that for a large Datawarehouse you would have a PGA of around 10GB, and the PGA does not use shared memory (though it can be forced to in 10g). The PGA has a tendency to run away and consume all physical memory, so you always want to keep a good buffer between shmall and the physical memory.
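
Translating the 32GB example above into /etc/sysctl.conf entries (shmall is specified in 4K pages and shmmax in bytes; the numbers simply re-express the 20GB and 18GB figures):

kernel.shmall = 5242880           # 20GB expressed in 4K pages
kernel.shmmax = 19327352832       # 18GB expressed in bytes
kernel.shmmni = 200               # reduced from the default of 4096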

Semaphores -

The semaphore limits below are all set using the single /proc/sys/kernel/sem variable.

The sem file contains 4 numbers defining limits for System V IPC semaphores. These fields are, in order:

* SEMMSL - the maximum number of semaphores per semaphore set.

* SEMMNS - a system-wide limit on the number of semaphores in all semaphore sets.

* SEMOPM - the maximum number of operations that may be specified in a semop(2) call.

* SEMMNI - a system-wide limit on the maximum number of semaphore identifiers.

The default values are "250 32000 32 128".
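
As a concrete sketch, the semaphore settings commonly quoted in Oracle's Linux installation guides take the following form (the four fields are SEMMSL, SEMMNS, SEMOPM and SEMMNI in that order; verify against the guide for your release):

kernel.sem = 250 32000 100 128    # /etc/sysctl.conf - keeps the defaults but raises SEMOPM from 32 to 100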

Message Queues -

msgmax - The msgmax tunable specifies the maximum allowable size of any single message in a System V IPC message queue, in bytes. msgmax must be no larger than msgmnb (the size of a queue). The default is 8192 bytes.

msgmnb - The msgmnb tunable specifies the maximum allowable total combined size of all messages queued in a single given System V IPC message queue at any one time, in bytes. The default is 16384 bytes.

msgmnb must always be >= msgmax.

msgmni - The msgmni tunable specifies the maximum number of system-wide System V IPC message queue identifiers (one per queue). The default is 16.
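
An illustrative set of message queue entries for /etc/sysctl.conf, respecting the rule above that msgmnb must be at least msgmax. The figures are typical of Oracle pre-installation recommendations for Linux rather than anything definitive:

kernel.msgmax = 8192              # largest single message, in bytes
kernel.msgmnb = 65536             # maximum combined size of all messages in one queue
kernel.msgmni = 2878              # system-wide number of queue identifiers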

The next article will cover the Linux Virtual Memory subsystem.