Thursday, November 8, 2007

Basic Performance Tuning for the Linux OS for Datawarehouse Loads - Kernel and CPU

Apart from the basic stuff such as disabling daemons which do not provide any value in a server environment, there are a number of parameters which need to be changed from the defaults to enable Linux RAC to perform satisfactorily in a Datawarehouse environment.

Datawarehousing demands a high-performing I/O subsystem, well-tuned Virtual Memory and careful CPU and scheduler configuration.

This document is more geared to using Linux in a RAC environment, as currently that is about the only way Linux can compete with the Big-Iron systems - scale horizontally rather than vertically. This document also does not give any actual values. I believe values for the parameters need to be tested thoroughly before deploying in any environment. Values differ depending on the load and the requirements of the customer.

While I am no kernel hacker, much of this information is from a tactical perspective and from real-life scenarios.

I am referring only to the 64-bit 2.6.x kernels (RHAS 4.0), which are a significant improvement over earlier versions. They handle load relatively well, but scalability and support are still not on par with Solaris. The kernel referred to in this document is specifically 2.6.9-42.

In Enterprise deployments, compiling the kernel from source is not an option (due to lack of support), so this write-up is about settings that can be changed without requiring a recompile. Also, Enterprise customers typically use Veritas Volume Manager/File System for storage management.

I am assuming that the systems being used are 64-bit with 8 CPUs and 32GB of memory – something like a Dell 2900. Looking at anything smaller than that is probably not worth the time/effort. An average of 4 nodes for a 3TB Datawarehouse with around 600-1000 users would be a good start. It is a lot better to build a RAC with a few big nodes rather than many smaller nodes. Of course, my first choice would be a Solaris RAC with E2900 nodes – extremely fast and a great OS with excellent support.

Ideally, an x86_64 node would have 2 dedicated HBAs for primary storage, 2 interconnects for RAC and 2 interconnects for CFS. It goes without saying that the NICs should be from the same vendor - meaning Intel or Broadcom throughout. There should be a minimum of 2 active paths to a LUN.

Backups for the database would happen via the SAN using Shadow Image or Veritas Flashsnap. I personally favor Flashsnap since it is a lot more flexible and cost-effective. It is best to dedicate one node for backups/restores/management alone. Since a Dell 2900 costs around 10K, I think it is a good investment.

As to the model and make of the systems, any system that can sustain 8 CPUs with at least an 8GB/sec system bus (CPU + Memory) and 4GB/sec of I/O bandwidth (Network + Storage) should be a good start.

1. Basics -

The main areas that need to be looked at for any warehouse are the I/O subsystem, Virtual Memory, Scheduler and CPU subsystem and finally the Network Subsystem.

In Linux, these would be the fs, kernel, vm and net areas. The defaults are not meant for a Datawarehouse Work Load and need to be changed.

Most if not all tunables are located under /proc and /sys. The primary method of changing a parameter is sysctl, though a simple echo into the /proc entry also works. Permanent changes require entries in the /etc/sysctl.conf file.
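As a minimal illustration (the shmmni value used here is arbitrary, not a recommendation), the same tunable can be changed in all three ways:

# change at runtime with sysctl
sysctl -w kernel.shmmni=200
# or with a plain echo into /proc
echo 200 > /proc/sys/kernel/shmmni
# persist across reboots - add the entry to /etc/sysctl.conf and reload
echo "kernel.shmmni = 200" >> /etc/sysctl.conf
sysctl -p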

2. Disabling Daemons

The daemons that would need to be disabled are

apmd, atd, arptables_if, autofs, cpuspeed, cups*, gpm, haldaemon, hpoj, irqbalance, isdn, kudzu, netfs, nfslock, pcmcia, portmap, rawdevices, rpc*, smartd, xfs

The default run-level should be 3 (no X-Windows).

Disable unwanted local terminals in inittab.

And it goes without saying – no SELinux.
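A minimal sketch of turning the above off on a RHEL 4 box with chkconfig - the service list below is only a representative subset of the daemons listed above, and should be verified against what is actually installed on your build before disabling anything:

# disable the unwanted services and stop any that are currently running
for svc in apmd atd autofs cpuspeed cups gpm haldaemon irqbalance isdn kudzu netfs nfslock pcmcia portmap rawdevices smartd xfs
do
    chkconfig $svc off
    service $svc stop
done
# default runlevel 3: set id:3:initdefault: in /etc/inittab
# SELinux off: set SELINUX=disabled in /etc/selinux/config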

3. Kernel - CPU/Shared Memory/Interrupts/Scheduler etc

Path - /proc/sys/kernel, /proc/irq

Interrupt handling - When running a RAC system on Linux, you are going to see a ton of interrupts being generated. It is best if these are handled by dedicated CPU(s).

The first step is to identify the interrupts - cat /proc/interrupts

[mkrishna@viveka] /proc$ more interrupts
            CPU0       CPU1       CPU2       CPU3
  0:    35170931   35189602   35199982   30298138   IO-APIC-edge    timer
  1:           1          1          0          1   IO-APIC-edge    i8042
  8:         117        141        118        104   IO-APIC-edge    rtc
  9:           0          0          0          0   IO-APIC-level   acpi
 12:           9          4         21          1   IO-APIC-edge    i8042
 14:           2          3     631163          2   IO-APIC-edge    ide0
 50:         496   14071456          0          0   PCI-MSI         eth4
 58:         160          0    3927374          0   PCI-MSI         eth2
 66:          46          0          0   70296944   PCI-MSI         eth5
 74:    32732541          0          0          0   PCI-MSI         eth1
169:      802517     577157     818282      11375   IO-APIC-level   lpfc,
177:          22        469    3303630         21   IO-APIC-level
193:      765410     610839     742104       3170   IO-APIC-level   lpfc,
217:      235360     285364    2045052       1314   IO-APIC-level
233:    14633072          0          0          0   PCI-MSI         eth0

Then cd to /proc/irq, where you will see a directory for each interrupt.

[mkrishna@viveka] /proc/irq$ ls

0 1 10 11 12 13 14 15 169 177 185 193 2 217 233 3 4 5 50 58 6 66 7 74 8 9 prof_cpu_mask

cd into the relevant IRQ directory and change smp_affinity to the desired CPU mask.

echo 04 > 169/smp_affinity
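A short sketch of what this looks like in practice - smp_affinity is a hexadecimal CPU bitmask (01 = CPU0, 02 = CPU1, 04 = CPU2, 08 = CPU3), and the IRQ numbers below are taken from the sample output above, so they will differ on your system. Note that irqbalance must stay disabled (see the daemon list above) or it will rewrite these masks.

echo 02 > /proc/irq/74/smp_affinity     # eth1 interrupts -> CPU1
echo 04 > /proc/irq/169/smp_affinity    # first lpfc HBA -> CPU2
echo 08 > /proc/irq/193/smp_affinity    # second lpfc HBA -> CPU3
cat /proc/irq/169/smp_affinity          # verify the new mask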

CPU Affinity and Scheduler - Soft and Hard

Normally a DW workload consists of very long-running, single-threaded processes, which run more efficiently when context switches are reduced (less CPU ping-pong). However, the new scheduler in 2.6.x is supposed to be self-tuning, so once the kernel is built you cannot change it. If you do have the option of building your own kernel, you can change some parameters –

http://josh.trancesoftware.com/linux/linux_cpu_scheduler.pdf

What is surprising is that there do not seem to be any statistics readily available on the scheduler either.
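While the scheduler itself cannot be tuned without a rebuild, process placement can still be controlled from user space with taskset. A hedged sketch follows - the PID, the CPU range and the script name are purely illustrative:

# pin an already-running load process to CPUs 4-7 to reduce ping-pong
taskset -p -c 4-7 12345
# or start the job with its affinity already set
taskset -c 4-7 ./run_nightly_load.sh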

In Kernel 2.6.23 and higher -

sched_compat_yield - I do not have any information on this.

Shared Memory, Semaphores and Message Queue settings -

There are a couple of formulas out there which allow you to set the shared memory, semaphore and message queue settings based on the number of connections, the size of physical memory and the number of instances.

The various parameters I generally set are

Shared Memory - The defaults need to be changed.

shmall - This file sets the system-wide limit on the total number of pages of shared memory. The default value is 2097152 pages, and the default page size is 4096 bytes.

This equates to 8GB of shared memory, which is surprisingly high for a default configuration.

shmmax - This file sets the limit on the maximum size of a single shared memory segment that can be created. This value defaults to 33554432 bytes (32MB). You should set this to the size of the Oracle SGA + x%, where x depends on other applications and on the need to grow the SGA. The intent is to avoid shared memory fragmentation.

Under no circumstances should it be equal to the physical memory on the system. I would suggest that shmmax not exceed 75% of physical memory.

shmmni - This file specifies the system-wide maximum number of IPC shared memory segments that can be created. The default value is 4096. I would suggest reducing it to around 200 - I would hate to see 4096 segments on my systems.

Physical Memory > shmall >= shmmax (keep in mind that shmall is specified in pages while shmmax is in bytes, so convert to the same unit before comparing)

For example -

Physical memory = 32GB

shmall = 20GB (Maximum shared memory that can be allocated)

shmmax = 18GB (Single biggest segment possible)

The point to keep in mind is that for a large Datawarehouse you would have a PGA of around 10GB, and the PGA does not use shared memory (though it can be forced to in 10g). The PGA has a tendency to run away and consume all physical memory, so you always want to keep a good buffer between shmall and the physical memory.
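Putting the example above into /etc/sysctl.conf would look roughly like this - remember that shmall is expressed in 4KB pages while shmmax is in bytes, and that these numbers must be recomputed for your own memory and SGA sizing:

kernel.shmall = 5242880         # 20GB divided by the 4096-byte page size
kernel.shmmax = 19327352832     # 18GB in bytes
kernel.shmmni = 200
# apply with: sysctl -p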

Semaphores -

All four of the semaphore limits below are set together using the /proc/sys/kernel/sem variable.

The sem file contains 4 numbers defining limits for System V IPC semaphores. These fields are, in order:

* SEMMSL - the maximum number of semaphores per semaphore set.

* SEMMNS - a system-wide limit on the number of semaphores in all semaphore sets.

* SEMOPM - the maximum number of operations that may be specified in a semop(2) call.

* SEMMNI - a system-wide limit on the maximum number of semaphore identifiers.

The default values are "250 32000 32 128".
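A quick sketch of checking and changing them - the values in the sysctl line are illustrative only, and real values must be sized from your connection and instance counts:

# current settings, in the order SEMMSL SEMMNS SEMOPM SEMMNI
cat /proc/sys/kernel/sem
# change all four at once (illustrative values)
sysctl -w kernel.sem="250 32000 100 128"
# persist by adding the equivalent line to /etc/sysctl.conf:
# kernel.sem = 250 32000 100 128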

Message Queues -

msgmax - The msgmax tunable specifies the maximum allowable size of any single message in a System V IPC message queue, in bytes. msgmax must be no larger than msgmnb (the size of a queue). The default is 8192 bytes.

msgmnb - The msgmnb tunable specifies the maximum allowable total combined size of all messages queued in a single given System V IPC message queue at any one time, in bytes. The default is 16384 bytes.

msgmnb must always be at least as large as msgmax.

msgmni - The msgmni tunable specifies the maximum number of system-wide System V IPC message queue identifiers (one per queue). The default is 16.
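As with the other IPC settings, these go into /etc/sysctl.conf. The numbers below are illustrative only and simply respect the rule that msgmnb must be at least msgmax:

kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.msgmni = 2878
# apply with: sysctl -p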

The next article will cover the Linux Virtual Memory subsystem.
