My earlier article covered tuning the CPU subsystem in Linux to meet data warehouse requirements. This article continues that discussion and covers the Virtual Memory subsystem.
This is by far one of the most important subsystems on Linux. Unfortunately, with every release of the kernel, you have a dozen new tunables and a couple of old ones gone, with little or no documentation whatsoever. The aim is to use the VM only for memory management and never have to use the file-system cache. Oracle should handle the buffering (Buffer Cache) and use Direct + Async I/O for all disk activity. We should never see swapping as shown in vmstat -s.
The path for the tunables is /proc/sys/vm
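Before changing anything, it helps to record the current values. The following is a minimal sketch, assuming a Linux host where these files are readable; the helper name read_vm_tunables and the list of names passed to it are illustrative choices, not part of any tool.

```python
from pathlib import Path


def read_vm_tunables(names, base="/proc/sys/vm"):
    """Return {name: value} for each tunable; value is None if the file is absent."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # The tunable does not exist on this kernel, or is not readable.
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = ["swappiness", "overcommit_memory", "min_free_kbytes", "page-cluster"]
    for name, value in read_vm_tunables(wanted).items():
        print(f"{name} = {value if value is not None else '(not present)'}")
```

Because tunables come and go between kernel releases, the sketch reports a missing file instead of failing.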
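swappiness - Controls how aggressively the kernel swaps out process memory instead of reclaiming page cache. Higher values favor swapping and so leave more memory to the page cache.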
For a database such as Oracle using ODM, you do not want to use the page cache at all. ODM is Async + Direct I/O and Oracle does its own buffering internally, so you can very well avoid unnecessary page cache. Not only does the page cache use up memory, it also costs kernel time to reclaim it (including writing dirty pages back to disk) when other processes require the memory.
The default is 60, which is on the higher side. I would suggest setting it to a much lower value.
overcommit_memory – Make sure it is set to 0 (the default, heuristic overcommit mode).
min_free_kbytes – The amount of free memory, in kilobytes, that the kernel keeps in reserve. I suggest keeping it at 512-1024MB minimum for a 32GB system (that is, a value of 524288 to 1048576).
lower_zone_protection – This is to prevent Linux from dying with an OOM (Out of Memory) error. It is a good idea to set it to around 200MB. Documentation is quite vague (not surprising) as to whether the value is in pages or in MB. On a 32GB system, 200 would seem to be an optimal value.
page-cluster – It sets, as a power of two, the number of consecutive pages the kernel reads in from swap in one go in the event of swapping (which we do not ever want to see). Since modern disk subsystems can easily handle large requests, I would suggest raising it from the default of 3 (8 pages) to around 8 (256 pages). Each page on an x86_64 system is 4K in size.
dirty_ratio and max_queue_depth - Need to do more cause and effect study on these.
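The arithmetic behind those page counts can be sketched as follows. The function name swap_cluster is made up for illustration, and the 4K page size is the x86_64 figure quoted above.

```python
PAGE_SIZE = 4096  # bytes; 4K pages on x86_64


def swap_cluster(page_cluster, page_size=PAGE_SIZE):
    """Return (pages, bytes) handled per swap I/O for a page-cluster value."""
    pages = 2 ** page_cluster  # the tunable is a base-2 logarithm
    return pages, pages * page_size


for value in (3, 8):
    pages, size = swap_cluster(value)
    print(f"page-cluster={value}: {pages} pages, {size // 1024} KB")
```

So the default of 3 works out to 32 KB per request, and a setting of 8 to 1 MB per request.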
TLB and Huge Pages
Huge Pages on Linux are somewhat similar to the functionality offered by (D)ISM and MPSS on Solaris. Huge Pages cannot be paged out to disk. They improve efficiency and reduce TLB misses.
hugetlb_shm_group – Set this to the numeric group ID (GID) of the Oracle group (dba or oinstall) so that members of that group can create shared memory segments backed by huge pages.
nr_hugepages – Number of huge pages to allocate. The default size of a huge page is 2MB on x86_64.
nr_hugepages = shmmax / (2*1024*1024)
So for 18GB of shmmax, you would set nr_hugepages to 9216.
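The same calculation as a small sketch, assuming the 2MB huge page size mentioned above. The function name hugepages_for is illustrative, and the rounding up is my own addition so that a shmmax which is not an exact multiple of 2MB still fits.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # bytes; 2MB default on x86_64


def hugepages_for(shmmax_bytes, huge_page_size=HUGE_PAGE_SIZE):
    """Number of huge pages needed to hold shmmax_bytes, rounded up."""
    return -(-shmmax_bytes // huge_page_size)  # ceiling division


shmmax = 18 * 1024 ** 3  # 18GB
print(hugepages_for(shmmax))  # 9216
```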
The important thing to note is that huge pages do not get paged to disk, so if they are configured incorrectly, you will run into problems. You can also see ORA-04030 errors if you are running very large queries and have a runaway PGA, since memory reserved for huge pages is not available to the PGA.
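One way to confirm that the pool was actually allocated is to look at the huge page counters in /proc/meminfo (a file not discussed above, so treat this as a suggestion). Below is a minimal sketch; the parser name parse_hugepage_info is hypothetical.

```python
def parse_hugepage_info(meminfo_text):
    """Extract the huge page counters from /proc/meminfo-style text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and key.startswith(("HugePages_", "Hugepagesize")):
            info[key] = int(fields[0])  # ignore a trailing unit such as "kB"
    return info


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(parse_hugepage_info(f.read()))
    except OSError:
        print("/proc/meminfo is not available on this system")
```

If HugePages_Total is lower than the nr_hugepages you asked for, the kernel could not find enough contiguous free memory to build the pool.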
So to summarize,
Physical Memory > shmall > (shmmax = nr_hugepages * 2MB), with every quantity compared in bytes.
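The summary can be turned into a quick sanity check. This is only a sketch of my reading of the rule: it assumes kernel.shmall is expressed in 4K pages and kernel.shmmax in bytes, treats "shmall at least as large as shmmax" as acceptable, and uses a made-up function name and illustrative sizes.

```python
def check_memory_layout(phys_bytes, shmall_pages, shmmax_bytes, nr_hugepages,
                        page_size=4096, huge_page_size=2 * 1024 * 1024):
    """Return a list of problems; an empty list means the ordering holds."""
    shmall_bytes = shmall_pages * page_size  # shmall is counted in pages
    hugepage_bytes = nr_hugepages * huge_page_size
    problems = []
    if phys_bytes <= shmall_bytes:
        problems.append("shmall is not smaller than physical memory")
    if shmall_bytes < shmmax_bytes:
        problems.append("shmall is smaller than shmmax")
    if hugepage_bytes < shmmax_bytes:
        problems.append("huge pages do not cover shmmax")
    return problems


GB = 1024 ** 3
# Illustrative values only: 32GB RAM, 20GB shmall, 18GB shmmax, 9216 huge pages.
print(check_memory_layout(32 * GB, 20 * GB // 4096, 18 * GB, 9216))  # []
```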