Thursday, November 6, 2008

Assessing TLB/TSB Misses and Page Faults

I was trying to explain TLB misses and page faults to my team mates when I realized I was not 100% confident about the topic myself. I spent some time reading up on it in Solaris Internals, and I also wrote to our Sun contact and got first-hand information from Sun. Below is a rather simplified description of TLB/TSB misses and page faults.

Basics

Memory is divided into page-sized chunks. The supported page sizes depend on the hardware platform and the operating system. The current UltraSPARC platform running Solaris 10 supports 8K, 64K, 512K and 4M pages. The CoolThreads servers (T2000 and newer) running Solaris 10 also support 256M pages (512K is not supported).

The Terminology - TLB, TSB, HPT, HME and TTE

When a process requests memory, only virtual memory is allocated; physical memory is not allocated yet. The first time the process accesses a page within that allocated virtual memory, a page fault occurs. As a result, a physical page (taken from the free lists) is mapped to the virtual page of the process. This mapping is created in software by the virtual memory system and stored in the Hash Page Table (HPT) in the form of HAT Mapping Entries (HMEs). A copy of the entry is also inserted into the TLB and the TSB as a Translation Table Entry (TTE).

The TLB, or Translation Lookaside Buffer, is a cache on the CPU of the most recently used virtual-to-physical mappings (TTEs). There are multiple TLBs on the CPU: the iTLB stores entries for text/libraries and the dTLB stores entries for data (heap/stack). The number of entries in either TLB is limited and depends on the CPU. For example, the UltraSPARC IV+ CPU has an iTLB that can store 512 entries and two dTLBs, each of which can store 512 entries.

Since the number of entries in the TLB is limited, there is a bigger cache of TTEs in physical RAM called the TSB (Translation Storage Buffer). Each process has its own dedicated TSB. In Solaris 10, both the default size and the maximum size a user process TSB can grow to (up to 1MB per process) can be changed. The TSB grows and shrinks as needed, and each process has two TSBs - one for 8K, 64K and 512K pages and the other for 4M pages. The maximum memory that can be allocated to all user TSBs can also be specified. Finally, an entry in the TSB requires 16 bytes, so it is easy to work out the TSB size needed to hold a given number of entries - for example, a 512KB TSB holds 32,768 entries and the 1MB maximum holds 65,536.

Page Faults

The CPU first checks the TLB for the TTE. If it is not found (a TLB miss), the TSB is checked. If the entry is not present in the TSB either (a TSB miss), the HPT is searched for the HME. If it is not present in the HPT, the result is a page fault.

A minor page fault happens when the HME is not present in the HPT but the contents of the requested page are already in physical memory. The mapping needs to be re-established in the HPT, and the TSB and TLB are reloaded with the entries. A major page fault happens when the HME is not present in the HPT and the contents of the requested page have been paged out to the swap device. The requested page has to be mapped back into a free page in physical memory and its contents copied in from swap. The entries are stored in the HPT, and the TSB and TLB are again reloaded.

Swap and Page in/Page out

Each physical memory page has a backing store, identified by a file and offset. A page-out occurs when the contents of a physical page are migrated to the backing store; a page-in is the reverse.
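Page-in and page-out activity can be watched by page type with vmstat -p, and the swap configuration with the swap command - a minimal sketch (standard Solaris 10 commands; the output itself is omitted here):

# Paging activity broken down by page type, sampled every 5 seconds:
#   epi/epo - executable page-ins/page-outs
#   api/apo - anonymous (heap/stack) page-ins/page-outs, i.e. traffic to and from swap
#   fpi/fpo - file system page-ins/page-outs
vmstat -p 5

# List the swap devices and summarize swap usage
swap -l
swap -s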
Anonymous memory (heap and stack) uses swap as the backing store. For file caching, Solaris uses the file on disk itself as the backing store. Swap is a combination of the swap device (on disk) and free physical memory.

Why and when do I need to worry about TLB/TSB misses and Page Faults?

As RAM gets cheaper, it is commonplace to see entry-level systems starting at 16GB of memory or more, for both x86 and proprietary Unix systems. With more physical memory available, a DBA configures Oracle with bigger SGA and PGA sizes to take advantage of it. While the discussion above is focused entirely on the SPARC platform, the concepts of pages, TLBs and page tables apply to all systems.

With 8K pages (Solaris) and 16GB of memory, roughly 2 million mappings are required to address all of physical memory; with 4K pages (Linux), roughly 4 million. For maximum efficiency, the relevant entries must be accessible to the CPU with minimal delay - preferably in the TLB, or at worst in the TSB. However, the number of entries the TLB can hold is limited by the hardware, and the TSB for a single user process (Solaris 10 only) can grow to a maximum of 1MB (65,536 entries), so it is limited too. It would not make sense to search the HPT on every TLB/TSB miss, as it costs CPU cycles to search the hash mappings for the required entries, and page faults must be avoided as much as possible.

From an Oracle perspective, if CPU wait is one of your top waits, you have ruled out other issues such as the number of available CPUs and CPU scheduling, and you are seeing a significant increase in page faults, then it probably makes sense to look deeper into TLB/TSB misses. As always, it pays to work on the area that can deliver the biggest impact to the customer experience. In my experience, the impact of TLB/TSB misses on an Oracle instance is sometimes over-emphasized (on Solaris platforms), so you are the best judge of whether this requires further analysis.

What do I need to measure?

Okay, so we get the idea that more RAM and bigger memory working sets mean more mappings, and that it is not possible to cache all the entries in the TLB/TSB. TLB/TSB misses, and possibly page faults, are therefore inevitable. But how do I put a price on them? How costly is a miss? How much time is spent servicing these misses? The answer lies in using trapstat to check the percentage of time the CPUs spend servicing TLB/TSB misses. Unfortunately, the tool does not estimate the time spent servicing major/minor faults. To count page faults, use vmstat or kstat.

How do I measure and analyze the impact?

Running trapstat -T shows TLB/TSB misses broken down by page size. trapstat needs to be run as root. As you can see below, it shows the percentage of time (%tim) spent in user mode (u) and kernel mode (k), for both TLB and TSB misses, per page size.
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
[per-CPU rows for 8k, 64k, 512k and 4m pages, in user (u) and kernel (k) mode, omitted]
==========+===============================+===============================+====
      ttl |      619  0.0         0  0.0  |     4137  0.0       300  0.0  | 0.0

The last line gives the overall statistics for all the CPUs. If you are seeing around 20% or more time (%tim) spent servicing TLB/TSB misses, it probably makes sense to revisit the page sizing for your instance.

Page faults can be observed through vmstat (minor faults), vmstat -s (major and minor) and kstat (major and minor). The statistics from vmstat -s and kstat (which reports per CPU) are cumulative.

mkrishna@OCPD:> vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s6 s9   in   sy   cs us sy id
 0 0 0 41966192 34216352 0 5063 0 0 0 0 0  0  0  0  0   559  794  811  7  1 92
 0 0 0 41925744 34175896 0 4995 0 0 0 0 0  0  0  0  0   517  767  745  7  1 92

mkrishna@OCPD:> vmstat -s
        0 micro (hat) faults
2377024933 minor (as) faults
 16504390 major faults

mkrishna@OCPD:> kstat | egrep 'as_fault|maj_fault'
        as_fault                        142199182
        maj_fault                       984358

A dTSB miss results in a search of the HPT for the relevant HME. If the entry is not found in the HPT, the result is a page fault. So perhaps a percentage of the time spent on dTSB misses can be assumed to be spent servicing page faults (minor and major)? I do not know for sure and could not find out from Sun either.

Since there will always be a page fault when a virtual memory page is accessed for the first time, page faults cannot be eliminated completely. By definition, major page faults are bad; minor page faults are better than major page faults but still need to be avoided. Ideally, minor faults should far outnumber major faults - in well configured environments I have seen the ratio of major to minor faults stay below 0.5%. Major faults can occur when there is a memory shortage and heavy page-out/swap-out activity. I have also seen a higher number of major faults when there is extensive file system data caching, or double buffering, happening on Oracle databases.

How do I reduce TLB/TSB misses and Page Faults from an Oracle perspective?

Theoretically, to reduce the incidence of TLB/TSB misses and page faults, one would use bigger pages to reduce the number of entries required to map a segment, and use an optimally sized TSB to prevent TSB misses (the TLB being fixed in size). This assumes you have configured the instance correctly to fit within the available physical memory. The below is a practical way to implement it.

1. Reduce thread migrations (harden affinity to CPUs) - Thread affinity ensures a thread is executed on the same CPU as before, which improves the chances that the entries for the running thread are already present in that CPU's TLB. Thread migrations can be seen using mpstat (the migr column). Thread affinity is controlled by the system parameter rechoose_interval. The default value of rechoose_interval is 3; for a data warehouse system I normally set it to 150. (A sketch of checking migrations and setting this parameter follows the list.)

2. Oracle Shared Memory - Oracle uses shared memory (SGA) and private anonymous memory (PGA). On Solaris, Oracle uses ISM for shared memory.
ISM, along with its other benefits, enables the use of 4M pages, so the SGA already uses the biggest page size available on the UltraSPARC IV+ platform running Solaris 10. Also, processes sharing the same segments share the TSB. So by default, when using ISM for the SGA, Oracle is already well optimized for minimal TLB/TSB misses. On the CoolThreads platform (Solaris 10), a mix of 256M and 4M pages is used for ISM segments, which is even better optimized. (The pmap sketch after this list shows one way to verify the page sizes actually in use.)

3. Oracle PGA - For the PGA, or private memory, the page size is controlled by the parameter _realfree_heap_pagesize_hint (10g). The default value is 64K, so it should use a 64K page size; however, that does not appear to be the case - I have observed that when set to 64K it uses 8K pages only. Setting it to 512K or 4M does change the page size used for the PGA to 512K or 4M. Setting this parameter causes memory to be allocated in _realfree_heap_pagesize_hint-sized chunks (64K, 512K, 4M), which can potentially waste memory and starve other sessions/applications of physical memory. Setting it to 512K/4M also reduces page faults considerably.

4. TSB Configuration - Increase the default startup TSB size (Solaris 10) to prevent TSB misses. One entry in the TSB requires 16 bytes, so you can set the default TSB size according to your memory allocation for the SGA and PGA. Each process can have up to two TSBs, with one of them dedicated to servicing 4M page entries. Several configuration parameters can be set in /etc/system (an example excerpt appears after this list).

a. default_tsb_size - The default value is 0 (8KB); an 8KB TSB holds 512 entries. For Oracle, you have to consider both PGA and SGA usage. Assume you have configured a 12GB SGA (using ISM with 4M pages as the default) and a 6GB PGA (using a 4M page size). 12GB of SGA requires 3072 entries, or a 48KB TSB. 6GB of PGA results in a global memory bound of roughly 700MB for serial operations (175 pages of 4M each) or roughly 2100MB for parallel operations (525 pages of 4M each). In this case, a default_tsb_size of 8KB would be too small and would be resized frequently; a default size of 32KB (default_tsb_size = 2), which can then grow as needed (to a maximum of 1MB), would be preferable. The drawback of a bigger default size is that it consumes physical memory, although total usage is capped by tsb_alloc_hiwater_factor.

b. tsb_alloc_hiwater_factor - The default is 32. This setting ensures that total TSB usage by user processes does not exceed 1/32 of physical memory; with 32GB of memory, total TSB usage is capped at 1GB. If you have memory to spare and expect a high number of long-lived sessions connecting to the instance, this can be reduced.

c. tsb_rss_factor - The default is 384. tsb_rss_factor/512 is the utilization threshold beyond which the TSB is resized; the default corresponds to 75% (384/512). It probably makes sense to reduce this to 308 so that the TSB is resized at 60% utilization.

d. tsb_sectsb_threshold - In Solaris 10, each process can have up to two TSBs - one for 8K, 64K and 512K pages and one for 4M pages. This setting controls how many 4M mappings a process must have before the second TSB for 4M pages is initialized. It varies by CPU; on UltraSPARC IV the default is 8 pages.

5. To reduce page faults from user sessions, change _realfree_heap_pagesize_hint from 64K to either 512K or 4M. Also use ODM or direct I/O and avoid file system buffering for Oracle data files (see the direct I/O note after this list).
6. Also ensure that the memory requirements of Oracle can be met entirely within the physical memory.
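Below are a few illustrative sketches for the items above; treat them as starting points, not drop-in settings. For item 1, thread migrations show up in the migr column of mpstat, and rechoose_interval is a system tunable set in /etc/system (the value 150 is the data warehouse setting mentioned above):

# Watch cross-CPU thread migrations (the migr column), sampling every 5 seconds
mpstat 5

The corresponding /etc/system entry (takes effect after a reboot) would be:

* Harden thread affinity - the default is 3
set rechoose_interval = 150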
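For items 2 and 3, the page sizes actually in use by an Oracle server process can be verified with pmap -xs, which reports the page size of each mapping; the PID below is a placeholder for whichever shadow or background process you want to inspect:

# Show the page size of each mapping of a process. ISM/SGA segments should show
# 4M (or 256M on CoolThreads), and PGA heap mappings should reflect
# _realfree_heap_pagesize_hint.
pmap -xs <pid> | more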
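For item 4, an illustrative /etc/system excerpt using the values discussed above (a 32KB default TSB and resizing at roughly 60% utilization); tsb_alloc_hiwater_factor is best left at its default unless you have memory to spare:

* Start each user TSB at 32KB instead of 8KB (default_tsb_size = 0 means 8KB)
set default_tsb_size = 2
* Resize a TSB at roughly 60% utilization instead of 75% (308/512)
set tsb_rss_factor = 308
* Optional: allow all user TSBs to use up to 1/16 of physical memory instead of 1/32
* set tsb_alloc_hiwater_factor = 16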
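For item 5, one hedged way to avoid file system buffering, assuming UFS data files and Oracle 10g; the device and mount point below are placeholders, and ODM already bypasses the file system cache:

# Mount the file system holding the Oracle data files with direct I/O
mount -o forcedirectio /dev/dsk/<device> /u01/oradata

# Or let Oracle request direct and asynchronous I/O itself (init.ora/spfile parameter):
#   filesystemio_options = SETALL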