Tuesday, September 30, 2008

Redo configuration for a High Transaction Environment

We resolved a number of LGWR issues, and the steps we took are detailed in the presentation below.


Redo Configuration for Sustained and Scalable Performance in a Real World Environment
Krishna Manoharan
krishmanoh@gmail.com

Background

During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted. Among other things, higher log related waits were observed during the analysis. A study was performed to reduce or eliminate the log related waits.

The Environment

- Oracle 10.2.0.3 Enterprise Edition 64-bit on Solaris 9
- Database size – 4 TB
- SunFire E4800 with 12 CPUs, 48 GB RAM and 2 IO boards
- Work load – numerous small transactions from replication plus large batch processing through ETL; all operations in logging mode
- Commit frequency ranges from a commit for every row change (replication) to commits after a million+ row changes (ETL)

Work Load Profile (redo) of the instance (Peak Hours)

  Source    Stat                      Value
  Session   commit wait requested     72/sec
  Session   commit wait performed     72/sec
  Session   redo entries generated    13326 entries/sec
  Session   redo size                 9.45 MB/sec
  LGWR      redo synch writes         72 sync writes/sec
  LGWR      redo writes               74 writes/sec
  LGWR      lgwr throughput           9.8 MB/sec

Objectives of the study

Deliver the predicted scalability by:
- Reducing/eliminating log related waits
- Improving the log related statistics
- Eliminating LGWR/ARCH as a bottleneck

Performance as measured by:
- Improved transaction rates
- Meeting pre-defined thresholds for waits and stats

Build a standard for an optimal and scalable redo/archive configuration. This was a performance tuning measure, not capacity planning, and was limited to configuration changes only (no code changes).

Symptoms from Oracle

Top log related wait statistics (Peak Hours)

  Event                                      Waits/sec        Existing (avg wait)   Threshold (avg wait)
  latch: redo allocation                     Negligible       Negligible            Eliminate
  latch: redo writing                        Negligible       Negligible            Eliminate
  log buffer space                           3.5 waits/sec    28.6 ms               Eliminate
  log file parallel write                    74 waits/sec     11.8 ms               < 2 ms
  log file sequential read                   10 waits/sec     7.02 ms               < 10 ms
  log file switch completion                 0.5 waits/sec    20.82 ms              ?
  log file sync                              72 waits/sec     26.45 ms              < 5 ms
  log file switch (checkpoint incomplete)    0.5 waits/sec    0.38 ms               Eliminate

Symptoms from Oracle – contd.

Top log related instance statistics (Peak Hours)

  Source    Statistic                        Existing                                     Threshold
  LGWR      redo sync response time          0.27 ms/sync write                           < 0.1 ms
  LGWR      redo write response time         0.12 ms/write                                < 0.1 ms/write
  Session   redo buffer allocation retries   0.0002 retries/entry, 3.8 retries/second     Eliminate
  Session   redo log space requests          0.000081 requests/entry, 1 request/second    Eliminate
  Session   redo log space wait time         0.12 ms/wait for space                       < 0.1 ms/wait for space

Symptoms from the System

Top system related statistics (Peak Hours)

  Source    Parameter                                        Before                          Threshold
  CPU       Average run queue                                1.2                             No change
  CPU       Average involuntary context switches for LGWR    30%                             Eliminate
  Storage   Average redo lun response time                   8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
  Storage   Average file response time (redo logs)           8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
  Storage   Average redo volume response time                8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)

Existing Log Configuration (Instance)

- No _ parameters set
- log_buffer – default (seen as 4MB)
- Redo log groups – 3
- Size of members – 500M
- log_archive_max_processes – default (seen as 2)
- Using VRTS ODM and vxfs
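For reference, the instance side of the existing configuration can be confirmed from SQL*Plus with a few standard queries (a quick sketch, not part of the original slides):

  select group#, members, bytes/1024/1024 as mb, status from v$log;
  select group#, member from v$logfile order by group#;
  show parameter log_buffer
  show parameter log_archive_max_processes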
Existing System/Storage Configuration

- Default scheduling class (TS) and default priorities for Oracle
- Thread affinity set to 150
- Storage Foundation 4.1 MP2 for Oracle Enterprise Edition
- maxphys set to 8M (system and VxVM)
- Lun queue depth – 30, with a maximum of 256 per target
- All luns – Raid 1 using 72GB, 15K RPM FC drives
- Storage – Hitachi AMS1000
- Dual simultaneous active (2 Gbit) paths to each lun, load balancing via vxdmp

Existing Physical Log Configuration (Filesystem)

  Filesystem 1 (vxfs) /u05/redo1              Filesystem 2 (vxfs) /u05/redo2
    LOG01A.dbf - 500M (Primary of Group 1)      LOG01B.dbf - 500M (Mirror of Group 1)
    LOG02A.dbf - 500M (Primary of Group 2)      LOG02B.dbf - 500M (Mirror of Group 2)
    LOG03A.dbf - 500M (Primary of Group 3)      LOG03B.dbf - 500M (Mirror of Group 3)

Both filesystems share a single Raid 1 lun of 66.4GB built from two 72GB, 15K RPM FC drives.

Log Waits - Schematic

(The original slide shows a schematic of the redo flow: user sessions write into the shared log buffer and private redo strands in the SGA, LGWR writes the log buffer to the redo log groups, and ARCH reads the previous group to produce the archive log files. The waits mapped onto this flow are log buffer space, log file sync, log file parallel write, log file switch completion, log file sequential read and log archive IO, spread across the CPU and IO subsystems.)

Analysis of the symptoms

LGWR related

Wait - log file parallel write – "Writing redo records to the redo log files from the log buffer. The wait time is the time it takes for the I/Os to complete."

High average wait time (11.8 ms).

Correlating stats:
- High lun response time for the redo log filesystems
- High redo sync time (0.27 ms/sync write)
- High redo write time (0.12 ms/write)
- High number of redo sync writes (72 sync writes/sec)
- Higher buffer allocation retries (3.8 retries/sec)
- High degree of involuntary context switches for the LGWR process

Analysis of the symptoms (contd.)

Wait - log file parallel write (contd.)

With a high rate of commits, it is more important to review the average response time per wait than the number of waits; every commit write wait results in an increment to the log file parallel write event.

The physical layout of the redo log filesystems shows a single lun used for all the groups. Since redo log members are relatively small, it is common practice for the storage/system admin to assign a single lun which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR.

A slower LGWR also results in poor commit performance (sync writes), as evidenced by the correlating stats, and in higher buffer allocation retries, because LGWR is unable to write the redo entries and flush the buffer to disk fast enough to meet the session requirements.

The overall run queue on the system was low; however, involuntary context switching (~30%) indicated that LGWR was being switched off the CPU before it could complete its task. The high lun response time for the redo log filesystems indicated that IO was a bottleneck.

Analysis of the symptoms (contd.)

LGWR related

Wait - log file switch completion – "Waiting for a log switch to complete."
Wait - log file switch (checkpoint incomplete) – "Waiting for a log switch because the session cannot wrap into the next log. Wrapping cannot be performed because the checkpoint for that log has not completed."

Large number of waits (0.5 waits/sec) with a high average wait time (20.82 ms).

Correlating stats:
- redo log space requests – 1 request/sec
- redo log space wait time – 0.12 ms/wait for space
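The log switch frequency called out in the next section can be confirmed from v$log_history; a generic check (not from the original deck) that counts switches per hour looks like this:

  select to_char(first_time, 'YYYY-MM-DD HH24') as hour,
         count(*) as log_switches
  from   v$log_history
  group  by to_char(first_time, 'YYYY-MM-DD HH24')
  order  by 1;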
Analysis of the symptoms (contd.)

Wait - log file switch completion (contd.)

During a log file switch, redo generation is disabled, so this wait directly impacts session performance. The log members were only 500M in size, causing frequent log switches (every 1 minute), which results in higher waits.

The log_buffer is 4M in size, and during a log switch the log buffer is flushed to disk. If there is an IO bottleneck to the redo log files, flushing 4M of log buffer can result in higher response times.

Since the redo log groups were on the same set of filesystems, there could also be an IO conflict between the checkpoint and LGWR processes during a log switch, as shown by the wait log file switch (checkpoint incomplete).

However, a bigger log file can also cause slower log file switches, so the impact of increasing the log member size needs to be studied with respect to the event log file switch completion.

Analysis of the symptoms (contd.)

ARCH related

Wait - log file sequential read – "Waiting for the read from this logfile to return. This is used to read redo records from the log file – either for recovery or archival. The wait time is the time it takes to complete the physical I/O (read)."

High number of waits (10 waits/sec) with a high average wait time (7 ms).

Correlating stats:
- High lun response time for the redo log filesystems
- Event log file parallel write (high average wait time – 11.8 ms)

Analysis of the symptoms (contd.)

Wait - log file sequential read (contd.)

Small redo log members cause frequent log switches (1/minute). These logs need to be archived, which indirectly drives the event log file sequential read.

Members of the redo groups were located on the same filesystems and share the same physical luns. This results in IO contention because the ARCH process is reading from the previous group while LGWR is writing to the current group. This in turn impacts LGWR write performance, increasing the response times for the events log file parallel write and log file sync. Poor archival performance can also indirectly impact log switches, as reported in the event log file switch (archival incomplete), and thus session performance.

For 500M log members, the average response time is on the higher side, again indicating IO contention. Since redo log members are relatively small, it is common practice for the storage/system admin to assign a single lun which is then used for all the redo filesystems. Because the access pattern is sequential, this problem is multiplied in effect, especially if the lun is Raid 5. Increasing the log file size can also cause this event to report higher wait times.

Analysis of the symptoms (contd.)

Session related

Wait - log buffer space – "Waiting for space in the log buffer because the session is writing data into the log buffer faster than LGWR can write it out."

High number of log buffer space waits (3.5 waits/sec) with an average response time of 28.6 ms.

Correlating stats:
- Event log file parallel write (high average wait time)
- redo buffer allocation retries (3.8 retries/sec)

Analysis of the symptoms (contd.)

Wait - log buffer space (contd.)

This, along with the high response time for the log file parallel write wait, shows a slow LGWR. The presence of higher redo buffer allocation retries also correlates with this wait. It can also mean that the default log buffer (4MB) is too small for the rate of redo generation (9.45 MB/sec).

During a log switch, LGWR flushes the log_buffer to disk, so the impact of increasing the size of the log_buffer needs to be analyzed with respect to the event log file switch completion.
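The redo generation rate and the retry figures used in this analysis come from the instance statistics; they can be sampled with a query such as the following (values are cumulative since startup, so take the delta between two snapshots for per-second rates):

  select name, value
  from   v$sysstat
  where  name in ('redo size', 'redo entries',
                  'redo buffer allocation retries',
                  'redo log space requests',
                  'redo log space wait time');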
Analysis of the symptoms (contd.)

Session related

Wait - log file sync – "When a user session commits, the session's redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished posting, it will post the user session. The wait time includes the writing of the log buffer and the post."

The average wait time was 26.45 ms.

Correlating stats:
- Event log file parallel write (high average wait time – 11.8 ms)
- High degree of involuntary context switches for both the user sessions and LGWR

Analysis of the symptoms (contd.)

Wait - log file sync (contd.)

Every commit write wait/immediate increments this wait counter and triggers a redo write (which in turn increments the log file parallel write wait counter). For this event, the average wait time matters more than the number of waits. Under ideal circumstances, the average wait time for a log file sync should be about the same as the average wait time for log file parallel write; a difference probably indicates a CPU bottleneck for the session. Higher wait times can therefore be the result of a slow LGWR as well as of a CPU bottleneck (evidenced by the high involuntary context switches for session processes).

Initial Conclusions

From the waits and stats, we came to the following conclusions.

LGWR
- The underlying IO subsystem for the redo logs needed to be improved.
- The redo log members needed to be resized from 500M to a suitable size, and the number of groups increased from 3 to 4.
- Reduce LGWR involuntary context switches by addressing OS scheduling issues.

Initial Conclusions (contd.)

ARCH
- Separate the redo log groups onto dedicated filesystems to prevent contention between ARCH and LGWR.

Session
- Increase the log buffer from the default to a suitable value, taking into consideration the impact on the event log file switch completion.

Final Configuration Details

After 30 or so runs, we finally arrived at the optimal configuration below.

Redo filesystem configuration (to address the IO issues)
- Striped filesystems on dedicated Raid 1 luns configured for the redo logs, as shown in the layout below (a command-level sketch follows the layout).
- Filesystem is vxfs with an 8k block size.
- Stripe width = 1M.

LGWR configuration (to address the involuntary context switches)
- The FX scheduling class was set for the LGWR process, the CPU time quantum was increased to 1000 and the priority set to 59.

  # priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid

- Thread affinity was set to 150 for the entire system; however, we decided it was best to bind LGWR to a specific CPU via a processor set.

  # psrset -c
  # psrset -b 1

New Filesystem Layout and redo group placement

  Filesystem 1 (vxfs, stripe, 1M stripe width) /u05/redo1
    LOG01A.dbf - 1500M (Primary of Group 1)
    LOG03A.dbf - 1500M (Primary of Group 3)

  Filesystem 2 (vxfs, stripe, 1M stripe width) /u05/redo2
    LOG02B.dbf - 1500M (Mirror of Group 2)
    LOG04B.dbf - 1500M (Mirror of Group 4)

  Filesystem 3 (vxfs, stripe, 1M stripe width) /u05/redo3
    LOG02A.dbf - 1500M (Primary of Group 2)
    LOG04A.dbf - 1500M (Primary of Group 4)

  Filesystem 4 (vxfs, stripe, 1M stripe width) /u05/redo4
    LOG01B.dbf - 1500M (Mirror of Group 1)
    LOG03B.dbf - 1500M (Mirror of Group 3)

Each filesystem is striped over dedicated Raid 1 luns of 66.4GB, each built from two 72GB, 15K RPM FC drives.
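As an illustration of how one of these striped redo filesystems can be built with Storage Foundation (the disk group name, volume name and volume size below are placeholders, not the actual values from this environment):

  # vxassist -g redodg make redovol1 20g layout=stripe ncol=2 stripeunit=1m
  # mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/redodg/redovol1
  # mount -F vxfs /dev/vx/dsk/redodg/redovol1 /u05/redo1

The 1m stripe unit matches the 1M stripe width described above, and bsize=8192 gives the 8k filesystem block size.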
Final Configuration Details (contd.)

Redo groups
- 4 redo groups configured with 2 members each.
- The log members were placed on the redo filesystems in such a manner as to eliminate LGWR and ARCH IO contention.
- Each member was 8G in size. 8G log members reduce the log switches from 1 switch per minute to 1 switch every 7 minutes, and reducing log switches improves performance because redo generation is disabled during a log switch.
- 8G was an ideal size – log archiving completed within 2 minutes, whereas log switches happened every 7 minutes.
- Increasing the log member size resulted in higher wait times for the events log file switch completion and log file sequential read; however, the overall performance gain was well worth it.

Final Configuration Details (contd.)

Session
- The log buffer was set to 72M (after several iterations).
- A 72M log buffer along with 8G log file members resulted in a higher response time for the event log file switch completion; however, we completely eliminated the wait event log buffer space (even when simulating 1.5x load). 72M appears to be an ideal size for a redo generation rate of up to 14 MB/sec.
- _log_io_size is capped at a maximum of 1M irrespective of the log_buffer size once the log_buffer crosses ~6MB. Since our storage subsystem was quite capable of handling up to 32M in a single write within an acceptable response time, we did not downsize _log_io_size.

Final Configuration Details (contd.)

Session (contd.)
- Improving the LGWR write performance, however, resulted in redo allocation latch contention. To reduce it, we increased the parallelism for the shared redo buffer from the default of 2 to 12:

  _log_parallelism_max = 12   # Default is 2. Max - limited to CPU count
  _log_parallelism = 4        # Default is 1

- With log parallelism enabled, the shared log buffer is split into _log_parallelism_max sections, each protected by its own redo allocation latch. As per Oracle documentation, the redo allocation latch for the shared log buffer is randomly assigned to the requestor and then allocated round-robin. We did notice that this was not an optimal way of assignment. (Statements for these settings are sketched below, after the ARCH notes.)

Final Configuration Details (contd.)

ARCH
- The ARCH process reads OS-sized blocks as set by the _log_archive_buffer_size parameter. The default and maximum value on Solaris with Oracle 10g is 2048 OS blocks (which equates to 1MB reads). So the archive log filesystem was also created as a striped filesystem with a 1MB stripe width.
- Performance improved because the redo log filesystems and the archive filesystem were both striped with a 1MB stripe width. Average ARCH throughput was around 150 MB/sec.
- However, we noticed that the ARCH process reads from the primary group member only; it does not read simultaneously from both members.
- We did not change log_archive_max_processes from the default (2).
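The instance-side changes above (8G members, 72M log buffer, the parallelism parameters) translate to statements along these lines; the group number and member paths are illustrative only:

  -- add new 8G groups, then drop the old 500M groups once they are inactive
  alter database add logfile group 5
    ('/u05/redo1/LOG05A.dbf', '/u05/redo4/LOG05B.dbf') size 8192m;
  alter database drop logfile group 1;

  -- static parameters: scope=spfile and an instance restart are required
  alter system set log_buffer = 75497472 scope=spfile;          -- 72M
  alter system set "_log_parallelism_max" = 12 scope=spfile;
  alter system set "_log_parallelism" = 4 scope=spfile;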
Final Results

Peak work load showed an improvement of 7x; the least improvement was 4x. At 1.5x load, the scalability was near linear.

The results – Work Load Profile (redo)

  Source    Stat                      Before               After
  Session   commit wait requested     72/sec               520/sec
  Session   commit wait performed     72/sec               520/sec
  Session   redo entries generated    13326 entries/sec    14677 entries/sec
  Session   redo size                 9.45 MB/sec          10.1 MB/sec
  LGWR      redo synch writes         72 sync writes/sec   520 sync writes/sec
  LGWR      redo writes               74 writes/sec        845 writes/sec
  LGWR      lgwr throughput           9.8 MB/sec           10.75 MB/sec

The results – Waits

  Event                                      Waits/sec (before)   Avg wait (before)   Waits/sec (after)   Avg wait (after)   Target (avg wait)
  latch: redo allocation                     Negligible           Negligible          0.002 waits/sec     0.9 ms             Eliminate
  latch: redo writing                        Negligible           Negligible          0 waits/sec         0 ms               Eliminate
  log buffer space                           3.5 waits/sec        28.6 ms             0 waits/sec         0 ms               Eliminate
  log file parallel write                    74 waits/sec         11.8 ms             845 waits/sec       0.55 ms            < 2 ms
  log file sequential read                   10 waits/sec         7.02 ms             10.5 waits/sec      16.62 ms           ?
  log file switch completion                 0.5 waits/sec        20.82 ms            0.02 waits/sec      31.5 ms            ?
  log file sync                              72 waits/sec         26.45 ms            519 waits/sec       2.13 ms            < 5 ms
  log file switch (checkpoint incomplete)    0.5 waits/sec        0.38 ms             0 waits/sec         0 ms               Eliminate

The results – Stats

(The instance statistics table from this slide is not reproduced here.)

The results – System

  Source    Parameter                                        Before                          After                             Threshold
  CPU       Average run queue                                1.2                             1.2                               No change
  CPU       Average involuntary context switches for LGWR    30%                             < 0.1%                            Eliminate
  Storage   Average redo lun response time                   8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
  Storage   Average file response time (redo logs)           8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
  Storage   Average redo volume response time                8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)

Final Thoughts

In order of biggest impact on performance (descending):
1. IO subsystem (50%)
2. Redo group layout and sizing of log file members (20%)
3. CPU scheduling (15%)
4. Log buffer (10%)
5. Log parallelism (5%)

The LGWR process in 10g is incredibly efficient and requires minimal tuning; however, it would have been ideal if there were a dedicated LGWR for each shared strand. One can only imagine the performance gain with multiple LGWR processes, each servicing a distinct log buffer.
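The per-event averages in the tables above can be reproduced on any instance from v$system_event (the values are cumulative since startup, so before/after comparisons should use deltas between snapshots):

  select event, total_waits,
         round(time_waited_micro / 1000 / greatest(total_waits, 1), 2) as avg_wait_ms
  from   v$system_event
  where  event like 'log file%'
  or     event like 'latch: redo%'
  order  by time_waited_micro desc;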