Redo Configuration for Sustained and Scalable Performance in a Real World Environment Krishna Manoharan firstname.lastname@example.org 1 Background During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted. Among other things, higher log related waits were noticed when conducting the analysis. A study was performed to reduce/eliminate log related waits. 2 The Environment Oracle 10.2.0.3 Enterprise Edition 64bit on Solaris 9 Database Size – 4 TB SunFire E4800 with 12 CPU’s, 48GB RAM and 2 IO Boards. Work Load – Numerous small transactions due to replication and large batch process through ETL. Operations in logging mode. Commits done for every row change (replication) to commits after million+ rows changes (ETL). 3 Work Load Profile (redo) of the instance (Peak Hours) # Session Session Session Session LGWR LGWR LGWR Stat commit wait requested commit wait performed redo entries generated redo size redo synch writes redo writes lgwr throughput Value 72/sec 72/sec 13326 entries/sec 9.45MB/sec 72 sync writes/sec 74 writes/sec 9.8 MB/sec 4 Objectives of the study Deliver predicted scalability by Reduce/Eliminate log related waits. Improve on log related statistics. Eliminate LGWR/ARCH as a bottleneck. Performance as measured by Improved transaction rates. Meeting pre-defined thresholds for waits and stats. Build a standard for an optimal and scalable redo/archive configuration. Performance tuning measure – not capacity planning. Limited to configuration only, no code changes. 5 Symptoms from Oracle Top log related wait statistics (Peak Hours) Existing (average wait time) Negligible Negligible 28.6 ms 11.8 ms 7.02 ms 20.82 ms 26.45 ms 0.38 ms Threshold (average wait time) Eliminate Eliminate Eliminate < 2ms < 10ms ? < 5 ms Eliminate Event latch: redo allocation latch: redo writing log buffer space log file parallel write log file sequential read log file switch completion log file sync log file switch (checkpoint incomplete) Waits/sec Negligible Negligible 3.5 waits/sec 74 waits/sec 10 waits/sec 0.5 waits/sec 72 waits/sec 0.5 waits/sec 6 Symptoms from Oracle – contd. Top log related instance statistics (During peak hours) # LGWR LGWR Session Session Session Statistic Existing Threshold redo sync response time 0.27 ms/sync write < 0.1 ms redo write response time 0.12ms/write < 0.1 ms/write redo buffer allocation retries 0.0002 retries/entry, 3.8 retries/second Eliminate redo log space requests 0.000081/entry, 1 request/second Eliminate redo log space wait time 0.12ms/wait for space < 0.1 ms/wait for space 7 Symptoms from the System – contd. Top system related statistics (During peak hours) # CPU CPU Parameter Average Run Queue Average Involuntary context switches for LGWR Before 1.2 30% 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) Threshold No Change Eliminate 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) Storage Average Redo Lun response time Storage Average File response time (redo logs) Storage Average Redo Volume Response Time 8 Existing Log Configuration (Instance) No _ parameters set. log_buffer – Default (Seen as 4MB) Redo log groups – 3 Size of members – 500M log_archive_max_processes – Default (seen as 2) Using VRTS ODM and Vxfs 9 Existing System/Storage Configuration Default scheduling class – TS and Default priorities for Oracle. Thread Affinity set to 150. Storage Foundation 4.1 MP2 for Oracle Enterprise Edition. Maxphys set to 8M (system and Vxvm) Lun Queue Depth – 30 with a max of 256/Target. All luns – Raid 1 using 72GB, 15K RPM FC Drives. Storage – Hitachi AMS1000 Dual Simultaneous Active (2Gbit) Paths to each lun. Load Balancing via vxdmp. 10 Existing Physical Log Configuration (Filesystem) Redo Members LOG01A.dbf - 500M (Primary of Group 1) LOG02A.dbf - 500M (Primary of Group 2) LOG03A.dbf - 500M (Primary of Group 3) Redo Members LOG01B.dbf - 500M (Mirror of Group 1) LOG02B.dbf - 500M (Mirror of Group 2) LOG03B.dbf - 500M (Mirror of Group 3) Filesystem 1 (vxfs) /u05/redo1 Filesystem 2 (vxfs) /u05/redo2 Raid 1 Single Lun of 66.4GB } 72GB, 15K RPM FC 72GB, 15K RPM FC 11 Log Waits - Schematic User Session Event Log File Parallel Write CPU Subsystem Event Log File Switch Completion Log Group 1 Redo Logs Log Group 2 Shared Log Buffer Private Redo Strands Archive Log Files SGA CPU Subsystem Event Log Archive IO ARCH Event Log File Sequential Read } LGWR Event Log File Sync Event Log Buffer Space IO Subsystem 12 Analysis of the symptoms LGWR related Wait - log file parallel write - “Writing redo records to the redo log files from the log buffer. The wait time is the time it takes for the I/Os to complete.” High average wait time (11.8 ms). Correlating Stats High Lun response time for the redo log filesystem. High redo sync time (0.27 ms/sync write). High redo write time (0.12 ms/write). Log redo sync writes (72 writes/sec). Higher buffer allocation retries (3.8 retries/sec). High degree of involuntary context switches for the LGWR process. 13 Analysis of the symptoms (contd.) Wait - log file parallel write (contd.) With a high degree of commits, it is more important to review the average response time/wait rather than the number of waits. Every commit write wait results in a increment to the log file parallel write event. The physical layout of the redo log filesystems show a single lun used for all the groups. Since redo log members are relatively small sized, it is common practice for the Storage/System Admin to assign a single lun which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR. A slower LGWR also results in poor commit performance (sync writes) as evidenced by correlating stats. A slower LGWR results in higher buffer allocation retries because the LGWR is unable to write the redo entries and flush the buffer to the disk fast enough to meet the session requirements. The overall run queue on the system was low, however involuntary context switching (~30%) indicated that LGWR was being switched out of the CPU before it could complete it’s task. The high lun response time for the redo log filesystem’s indicated that IO was a bottleneck. 14 Analysis of the symptoms (contd.) LGWR related Wait - log file switch completion – “Waiting for a log switch to complete” Wait - log file switch (checkpoint incomplete) – “Waiting for a log switch because the session cannot wrap into the next log. Wrapping cannot be performed because the checkpoint for that log has not completed” Large number of waits (0.5 waits/sec) with high average wait time (20.82 ms). Correlating Stats redo log space requests – 1 request/sec redo log space wait time – 0.12ms/entry 15 Analysis of the symptoms (contd.) Wait - log file switch completion (contd.) During a log file switch, redo generation is disabled. So this wait directly impacts session performance. The log members were only 500M in size and thus causing frequent log switches (every 1 minute). This will result in higher waits. The log_buffer is 4M in size and during a log switch, the log buffer is flushed to disk. If there is an IO bottleneck to the redo log files, then flushing 4M of log buffer could result in higher response times. Since the redo log groups were on the same set of Filesystems, there could possibly be a IO conflict between the checkpoint and LGWR processes when doing a log switch as shown in the wait log file switch (checkpoint incomplete). However a bigger log file can also cause slower log file switches. The impact of increasing the log member size needs to be studied with respect to the event – log file switch completion. 16 Analysis of the symptoms (contd.) ARCH related Wait - log file sequential read – “Waiting for the read from this logfile to return. This is used to read redo records from the log file – either for recovery or archival. The wait time is the time it takes to complete the physical I/O (read)” High number of waits (10 waits/sec) with high average wait time (7 ms). Correlating Stats High Lun response time for the redo log filesystem. Event – log file parallel write (high average wait time – 11.8 ms) 17 Analysis of the symptoms (contd.) Wait - log file sequential read (contd.) Small sized redo log members cause frequent log switches (1/minute). These logs need to be archived and thus indirectly impacts the event log file sequential read. Members of the redo groups were located on the same filesystems and share the same physical LUNS. This results in IO contention because the ARCH process is reading from the previous group as the LGWR is writing to the present group . This in turn impacts LGWR write performance thus resulting in increased response time for the events - log file parallel writes and log file sync waits. Poor archival performance can also indirectly impact log switches as reported in the event – log file switch (archival incomplete) and thus session performance. For 500 M log members, the average response time is on the higher side again indicating an IO contention. Since redo log members are relatively small sized, it is common practice for the Storage/System Admin to assign a single lun which is then used for all the redo filesystems. The nature of the access being sequential, this problem is multiplied in effect – especially if the lun is Raid 5. Increasing log file sizes can also cause this event to report higher wait times. 18 Analysis of the symptoms (contd.) Session related Wait - log buffer space – “Waiting for space in the log buffer because the session is writing data into the log buffer faster than LGWR can write it out” High number of log buffer space waits (3.5 waits/sec) with an average response time of 28.6 ms. Correlating Stats Event – log file parallel write (high average wait time) redo buffer allocation retries (3.8 retries/sec). 19 Analysis of the symptoms (contd.) Wait - log buffer space (contd.) This along with the high response time for the log file parallel writes wait shows a slow LGWR. The presence of higher redo log buffer allocation retries also correlate this wait. It also can mean that the default log buffer (4MB) is too small for the rate of redo generation (9.45 MB/sec). During a log switch, LGWR flushes the log_buffer to disk. So the impact of increasing the size of the log_buffer needs to be analyzed with respect to the event – log file switch completion. 20 Analysis of the symptoms (contd.) Session related Wait - log file sync – “When a user session commits, the session's redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished posting, it will post the user session. The wait time includes the writing of the log buffer and the post.” The average wait time was 26.45 ms. Correlating Stats Event – log file parallel write (high average wait time – 11.8 ms) High degree of involuntary context switches for both user session and LGWR. 21 Analysis of the symptoms (contd.) Wait - log file sync (contd.) Every commit write wait/immediate will result in an increment of the wait counter and a redo write (resulting in an increment to the log file parallel write wait counter). Rather than the number of waits, the average wait time is important for this wait event. Under ideal circumstances, the average wait time for a log file sync event should be the about the same as the average wait time for the wait – log file parallel write. If there is a difference, then it probably indicates a CPU bottleneck for the session. Higher wait times can be a result of slow LGWR as well as CPU bottleneck (evidenced by high involuntary context switches for session processes) 22 Initial Conclusions From the waits and stats, we came to following conclusions LGWR The underlying IO subsystem for the redo logs needed to be improved. The redo log members needed to be resized from 500M to a suitable size. Also increase the groups from 3 to 4. Reduce LGWR involuntary switches by addressing OS scheduling issues. 23 Initial Conclusions (contd.) ARCH Separate the redo log groups onto dedicated filesystems to prevent contention between ARCH and LGWR. Session Increase log buffer from the default to a suitable value taking into consideration impact on the event log file switch completion. 24 Final Configuration Details After 30 or so runs, we finally arrived at the below optimal configuration. Redo Filesystem configuration (to address IO issues) Striped filesystems on dedicated Raid 1 luns configured for the redo logs as shown in the next slide. Filesystem is vxfs with 8k block size. Stripe Width = 1M LGWR configuration (to address involuntary context switches) The FX scheduling class was set for the LGWR process. The CPU time quantum was increased to 1000 and the priority set to 59. # priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid
Tuesday, September 30, 2008
We resolved a number of LGWR issues and the steps we took are detailed in the presentation.
Posted by SSK at 12:29 PM