Thursday, August 7, 2008

Oracle Database Capacity Planning

I do capacity planning for the EBI infrastructure at the place I work and created a presentation to share the methodology with my team.

Read this document on Scribd: Oracle Database - Capacity Planningv3

Oracle Database – Capacity Planning
Krishna Manoharan
krishmanoh@gmail.com

Introduction – Capacity Planning

Capacity planning is essential to deliver a predetermined, consistent user experience throughout the lifecycle of a solution.

Capacity planning means identifying the changes (from a capacity perspective only) that must be made to the environment to maintain this predefined user experience over the lifecycle of the solution.

In the simplest terms, these changes can mean adding CPU, memory, storage or network capacity, along with suitable configuration changes to the application (grid, version upgrades, 64-bit vs 32-bit, etc.), as and when they are found to be required.

Capacity Planning or Performance Tuning?

- Capacity planning is proactive, whereas performance tuning is mostly reactive.
- Capacity planning is anticipating demand ahead of time and recommending suitable changes to the environment.
- Capacity planning (unlike performance tuning) is not an exact science: it requires some guesswork based on prior history and experience with the environment.
- In my view, performance tuning is about getting the best out of the existing infrastructure, for example by rewriting SQL or creating an index.
- When a user complains of poor application performance, it is important to establish whether the suboptimal user experience comes from a capacity constraint or from code/application issues.
- Capacity planning can help identify performance issues early on.

Capacity Planning Model

[Flow diagram. Box labels, in extraction order: Collect Stats; New Requirements; Establish Pattern and Behaviour; Monitor (Update Profile with Stats on a Regular Basis); Create Profile; Thresholds; Predict Possible Threshold Violations; Threshold Violations; Establish Thresholds; One Time; Define Action Plan for Threshold Violations; Performance (Application); Capacity Constraint; Resolve; Change Based on Action Plan; Thresholds Under Control]

Collecting stats & Profiling

The first step is to identify suitable statistics and capture them (assuming the application is in a steady state). Statistics come from the application as well as from the infrastructure (CPU, memory, storage, etc.).

The next step is to profile the application. Profiling the environment helps in:
- Understanding the needs of the environment.
- Correlating statistics from the application with those from the infrastructure.
- Charting and predicting growth using the previously established thresholds.
- As a result, planning capacity properly to meet the growth.

Profiling an application (contd.)

A profile is basically a snapshot of the application. It shows how the application is performing in terms of key statistics and how those statistics change over a period of time.

As an additional benefit, profiling can in turn help identify performance issues and bottlenecks while the statistics are being captured.

Once profiling is done, the next step is to establish thresholds.

Thresholds

- Thresholds indicate your comfort level. For example: redo generation cannot exceed 50 GB/day; beyond that, I need to revisit my redo configuration.
- Thresholds need to be defined and set for the key statistics in the profile.
- You also identify the course of action to follow if a threshold is violated.
- Reviewing the key statistics in the profile on a daily or weekly basis allows you to plan in advance what changes need to be made.

Oracle perspective

How can I do capacity planning on a proactive basis for my Oracle instance?
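
To make the Thresholds slide concrete, here is a minimal sketch in Python of checking a daily profile against thresholds, each paired with its course of action. Only the 50 GB/day redo limit comes from the slides; the statistic names, the second threshold and the sample values for "today" are hypothetical and purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    stat: str      # key of the statistic in the daily profile (hypothetical name)
    limit: float   # comfort level; values above it count as a violation
    action: str    # course of action to follow if the threshold is violated

def violations(profile, thresholds):
    """Return (stat, value, limit, action) for every statistic above its limit."""
    found = []
    for t in thresholds:
        value = profile.get(t.stat)
        if value is not None and value > t.limit:
            found.append((t.stat, value, t.limit, t.action))
    return found

thresholds = [
    # From the slide: redo per day must not exceed 50 GB.
    Threshold("redo_gb_per_day", 50.0, "revisit redo configuration"),
    # Made-up example threshold, not from the slides.
    Threshold("logons_per_day", 20000.0, "review connection pooling"),
]

# Made-up values standing in for today's profile.
today = {"redo_gb_per_day": 57.3, "logons_per_day": 12500.0}

for stat, value, limit, action in violations(today, thresholds):
    print(f"{stat}: {value} exceeds {limit} -> {action}")
```

Keeping the action next to the limit mirrors the point that the response to a violation is decided when the threshold is set, not when it is breached.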

Oracle – Capacity planning

The answer lies in reviewing, collating and corroborating Oracle statistics with statistics from other subsystems such as the OS, storage and network over a period of time.

The key is to know which statistics to look at, how to interpret the numbers, and how to establish thresholds.

It is essential to know when to drill down into session-level stats and when to stay at the top level; otherwise the stats become overwhelming.

Remember: capacity planning is proactive, whereas performance tuning is mostly reactive.

Oracle Stats and Wait events

- From an Oracle perspective, both stats and wait events need to be captured on an ongoing basis.
- Stats are captured at the instance level and, if required, at the session level.
- To begin with, collect instance-level stats every 24 hours. The finer the interval, the more accurate the results, but collection can become very cumbersome.
- It is best not to use the dba_hist views/AWR, but rather to collect the stats from the v$ views.
- Most v$ views are cumulative: they hold values accumulated since instance startup, so the activity for an interval is the difference between two snapshots.

Oracle Stats and Wait events – contd.

Stats can be collected for:
- Workload
- User related (transactions, logons, parses, etc.)
- Redo activity
- Undo activity
- Temp activity
- Tablespace and object space usage
- PGA usage
- SGA usage
- Parallel operations
- IO operations
- File stats and temp stats

Oracle Stats and Wait events – contd.

Wait events help mostly in performance tuning and in identifying steady-state behaviour.

For wait events:
- Capture the top 10 waits (including CPU time), ordered by time waited, along with the average wait time, total waits and wait class.
- Filter out idle and parallel (PX*) waits.

Infrastructure Statistics

From an infrastructure perspective, the following stats can be collected to begin with.
- CPU: utilization, run queue, context switches (voluntary and involuntary), interrupts, system calls, thread migrations.
- Storage: number of I/O operations per second, queue depth, I/O size, response time (LUN, volume and file level), throughput.
- Filesystem: usage, response time and growth.
- Memory: physical memory consumed, swap in/out, page faults.
- Network: throughput and details from netstat -s and kstat.

It is important to note that OS stats are generally time sampled rather than event driven, so they need to be correlated with application stats to make sense.

Basic Oracle Instance profile

These stats allow us to create a simple, basic profile of the instance that can be used for daily reporting (shown on the next slide).

Even though a large volume of statistics is collected every day, the profile should present only enough information to decide whether further investigation is warranted.

A Simple profile for a Datawarehouse instance

A Simple profile – contd.

DW: average wait in centiseconds. The slide labels the -90, -30 and -7 day columns "(% Delta)".

Event                        Class        -90 days     -30 days     -7 days      Today  Threshold
read by other session        User I/O     2            2.1          3            2      2
db file sequential read      User I/O     0.6          0.5          0.5          0.6    0.5
db file scattered read       User I/O     0.8          0.76         0.81         0.8    < 1
direct path read temp        User I/O     1.56         1.8          1.6          1.6    < 1.5
log file sync                Commit       Not Present  Not Present  0.5          1      0
log file parallel write      System I/O   Not Present  Not Present  0.12         0.2    0
direct path write temp       User I/O     1.73         1.5          1.32         1.7    < 1
db file parallel write       System I/O   0.16         0.12         0.15         0.15   < 0.1
control file parallel write  System I/O   Not Present  Not Present  0.98         1      0
os thread startup            Concurrency  18           14           18           18     < 12

Oracle - Capacity planning (contd.)

To summarize:
- Profile the environment.
- Collect and collate an initial set of statistics (Oracle, OS, storage, network) when the environment is in a steady state and user response time is deemed satisfactory.
- Define and establish thresholds (Oracle, OS, storage and network).
  As before, user response time should be deemed satisfactory.
- Repeat statistics collection over a defined period of time, for example monthly or quarterly.
- Establish a pattern of change: certain statistics increase over a period of time, whereas others decrease.
- Based on the pattern of change, plan on adding capacity.
- At any point during this time, bottlenecks can be identified and resolved accordingly.

Oracle Stats and Waits – v$ views

Common Views      Comments
v$sysstat         Most Oracle statistics
v$sys_time_model  CPU Wait Statistics
v$pgastat         PGA Statistics
v$sgainfo         SGA Statistics
v$filestat        File IO Statistics
v$tempstat        Temp file statistics
dba_free_space    Tablespace space usage
v$sesstat         Session Statistics
v$system_event    Wait Statistics
v$session_event   Session Wait Statistics
v$segstat         Segment Statistics

Oracle Stats Detail (Can be collected on a daily basis)

Workload (source: v$sysstat)
- db block changes
- DB time
- CPU used by this session

Redo (source: v$sysstat)
- redo buffer allocation retries
- redo log space requests
- redo log space wait time
- redo blocks written
- redo entries
- redo size
- redo writes
- background checkpoints completed
- redo synch writes
- redo synch time
- redo write time
- redo wastage

User related (source: v$sysstat)
- opened cursors cumulative
- parse count (failures)
- parse count (hard)
- parse count (total)
- parse time cpu
- execute count
- logons cumulative
- user commits
- user rollbacks

PGA (source: v$pgastat)
- aggregate PGA target parameter
- aggregate PGA auto target
- maximum PGA allocated
- global memory bound
- total PGA used for auto workareas
- over allocation count
- cache hit percentage

PGA (source: v$sysstat)
- sorts (disk)
- sorts (memory)
- sorts (rows)
- workarea executions - multipass
- workarea executions - onepass
- workarea executions - optimal
- workarea memory allocated

SGA (source: v$sgainfo)
- Buffer Cache Size
- Shared Pool Size
- Large Pool Size
- Maximum SGA Size
- Free SGA Memory Available

SGA (source: v$sysstat)
- prefetched blocks aged out before use

Undo (source: v$sysstat)
- consistent gets
- undo change vector size
- consistent changes
- DBWR undo block writes
- transaction rollbacks

Oracle Stats – Sample (Can be collected on a daily basis)

Parallel (source: v$sysstat)
- DDL statements parallelized
- DFO trees parallelized
- DML statements parallelized
- Parallel operations downgraded 1 to 25 pct
- Parallel operations downgraded 25 to 50 pct
- Parallel operations downgraded 50 to 75 pct
- Parallel operations downgraded 75 to 99 pct
- Parallel operations downgraded to serial
- Parallel operations not downgraded
- queries parallelized

IO Related (source: v$sysstat)
- physical read total bytes
- physical read total IO requests
- physical reads direct
- physical reads direct temporary tablespace
- physical read total multi block requests
- physical write total bytes
- physical write total IO requests
- physical write total multi block requests
- physical writes direct
- physical writes direct temporary tablespace
- user I/O wait time

Enqueue (source: v$sysstat)
- enqueue timeouts
- enqueue waits
- enqueue deadlocks
- enqueue requests
- enqueue conversions
- enqueue releases

Table and Index (source: v$sysstat)
- table scans (short tables)
- table scans (long tables)
- table scans (rowid ranges)
- table scans (direct read)
- table fetch by rowid
- table fetch continued row
- index fast full scans (full)
- index fast full scans (rowid ranges)
- index fast full scans (direct read)
- index fetch by key

Oracle Stats Detail – Sample (Can be collected on a daily basis)

DATE       EVENT                   TOTAL_WAITS  TOTAL_TIMEOUTS  TIME_WAITED  AVERAGE_WAIT  WAIT_CLASS
01-Aug-08  db file scattered read  1838346      0               6033377      0.8           User I/O
02-Aug-08  db file scattered read  1906533      0               6034577      0.75          User I/O
03-Aug-08  db file scattered read  1754866      0               5965344      0.9           User I/O
04-Aug-08  db file scattered read  2356571      0               6154334      0.23          User I/O

Oracle Stats Detail (Can be collected on a daily basis)

Used Space in MB

Date       Tablespace1  Tablespace2  Tablespace3  Tablespace n
01-Aug-08  25000        31000        14000        13210
02-Aug-08  25120        32001        14990        13210
03-Aug-08  25220        32150        15010        13210
04-Aug-08  25989        33000        15201        13210

Avl Space in MB

Date       Tablespace1  Tablespace2  Tablespace3  Tablespace n
01-Aug-08  10000        4000         21000        21790
02-Aug-08  9880         2999         20010        21790
03-Aug-08  9780         2850         19990        21790
04-Aug-08  9011         2000         19799        21790

Oracle Stats Detail (Can be collected on a daily basis)

Datafile                  IOPS/Day  Avg Response Time/Day  Max Response Time/Day
/DW/dat01/file1.dbf       35000     10ms                   24ms
/DW/dat01/file2.dbf       120000    12ms                   15ms
/DW/dat01/file3.dbf       68461     15ms                   18ms
/DW/dat01/file4.dbf       58799     8ms                    10ms
..
/dev/vx/rdsk/dwdg/dwtmp0  130000    30ms                   68ms

Infrastructure Statistics (Can be collected on a daily basis)

Infrastructure Stats                    Comments
cpu user time                           sar, vmstat, mpstat
cpu sys time                            sar, vmstat, mpstat
context switches (inv and vol)          mpstat
system calls                            sar, vmstat, mpstat
Filesystem usage                        df
Thread migrations                       mpstat
Interrupts                              vmstat, mpstat
Run queue                               vmstat, sar, w
Network Stats                           netstat and kstat
Memory Stats                            vmstat
File IO Stats (Complements oracle)      odmstat
Volume Stats                            vxstat
Lun Stats                               vxdmpadm, swat and iostat
Queue depth, throughput, response time  vxstat, odmstat, swat, iostat
Storage allocated                       vxdg
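
The "establish a pattern of change, then plan capacity" step can be illustrated with the tablespace sample from the slides. The sketch below, in Python, fits a least-squares line to the daily used-space figures and estimates how many days remain until a tablespace reaches a limit. It is one possible approach, not the method the presentation prescribes. The 35000 MB limit is an assumption derived from used plus available space in the sample, the function names are hypothetical, and four daily samples are far too few for a real forecast.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def days_until(ys, limit):
    """Days after the last sample until the fitted line reaches limit.

    Returns None when the fitted trend is flat or shrinking.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    return max(0.0, (limit - intercept) / slope - (len(ys) - 1))

# Used space in MB for 01-Aug-08 .. 04-Aug-08, copied from the sample slide.
used_mb = {
    "Tablespace1": [25000, 25120, 25220, 25989],
    "Tablespace2": [31000, 32001, 32150, 33000],
    "Tablespace3": [14000, 14990, 15010, 15201],
    "Tablespace n": [13210, 13210, 13210, 13210],
}

# Assumption: each tablespace is 35000 MB (used + available in the sample).
LIMIT_MB = 35000

for name, series in used_mb.items():
    remaining = days_until(series, LIMIT_MB)
    if remaining is None:
        print(f"{name}: no growth trend")
    else:
        print(f"{name}: about {remaining:.1f} days until {LIMIT_MB} MB")
```

In practice the limit would be the threshold set for the tablespace (for example a percentage of its size) rather than the full size, so the action plan can run before space is exhausted.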