<b>Performance Engineering and Capacity Planning</b> - Topics relating to performance engineering/analysis, capacity planning, application impact and forensics etc. as related to large scale deployments.<br />
<br />
<b>11gR2 - Analyzing Compression capabilities</b><br />
<div dir="ltr" style="text-align: left;" trbidi="on"><br />
With 11gR2, Oracle has come a long way with regards to compression capabilities. In almost all benchmarks I have performed, compression is probably one of the most significant factors influencing the performance of queries. Compression beats any other database optimization feature hands-down.<br />
<br />
<br />
While fast-returning queries alone are not enough to meet business requirements, in 11gR2 compression does appear to perform adequately for all operations when compared to the uncompressed default. It does have some caveats, though overall the results are quite good. It may make sense to evaluate using compression as the default for user data. <br />
<br />
To check the viability of compression for a typical use case, I conducted 5 tests -<br />
<ul style="text-align: left;"><li>CTAS</li>
<li>Updates</li>
<li>Conventional Inserts</li>
<li>Deletes </li>
<li>Queries</li>
</ul> The variables were <br />
<ul style="text-align: left;"><li>Time to complete each activity </li>
<li>Object Size after each activity</li>
<li>Query performance</li>
</ul><br />
I did not measure system load during each of these activities; I assume that CPU and memory were sufficient.<br />
<br />
As always, I used a realistic example with real data (not generated). The tests were conducted on a single-instance 11gR2 database on Linux x86-64. I was running the tests on a Quarter Rack Exadata with 3 storage cells; however, all cell offloading was disabled. <br />
<br />
The table in question had 148 million rows and was 36 columns wide. The columns were a mix of VARCHAR2, NUMBER and DATE with NOT NULL constraints. The table is not partitioned. <br />
<br />
<b>Table Creation using CTAS</b> <br />
The first test was the object creation using CTAS. It was done in parallel and with the defaults for extent sizing (auto allocate). <br />
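For reference, the statements behind these variants look roughly like the following - a minimal sketch with illustrative table names (not the actual test table), and the HCC levels assume Exadata or other storage that supports them:<br />
<pre>
-- Uncompressed baseline
CREATE TABLE sales_uncomp NOLOGGING PARALLEL 16
AS SELECT * FROM sales_src;

-- OLTP (row) compression
CREATE TABLE sales_oltp NOLOGGING COMPRESS FOR OLTP PARALLEL 16
AS SELECT * FROM sales_src;

-- Hybrid Columnar Compression, warehouse and archive levels
CREATE TABLE sales_hcc_query NOLOGGING COMPRESS FOR QUERY HIGH PARALLEL 16
AS SELECT * FROM sales_src;

CREATE TABLE sales_hcc_arch NOLOGGING COMPRESS FOR ARCHIVE HIGH PARALLEL 16
AS SELECT * FROM sales_src;
</pre>
<br />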
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcHQYfMNUZ3PgmgSjKC-_SrJ9WjL1H8Us2leCzkAGfhKrwWNjB13U_C_0WVXL7FXfxj4Qlh6qpHooQfhDdy2kWy2MhIqDHgvSj_UWDgUx2MqizkMYQ6QmTrvP-MDkC0l86oLPFucfx8zaK/s1600/Compression_size2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="408" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcHQYfMNUZ3PgmgSjKC-_SrJ9WjL1H8Us2leCzkAGfhKrwWNjB13U_C_0WVXL7FXfxj4Qlh6qpHooQfhDdy2kWy2MhIqDHgvSj_UWDgUx2MqizkMYQ6QmTrvP-MDkC0l86oLPFucfx8zaK/s640/Compression_size2.jpg" width="640" /></a></div><br />
I also generated a flat file for the table and compressed using Gzip and Bzip to get an idea as to how it compares to Database compression.<br />
<br />
As you can see, database compression ranges from 2.5x (OLTP) to 9x (HCC Archive), which is more or less comparable to what is normally seen in the real world. If you had historical read-only data, then storing it as external tables (compressed flat files) would probably be a better idea than storing it in the database as a regular uncompressed table. With 11gR2, external tables have come a long way.<br />
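As a rough illustration of that option, 11gR2 external tables can read gzip-compressed files directly via the ORACLE_LOADER preprocessor feature. The sketch below is hypothetical - the directory objects, the zcat wrapper script and the column list are all illustrative:<br />
<pre>
-- ext_data holds the compressed data file, ext_exec holds a small shell
-- wrapper that runs "zcat" (or "gunzip -c") on the file name passed to it.
CREATE TABLE sales_hist_ext (
  order_id NUMBER,
  cust_id  NUMBER,
  amount   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR ext_exec:'zcat.sh'
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('sales_hist.dat.gz')
)
REJECT LIMIT UNLIMITED;
</pre>
<br />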
<b><br />
</b><br />
<b>Updating rows in the table</b><br />
<br />
Compression has always received a bad review due to poor performance during updates. But how bad of a performance hit is there with an update? When talking about performance here, I am referring to the time taken to update and the growth in size of the object, followed by query performance.<br />
<br />
Generally I would assume that if you are planning on updating more than 10% of a big table, it would be better to rewrite the update as a CTAS rather than do an update. In order to simulate a worst-case scenario, I updated 2 columns in 11% of the table (15.5 million rows) to gauge the effect of the update.<br />
<br />
<b>Deleting rows in the table</b><br />
<br />
The same can be said of deleting rows too. In order to gauge the impact of deletion, I deleted 6% of the rows (5 million) from the table.<br />
<br />
<br />
<b>Inserting rows into the table</b><br />
<br />
Direct path loads may not be feasible at all times, so I inserted 1 million rows into the table using conventional (buffered) path inserts.<br />
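The DML for these three tests was of the following general shape - again a hypothetical sketch with illustrative names rather than the exact statements used:<br />
<pre>
-- Bulk update of 2 columns
UPDATE sales_oltp
SET    status_cd = 'X',
       last_upd_dt = SYSDATE
WHERE  order_dt >= DATE '2010-01-01';
COMMIT;

-- Bulk delete
DELETE FROM sales_oltp
WHERE  order_dt >= DATE '2010-06-01';
COMMIT;

-- Conventional (buffered) path insert - note there is no APPEND hint
INSERT INTO sales_oltp
SELECT * FROM sales_stage;
COMMIT;
</pre>
<br />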
<br />
<b>Query performance</b><br />
<br />
After each of the tests, I ran a query which required a full tablescan to see the impact on query performance.<br />
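A query of this general form would force the full tablescan - a sketch with illustrative names, not the actual query used:<br />
<pre>
SELECT /*+ FULL(s) PARALLEL(s 16) */
       COUNT(*), SUM(s.amount)
FROM   sales_oltp s
WHERE  s.order_dt >= DATE '2009-01-01';
</pre>
<br />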
<br />
<b>Results</b><br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjqefz8P-6KYBGh4o13UtovNkjl_wbV7dJxYd-9A1Xfv9XtV9nuY8rGARVf9VH4Jr53obNkDcNVfBCG2L0kWunSglk_6J_RUl5nAECM-JesvC85d7oY31AWz2jiTr8tkEpaWWEruK5d1dz/s1600/DML_Perf2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjqefz8P-6KYBGh4o13UtovNkjl_wbV7dJxYd-9A1Xfv9XtV9nuY8rGARVf9VH4Jr53obNkDcNVfBCG2L0kWunSglk_6J_RUl5nAECM-JesvC85d7oY31AWz2jiTr8tkEpaWWEruK5d1dz/s640/DML_Perf2.jpg" width="640" /></a></div><br />
For a CTAS and Update, the uncompressed version of the table outperformed the compressed versions. There was a difference of approx 2x. However for Deletes and Inserts, the performance was either the same or slightly better with the compressed versions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNMEEqJEVBNYcYHvpTT19HOtKUfhl58IkiO2LSW_icYfBuXQn2s15tW9a0Y41rkYo7nKIsWB9NOFt0EapMyBmHjCItv9WX2mLvlUm87UY5gwu0_PvLgL09lCtnkYvG-koI0b_1o9PYgl9T/s1600/Query_performance2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNMEEqJEVBNYcYHvpTT19HOtKUfhl58IkiO2LSW_icYfBuXQn2s15tW9a0Y41rkYo7nKIsWB9NOFt0EapMyBmHjCItv9WX2mLvlUm87UY5gwu0_PvLgL09lCtnkYvG-koI0b_1o9PYgl9T/s640/Query_performance2.jpg" width="640" /></a></div><br />
<br />
As regards query performance, the compressed tables always beat the non-compressed version. Higher compression gives better query performance.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWX4cI979iLMM8AhsGqE4h1mmuXPOSboDRXkjrtwWZNPz0Yg9R3R3do3w9B0Kbi4B9J-VQ5ZDzfvAqcTHFDHkBDJivyMDMW1QcArT5r_YuxIu41NBHN5K9xONg3Y_0_Lm605NkNf7tO4Ml/s1600/DML_Size2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWX4cI979iLMM8AhsGqE4h1mmuXPOSboDRXkjrtwWZNPz0Yg9R3R3do3w9B0Kbi4B9J-VQ5ZDzfvAqcTHFDHkBDJivyMDMW1QcArT5r_YuxIu41NBHN5K9xONg3Y_0_Lm605NkNf7tO4Ml/s640/DML_Size2.jpg" width="640" /></a></div><br />
And finally, the table size after each DML operation. A bulk update has resulted in growth of the table, however nowhere near the size of the uncompressed version. Inserts have re-used space freed by the deletes.<br />
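Object sizes after each step can be tracked with a simple query against DBA_SEGMENTS - the segment names below are the illustrative ones used in the earlier sketches:<br />
<pre>
SELECT segment_name,
       ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb
FROM   dba_segments
WHERE  segment_name IN ('SALES_UNCOMP', 'SALES_OLTP',
                        'SALES_HCC_QUERY', 'SALES_HCC_ARCH');
</pre>
<br />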
<br />
<b>Conclusions</b><br />
<br />
<ol style="text-align: left;"><li>With compression, space savings can be significant. </li>
<li>DML does grow the object, however the size is still considerably smaller than the uncompressed version.</li>
<li>Bulk Updates still perform slower than uncompressed. </li>
<li>Deletes and conventional inserts perform about the same as the uncompressed version.</li>
<li>Query performance - Compression improves performance significantly.</li>
<li>For historical or archived read-only data, External Tables as compressed flat files may be a viable option rather than storing in the Database.</li>
</ol><br />
<br />
</div>
<br />
<b>A tool for analyzing Storage Performance - SWAT from Oracle</b><br />
Storage is probably the most commonly found bottleneck in the majority of databases.<br />
<br />
Most Database Administrators are not Storage Engineers and vice-versa. For a Database Administrator, translating a Database requirement/problem into Storage terminology is not easy - even with all the stats available from the Oracle Database. <i><b> </b></i><br />
<br />
<i><b>Generally, there are 2 common storage issues faced - performance problems and capacity planning related questions. </b></i><br />
<br />
<br />
As regards performance, rather than saying <i>"I am seeing a lot of storage related waits , so Storage is slow, can you please fix it"</i> - it is preferable to talk to your Storage Admin in a language he can understand.<br />
<br />
<i>Remember, Storage Engineers look at storage performance from a LUN perspective and not from an application stats perspective.</i> They would be least concerned about log file parallel write wait times being 20ms. On the other hand, if you tell them that LUNs 10 and 11 (being used for redo) are write intensive with a lot of small block IOPS and you are seeing high response times, it makes perfect sense to them. <br />
<br />
From a capacity planning and performance trending perspective, how are we to actually track changes in storage performance over a period of time? Capacity planning is not just about free space, it is an overall look at the health of your storage subsystem with respect to increasing application demands.<br />
<br />
Since it is the Database Administrator making the request, it would probably make sense for the DBA to learn Storage terminology. Knowledge is power and knowing your requirements well would make all the difference and possibly earn the respect of your Storage Engineer. <br />
<br />
Most Storage Engineers generally look for the following with regards to addressing performance issues and capacity planning (today and in the future)<br />
<br />
<ol><li>Size of the working Data Set <br />
</li>
<li>Number of IOPS being generated </li>
<li> Size of an IOP <br />
</li>
<li>% Reads versus % Writes<br />
</li>
<li>Expected Response Time versus actuals</li>
<li>Data Rate or Bandwidth consumed today</li>
<li>And so on.</li>
</ol><br />
We can get into a lot more detail on each of the above line items. To keep it simple and begin a conversation, this would suffice.<br />
<br />
How are we to obtain this information from an Oracle Database?<br />
<br />
<br />
Yes, it is possible to get this information from the various v$ views and stats, however what we need is a simple tool which can monitor the system and collect/collate/trend all this information in a meaningful manner.<br />
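For instance, a crude starting point (by no means a substitute for a proper tool) is to sample a few v$sysstat counters twice and work with the deltas over the interval - that gives overall IOPS, the read/write mix and the average IO size:<br />
<pre>
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('physical read total IO requests',
                'physical write total IO requests',
                'physical read total bytes',
                'physical write total bytes');
</pre>
<br />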
<br />
<br />
<br />
There are commercial tools available, however we need something which is lightweight, flexible and potentially available at no added cost.<br />
<br />
SWAT is a tool from Oracle and, for the features and capabilities it offers, it is very lightweight and efficient. <br />
SWAT stands for Sun StorageTek Workload Analysis Tool. There is no equivalent to SWAT that I have seen. <i><b>Incredibly enough, it is not very well known.</b></i><br />
<br />
<br />
It takes a few minutes to set it up, and it can run as a non-privileged user. As long as Java is installed, you are good to go. It is supported on Solaris, Linux and Windows systems.<br />
<br />
It can be set up to run continuously in the background and so is invaluable for capacity planning requirements.<br />
<br />
To get a copy of SWAT, I would suggest that you approach your Oracle Support contact. The author of SWAT is <a href="http://blogs.sun.com/henk/">Henk Vandenbergh</a>. He is also the author for vdbench which is an excellent tool for testing IO Subsystems.<br />
<br />
Some screen shots from SWAT. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVuEj2Wvjh8KoJg0c7Fn0Y6B9XnTyi2yT7VY_NF5rWORQLatvw-3PIC4Fmk6uk8T0a0bmWT7TUzBVyRpajP2uwn0lu1yy7UlXnXy4hfCbBIW3J1C8wmEMejyPkTk22WufpVtdUqQ_b5DgJ/s1600/SWAT-1.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVuEj2Wvjh8KoJg0c7Fn0Y6B9XnTyi2yT7VY_NF5rWORQLatvw-3PIC4Fmk6uk8T0a0bmWT7TUzBVyRpajP2uwn0lu1yy7UlXnXy4hfCbBIW3J1C8wmEMejyPkTk22WufpVtdUqQ_b5DgJ/s640/SWAT-1.JPG" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhmy0PBJTXcK3T34_LxRHoJiC_vDUkb81iqDbhojzOsrOUS7uY3k-8b8q7DJcU95bo1E7DZvBhKF0g85PY60WI5XYi2i6h_ERgSDw0cM5x_GUI-3Rbj-vDXfKo9SKmOdh35uclTYgMGARZ/s1600/SWAT-2.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhmy0PBJTXcK3T34_LxRHoJiC_vDUkb81iqDbhojzOsrOUS7uY3k-8b8q7DJcU95bo1E7DZvBhKF0g85PY60WI5XYi2i6h_ERgSDw0cM5x_GUI-3Rbj-vDXfKo9SKmOdh35uclTYgMGARZ/s640/SWAT-2.JPG" width="505" /></a></div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1550840387946350137.post-84927341476732875292010-11-23T22:36:00.000-08:002010-12-15T21:29:19.471-08:00SATA versus SAS - Realistic ExpectationsAs a Performance Engineer, I do benchmarks all the time. And I frequently get asked about SAS versus<br />
SATA drives. While it is a no-brainer as to which performs better, SATA drives are priced at about 50% of the cost of SAS drives and offer a lot more capacity.<br />
<br />
Price and capacity are attractive enough, but under what circumstances can a SATA drive perform comparably to a SAS drive? As I do benchmarks on Oracle databases, I did a few simple tests to ascertain the performance characteristics of SATA versus SAS.<br />
<br />
The drives I used for the tests are from Seagate. The SATA Drive is the <a href="http://www.seagate.com/www/en-us/products/enterprise-hard-drives/constellation-es/constellation-es-1">Seagate Constellation ES 7200 RPM 2TB Drive with SAS interfaces</a>. The SAS Drive is the <a href="http://www.seagate.com/www/en-us/products/enterprise-hard-drives/cheetah-15k">Seagate Cheetah 15K RPM 600GB Drive with SAS interfaces. </a><br />
<br />
Both are 3.5-inch drives.<br />
<br />
The below table summarizes some key specs of these drives.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://sites.google.com/site/docsfordownload/files/Drive_compare.JPG?attredirects=0" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://sites.google.com/site/docsfordownload/files/Drive_compare.JPG?attredirects=0" /></a></div><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
The newer generation of SATA drives are equipped with SAS interfaces. Basically they look the same except for internals. <i><b>Of particular interest is the Areal Density showing that SATA drives are far more densely packed than SAS.</b></i><br />
<br />
<br />
For the test, I created a partition of 64GB from the outermost sectors of the drive. From my experience, <br />
for reasonable performance from a hard drive, short stroking is a must. The degree of short stroking a drive depends on how much storage you are willing to sacrifice for performance. To give an example, for optimal performance, a 146GB SAS drive must be short stroked to ~50GB (1/3rd of the size). <br />
<br />
I used the partition as-is with no filesystem - basically a raw device.<br />
<br />
<br />
I performed 3 tests on the Drives using vdbench. <a href="http://vdbench.org/">Vdbench </a>is a tool from Oracle and is highly flexible in terms of testing options. It is authored by <a href="http://blogs.sun.com/henk/">Henk Vandenbergh</a> from Oracle.<br />
<br />
<ol><li><b>Transaction System </b>- Small reads/writes of 16K IO size simulating single block activity - 70% Reads + 30% Writes.<b> </b></li>
<li><b>DW System</b> - Large reads/writes of 1024K IO size simulating multiblock activity - 70% Reads + 30% writes. <b> </b></li>
<li><b>Hybrid System</b> - Combination of 16K and 1024K IO sizes - 70% Reads + 30% writes</li>
</ol><br />
I generally only focus on response time as this is what an application such as Oracle will report on via the wait events. Ideally a single block IO request would complete in < 5ms (peak) and a large block IO request in < 20ms (peak).<br />
<br />
This is hard to meet with Spinning Media during high concurrency and so I would settle for peak sustainable response time < 15-20 ms.<br />
<br />
The number of IOPS is dependent on Use Case Scenarios and cannot be generalized.<br />
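As an aside - separate from the vdbench runs - on a live database the actual distribution of single block read latencies can be checked with a quick query like the one below, which helps decide what response time threshold matters for your workload:<br />
<pre>
SELECT wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event = 'db file sequential read'
ORDER  BY wait_time_milli;
</pre>
<br />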
<br />
The graphs below correlate IOPS versus Response Time. As the number of requests (IOPS) increases, the response time for an IO will start increasing.<br />
<br />
<b><i>We would be more interested in where exactly the hard drive fails (in the sense that it cannot deliver a predictable response time) or cannot satisfy the number of IO requests.</i></b><br />
<br />
<b>Transaction System Test</b><br />
<br />
<br />
If you are considering SATA for a Transaction System, then depending on the number of single block requests and your threshold for response times, it may be cost effective to use SATA. Looking at the below graph, for a 16K block, it would appear that 100 IOPS/drive would be about the maximum a SATA drive can sustain without falling off a cliff. If using an 8K block size, it would be even better.<br />
<br />
<br />
<br />
<br />
<a href="https://sites.google.com/site/docsfordownload/files/16K_Test.JPG?attredirects=0" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="400" src="https://sites.google.com/site/docsfordownload/files/16K_Test.JPG?attredirects=0" width="640" /></a><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<b>DW System Test</b><br />
<br />
A DW system is typically characterized by a smaller number of large block IOs when compared to a transaction system. As you can see below, a SATA drive cannot sustain more than 30-40 IOPS before experiencing a failure in response time. And it cannot sustain more than 68 IOPS.<br />
<br />
A SAS drive is a lot more scalable and predictable than a SATA drive. Even doubling the number of SATA drives cannot equal the performance of a single SAS drive. So SAS would be a much better fit for DW than SATA. <br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://sites.google.com/site/docsfordownload/files/1024K_test.JPG?attredirects=0" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://sites.google.com/site/docsfordownload/files/1024K_test.JPG?attredirects=0" width="640" /></a></div><b>Hybrid System Test</b><br />
<br />
For a Hybrid System, again, it may not make sense to use SATA. You can squeeze a few more IOPS out of the drive (max of 105) before it fails, but the response times are high.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://sites.google.com/site/docsfordownload/files/16K_1024K_mix.JPG?attredirects=0" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://sites.google.com/site/docsfordownload/files/16K_1024K_mix.JPG?attredirects=0" width="640" /></a></div><br />
<b>To summarize, for low to mid volume transaction systems, SATA drives may be quite affordable and also deliver reasonable performance. For DW and Hybrid systems, it is driven by use-case and potentially not effective or efficient to use SATA.</b><br />
<br />
<b>Comparing Data Warehouse Appliances - Evaluation Criteria from an End user perspective</b><br />
It is so easy to lose track of the big picture when comparing different DW Appliance vendors. Different technologies bring different strengths to the table. With that perspective, in a recent project, I had been working on a list of criteria to make it easy to evaluate vendors.<br />
<br />
There is no "One size fits all" and there is no perfect Appliance. Every appliance has its perceived strengths and weaknesses. It is up to the end user to evaluate and identify as to which Appliance will fit the needs adequately.<br />
<br />
From what I have seen, it is a combination of many factors responsible for a successful and stable deployment.<br />
<br />
I hope this selection criteria will be useful for others too. The presentation can be downloaded from <a href="http://sites.google.com/site/docsfordownload/files/DWAppliance-EvaluationCriteria.ppt?attredirects=0&d=1">here.</a><br />
<br />
<a href="http://www.scribd.com/doc/33300957/DW-Appliance-Evaluation-Criteria" style="display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px; text-decoration: underline;" title="View DW Appliance - Evaluation Criteria on Scribd">DW Appliance - Evaluation Criteria</a> <object data="http://d1.scribdassets.com/ScribdViewer.swf" height="600" id="doc_15875" name="doc_15875" style="outline-color: -moz-use-text-color; outline-style: none; outline-width: medium;" type="application/x-shockwave-flash" width="600"> <param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"> <param name="wmode" value="opaque"> <param name="bgcolor" value="#ffffff"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="FlashVars" value="document_id=33300957&access_key=key-dhrjgi7c7yzxzvwfbru&page=1&viewMode=slideshow"> <embed id="doc_15875" name="doc_15875" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=33300957&access_key=key-dhrjgi7c7yzxzvwfbru&page=1&viewMode=slideshow" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="600" wmode="opaque" bgcolor="#ffffff"></embed> </object>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1550840387946350137.post-43573464372580664002010-02-02T18:00:00.000-08:002010-02-02T18:00:48.719-08:00Joins - ANSI SQL versus Traditional Oracle (+) syntaxSometime back (maybe with 9i?), Oracle introduced support for the ANSI join syntax - inner join, full outer join, left outer join and right outer join. This could be used in place of the (+) operator.<br />
<br />
I came across a curious problem on Oracle 10.2.0.4. The explain plans were markedly different (with different execution times as well) when the SQL was written using ANSI syntax versus the traditional Oracle (+) syntax. The traditional syntax was a lot faster, and even when passing hints with the ANSI style, I could not force a similar execution plan.<br />
<br />
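For readers unfamiliar with the two styles, here is a simple illustrative outer join written both ways (not the statement from the problem itself):<br />
<pre>
-- ANSI syntax
SELECT o.order_id, c.cust_name
FROM   orders o
LEFT OUTER JOIN customers c ON c.cust_id = o.cust_id;

-- Traditional Oracle (+) syntax; the (+) goes on the optional (deficient) side
SELECT o.order_id, c.cust_name
FROM   orders o, customers c
WHERE  c.cust_id (+) = o.cust_id;
</pre>
<br />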
<span style="font-weight: bold;">ANSI Style</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIsYSRz_HjhpTBJbx9u3Y6-dpFBY_jqQ01ouNRiLnQMIcezdfAOm3e-nz6AKEAJ9t044-6GpkKKFuZnXHpYYwdv_JLbY2GL9OEJXVe3POndRo6rsvNQJhLQEVfPFOCEcaVpG2yCNhCTSji/s1600-h/ANSI-join.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIsYSRz_HjhpTBJbx9u3Y6-dpFBY_jqQ01ouNRiLnQMIcezdfAOm3e-nz6AKEAJ9t044-6GpkKKFuZnXHpYYwdv_JLbY2GL9OEJXVe3POndRo6rsvNQJhLQEVfPFOCEcaVpG2yCNhCTSji/s640/ANSI-join.jpg" width="640" /></a></div><br />
<span style="font-weight: bold;">Traditional Oracle Syntax</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi__EaLR1UuVYWk6XyZhW52qW1VtdRo3hJYZ6QTvqiBNiGLE2TdNj95OggYFcRv-OzxNgSqbf-9wb3Rlftsnx7mj0ZM12iE3nHTh73jUl2hWqX-DaZpTQYDfQzUHnGQL4tmeSSVrrFOsOKi/s1600-h/Traditional-Oracle-join.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="420" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi__EaLR1UuVYWk6XyZhW52qW1VtdRo3hJYZ6QTvqiBNiGLE2TdNj95OggYFcRv-OzxNgSqbf-9wb3Rlftsnx7mj0ZM12iE3nHTh73jUl2hWqX-DaZpTQYDfQzUHnGQL4tmeSSVrrFOsOKi/s640/Traditional-Oracle-join.jpg" width="640" /></a></div><br />
<br />
<br />
As to why the CBO was behaving such, I have no idea. But if you do run into problems with joins, it may make sense to change the sql to traditional oracle syntax.Unknownnoreply@blogger.com5tag:blogger.com,1999:blog-1550840387946350137.post-66609262474264268412010-01-27T17:08:00.000-08:002010-02-02T16:28:38.044-08:00Understanding CPU Time as an Oracle Wait event<span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;">We have all seen "CPU Time" as a Top 5 wait event at one point or the other when troubleshooting performance issues. From an administrator (DBA/SA) perspective, conventional thinking would be that CPU is a bottleneck.</span> <span style="font-family: Georgia,serif;"><br /> <br /> </span> <span style="font-family: Georgia,serif; font-style: italic; font-weight: bold;">But what if the stats from the system show that CPU Utilization (% Util and Run queue) are well within thresholds and show plenty of available capacity, but Oracle continues to report CPU time as a Top 5 wait event?<br /> <br /> We are also seeing high degree of involuntary context switching (using mpstat in Solaris) or context switches (using vmstat in Linux). Obviously, something is not computing right.</span> <span style="font-family: Georgia,serif;"><br /> <br /> </span> <span style="font-family: Georgia,serif;">CPU Time could mean that the process is either</span> <span style="font-family: Georgia,serif;"><br /> </span></span> <ul face="georgia"> <li> <span style="font-family: Georgia,serif;">On a CPU run queue waiting to be scheduled<br /> </span></li> <li> <span style="font-family: Georgia,serif;">Or currently running on a CPU. </span></li> </ul> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;">Obviously, we are interested in</span><br /> </span> <ul style="font-family: georgia;"> <li> <span style="font-family: Georgia,serif;">Minimizing the wait time on the run queue so that the session can run on the CPU as soon as possible. This is determined by the priority of the process.<br /> </span></li> <li> <span style="font-family: Georgia,serif;">And once running on the CPU, be allowed to run on the CPU to complete its tasks. The amount of time available for the process on the CPU is defined as the Time Quanta.<br /> </span></li> </ul> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;">Both of the above aspects are controlled by the OS Scheduler/Dispatcher. From the </span> <span style="font-family: Georgia,serif; font-size: 11pt;"><a href="http://en.wikipedia.org/wiki/Scheduling_%28computing%29" style="font-family: georgia;">wiki page for scheduling</a><br /> <br /> </span> <span style="font-family: Georgia,serif; font-size: 11pt; font-style: italic;">"Scheduling is a key concept in computer multitasking, multiprocessing operating system and real-time operating system designs. In modern operating systems, there are typically many more processes running than there are CPUs available to run them. Scheduling refers to the way processes are assigned to run on the available CPUs. 
This assignment is carried out by software known as a scheduler or is sometimes referred to as a dispatcher"</span> <span style="font-family: Georgia,serif; font-size: 11pt;"><br /> <br /> </span> <span style="font-family: Georgia,serif; font-size: 11pt;">Understanding how the scheduler shares CPU resources is key to understanding and influencing the wait event "CPU Time".<br /> <br /> In any Unix platform, there are processes which take higher priority than others. Labeling a process as higher priority can be done through the implementation of Scheduling classes and with the nice command. Both can have different effects on the process.<br /> <br /> An easy method to identify the scheduling class and current priority for a process is to use the ps command. Used with the "-flycae" arguments, it shows both the scheduling class and current priority. However it does not show the CPU time quanta associated with a process.</span></span><br /> <br style="color: rgb(0, 40, 146);" /> <div style="text-align: justify; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">dbrac{714}$ ps -flycae |egrep "ora_" |more</span><br /> <br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S UID PID PPID CLS PRI CMD</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S oracle 931 1 TS 24 ora_p000_DW</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S oracle 933 1 TS 24 ora_p001_DW</span><br /> </span></div> <span style="font-family: Georgia,serif;"><br /> <span style="font-family: Georgia,serif; font-size: 11pt;">In the above example, you would be interested in the CLS and PRI column. The above example shows that the oracle background processes as running under the TS Scheduling class with a priority of 24. The higher the number reported in the PRI column, the higher the priority.<br /> <br /> The default Scheduler for user processes is TS or Time Share and is common across Solaris and Linux. The TS scheduler changes priorities and CPU time quantas for processes based on recent processor usage.</span></span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-size: 11pt;"><br /> <br /> </span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif; font-size: 11pt;">Since we appear to have plenty of CPU resources, we could draw the conclusion that the default (TS) scheduling class does not appear to be good enough for us. Either the scheduler is not allocating sufficient CPU time quanta (resulting in involuntary context switching) or not giving the process a sufficiently higher priority so that it can be scheduled earlier than other processes.</span><br /> </span><br /> </span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;">So how do we change it? 
Obviously we would want to</span><br /> </span></span> <ul style="font-family: georgia;"> <li> <span style="font-family: Georgia,serif;">set a fixed priority for Oracle processes so that they are able to run on the CPU ahead of other competing processes.<br /> </span></li> <li> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif; font-size: 11pt;">set a fixed time quanta for Oracle processes so that they can run to completion on the CPU.</span><br /> </span></li> </ul> <span style="font-family: Georgia,serif;">With either Solaris or Linux, the easiest way to implement this is to change the Scheduling class for the oracle processes. Both the Operating systems offer a Scheduling class with Fixed Priorities and Fixed CPU Time Quantas - Fixed in the sense it is fixed throughout the lifetime of the process, but can be changed to suit your requirements at any time.<br /> <br /> <span style="font-family: Georgia,serif;">In Linux, it is the RR class and in Solaris it is the FX class. The simplest way to change the scheduling class is to use the priocntl tool. While it is a native binary on Solaris, it is available on Linux through the Heirloom Project.<br /> <br /> On Linux, you would need to use the renice command to change the CPU time quantas and on Solaris, priocntl does both - scheduling class and time quanta.<br /> <br /> Let us look at a few examples - </span></span> <span style="font-family: Georgia,serif;"><br /> <br /> <span style="font-family: Georgia,serif;"> <span style="font-weight: bold; font-family: Georgia,serif;"> On Linux </span>- Let us try and change the Scheduling Class and Time Quanta for the Log Writer.</span></span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"></span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"></span> <span style="font-family: 'Courier New',Courier,monospace; color: rgb(0, 40, 146); font-weight: bold;">[root@dbrac root]# ./priocntl -l<br /> CONFIGURED CLASSES<br /> ==================<br /> <br /> TS (Time Sharing)<br /> Configured TS User Priority Range: -19 through 20<br /> <br /> RT (Real Time Round Robin)<br /> Maximum Configured RT Priority: 99<br /> <br /> FF (Real Time First In-First Out)<br /> Maximum Configured FI Priority: 99<br /> </span><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"></span> <span style="font-family: Georgia,serif; font-size: 11pt;">We see that there are 3 Scheduling classes available for use.</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br style="color: rgb(0, 40, 146); font-weight: bold;" /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;">[root@dbrac root]# ps -flycae |grep ora_lgwr</span><br style="color: rgb(0, 40, 146); font-weight: bold;" /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"><br /> S UID PID PPID CLS PRI CMD</span><br style="color: rgb(0, 40, 146); font-weight: bold;" /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;">S oracle 30318 1 TS 23 ora_lgwr_DWRAC</span><br /> </span><br /> </span> <span style="font-family: Georgia,serif; font-size: 
11pt;">It shows that LGWR is running in TS class with a Priority of 23.</span><br /> <br style="font-weight: bold; color: rgb(0, 40, 146);" /> <span style="font-family: 'Courier New',Courier,monospace; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ./priocntl -d 30318</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">TIME SHARING PROCESSES </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> PID TSUPRI</span></span> <span style="font-family: 'Courier New',Courier,monospace;"><br /> </span></span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-family: 'Courier New',Courier,monospace; color: rgb(0, 40, 146);"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; font-weight: bold; color: rgb(0, 40, 146);"> 30318 0</span><br /> </span><br /> </span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif; font-size: 11pt;">Let us change LGWR to RT class with a RT priority of 50.</span><br /> </span><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ./priocntl -s -c RT -p 50 -i pid 30318</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> <span style="font-size: 10pt;"><br /> </span></span> <span style="font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ./priocntl -d 30318</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">REAL TIME PROCESSES</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> PID RTPRI TQNTM</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> 30318 50 99</span></span></span> <span style="font-family: Georgia,serif; font-weight: bold;"><br /> </span><br /> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;">It shows that the RT priority is 50 and the Time Quanta is 99.</span></span></span></span><br /> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-size: 10pt;">[root@dbrac root]# ps -flycae |grep ora_lgwr</span><br /> <br /> </span></span></span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S UID PID PPID CLS PRI CMD</span></span> <span 
style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146);"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; font-weight: bold;"><br /> S oracle 30318 1 RR 90 ora_lgwr_DWRAC</span><br /> </span><br /> </span></span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif; font-size: 11pt;">Note that even though the RT priority is 50, ps shows the PRI as 90</span>. </span></span></span></span> <span style="font-family: Georgia,serif;"><br /> <br /> </span> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;">Let us change the time quanta for the Log writer.</span></span></span></span> <span style="font-family: Georgia,serif;"><br /> </span> <span style="font-family: Georgia,serif;"><br /> <span style="font-family: 'Courier New',Courier,monospace; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# renice +2 30318</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">30318: old priority 0, new priority 2</span><br /> </span><br /> </span></span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ps -flycae |grep ora_lgwr</span><br /> <br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S UID PID PPID CLS PRI CMD</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">S oracle 30318 1 RR 90 ora_lgwr_DWRAC</span></span></span><br /> </span> <span style="font-family: Georgia,serif;"><br /> </span> <span style="font-family: Georgia,serif; font-size: 11pt;">No change in the PRI after renicing a RT process (expected). </span><br /> <br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;">[root@dbrac root]# ./priocntl -d 30318<br /> REAL TIME PROCESSES<br /> PID RTPRI TQNTM<br /> 30318 50 89</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-size: 11pt;"><br /> </span></span> <span style="font-family: Georgia,serif; font-size: 11pt;">But when checking with priocntl, we see that the time quanta is now 89 (Previous was 99). 
<span style="font-family: Georgia,serif; font-size: 11pt;"><br /> <br /> </span>Let us see if we can increase the time quanta.</span> <span style="font-family: Georgia,serif;"> <span style="font-family: Georgia,serif;"><br /> </span><br style="font-weight: bold; color: rgb(0, 40, 146);" /> <span style="font-family: 'Courier New',Courier,monospace; font-weight: bold; color: rgb(0, 40, 146);"> <span style="font-size: 10pt; font-family: 'Courier New',Courier,monospace;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# renice -3 30318</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">30318: old priority 2, new priority -3</span></span><br /> <br /> </span></span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; font-weight: bold; color: rgb(0, 40, 146);"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"></span></span></span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"></span></span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> <span style="font-size: 10pt; font-weight: bold; color: rgb(0, 40, 146);"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ./priocntl -d 30318</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">REAL TIME PROCESSES</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> PID RTPRI TQNTM</span><br /> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> 30318 50 459</span></span><br /> </span><br /> </span> <span style="font-family: Georgia,serif; font-size: 11pt;">Now the time quanta is 459. Higher the time quanta, the more time the process can spend on the CPU before being context switched out.</span><br /> <br /> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-family: Georgia,serif; font-size: 11pt;"> <span style="font-weight: bold; font-family: Georgia,serif; font-size: 11pt;">For Solaris</span> - </span>priocntl can be used to set the Scheduling class and the time quanta simultaneously. I am not going to show any examples here as it would be the same as above.<br /> <br /> <span style="font-family: Georgia,serif; font-size: 11pt; font-weight: bold; font-style: italic;">Now, as to which processes (background/shadow) need to have a higher priority than others, that is a decision which requires significant amount of testing. 
I have seen 30% improvements in load timings when changing scheduling properties, however it has the potential to completely break the environment if not done correctly.</span><br style="font-weight: bold; font-style: italic;" /> <br /> Interestingly enough, when running Oracle RAC on Linux, you would notice that the lms process are now running under the RR Scheduling class.</span><br /> <br /> <span style="font-family: 'Courier New',Courier,monospace; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace;"> <span style="font-family: 'Courier New',Courier,monospace;">dbrac{720}$ ps -flycae |grep ora_lms<br /> </span></span> <span style="font-family: 'Courier New',Courier,monospace;"> <span style="font-family: 'Courier New',Courier,monospace;"><br /> S UID PID PPID CLS PRI CMD</span><br /> </span> <span style="font-family: 'Courier New',Courier,monospace;"> <span style="font-family: 'Courier New',Courier,monospace;">S oracle 9074 1 RR 41 ora_lms0_DWRAC</span><br /> <span style="font-family: 'Courier New',Courier,monospace;">S oracle 9078 1 RR 41 ora_lms1_DWRAC</span><br /> <br /> <br /> </span></span> <span style="font-family: Georgia,serif;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt; color: rgb(0, 40, 146); font-weight: bold;"> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">[root@dbrac root]# ./priocntl -d 30306</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;">REAL TIME PROCESSES</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> PID RTPRI TQNTM</span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"><br /> </span> <span style="font-family: 'Courier New',Courier,monospace; font-size: 10pt;"> 30306 50 99</span></span><br /> </span>Unknownnoreply@blogger.com5tag:blogger.com,1999:blog-1550840387946350137.post-60675420554421902582010-01-26T11:26:00.000-08:002010-01-26T16:55:27.855-08:00ora-4030 errors and swap sizing on DataWarehousing platforms<span style="font-weight: bold; font-style: italic;">"ora-04030: out of process memory when trying to allocate %s bytes (%s,%s)"</span><br /><br />Now, what would ORA-4030 errors have to do with swap sizing?<br /><br />Now, my perspective is limited to DataWarehouse systems, so this would probably be relevant to such environments only. This is also on the Solaris platform, though a similar analogy could work on Linux too.<br /><br />In DataWarehousing, PGA sizes can get fairly large and overall PGA consumption can vary dramatically depending on the type and nature of extracts/queries and the degree of parallelism.<br /><br />I have seen PGA consumption as high as 2.5x PAT. <br /><br />Since PGA is anonymous memory (private to the process), it would need to have swap reservations in place. Or in other words, an equivalent amount of the PGA would be allocated in swap to serve as a backing store. This backing store would be used in case of actual physical memory shortages. If using DISM, then there would be the swap reservations in place for the SGA also. 
I have used the word reservations/allocations and for the sake of this discussion, they are the same.<br /><br />The important point to keep in mind is that regardless of whether you have actual memory shortages resulting in swap usage, swap allocation/reservation would always occur.<br /><br />Consider this scenario - (Assuming /tmp is empty)<br /><br />Operating System - Solaris<br />Memory on system = 64GB<br />Swap Device on Disk = 25GB<br /><br />Memory Requirements<br />OS + other apps = 2GB<br />SGA = 12GB (ISM and so no swap requirements)<br />PAT = 24GB<br /><br />If all your processes are consuming 24GB of aggregated PGA, then the swap backing store would require to be 24GB in size. Including the OS requirements, the total swap backing store would need to be 26GB in size.<br /><br />In Solaris, this swap reservation/allocation can be met by a combination of <span style="font-weight: bold;">free physical memory and/or physical swap device</span>. <br /><br />In this case, it is quite possible that the 26GB of swap backing store can be met entirely from free physical memory. If not, then sufficient space from the physical swap device (25GB in our case) would be used.<br /><br />Now what happens when we start exceeding 24GB? Let us look at peak PGA usage.<br /><br />Peak PGA usage (2X PAT) = 48GB<br />Total Peak Memory usage = 48GB + 12GB + 2GB = 62GB<br />Available free physical memory = 64GB - 62GB = 2GB<br /><br />48GB of PGA would need 48GB of swap reservation. 2GB of OS requirements would also require swap reservations for a total of 50GB of swap reservations.<br /><br />Since the peak total memory requirement is 62GB, only 2GB of free physical memory is available. The physical swap device is 25GB in size, thus making for only 27GB of possible available swap space. Obviously, the system cannot handle 48GB of PGA consumption. Swap is short by 23GB (50GB - 27GB).<br /><br />Even though there is still 2GB free memory, you will definitely encounter ORA-4030 errors. Along with the 4030 errors, you would also see in the /var/adm/messages file<br /><br /><span style="font-size:85%;"><span style="font-family: courier new;">messages:Jan 20 20:20:56 oradb genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9489 (oracle)</span><br /><span style="font-family: courier new;">messages:Jan 20 20:23:15 oradb genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9327 (oracle)</span><br /></span><br /><br /><br />So how much can the PGA grow with a 25GB Swap Device (in the above scenario)? Somewhere around 36GB would be about right.<br /><br />PGA = 36GB<br />SGA = 12GB<br />OS = 2GB<br /><br />Total consumed Physical memory = 50GB (36GB + 12GB + 2GB)<br />Available free physical memory = 14GB (64GB - 50GB)<br /><span style="font-weight: bold;">Available Swap = (64-50) + 25GB = 39GB</span><br /><br />So the recommendation would be to have physical swap devices of atleast 3X PAT (assuming your peak PGA utilization is 2.5X PAT). This way you would not run into ORA-4030 errors due to insufficient swap space.<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><span style="font-weight: bold;"></span>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1550840387946350137.post-87981681485667416262009-09-21T06:57:00.000-07:002009-09-21T07:03:27.722-07:00Download Link for the "Storage Design for Datawarehousing" presentationIt appears that scribd does not allow anonymous downloads any more. 
Hence on request, the "Storage Design for Datawarehousing" presentation can be downloaded in PDF format <a href="http://sites.google.com/site/docsfordownload/files/StorageDesignforDatawarehousingv8.pdf?attredirects=0">here.</a>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1550840387946350137.post-11654078574832322172009-07-20T14:19:00.001-07:002009-07-21T13:25:58.157-07:00Map Disk Block devices on a Linux host back to the SAN Lun<span style="font-family:Georgia;">There was a question posted in Oracle-L - How can one map Disk Devices (On Linux) back to the SAN LUNs? This is a no brainer on Solaris virtue of persistent binding capabilities on the HBA along with hard-coding the luns in sd.conf, but Linux is a different ball game.<br /><br /><span style="font-weight: bold;">Most of you would wonder – why do I care to know about this mapping? You need to - because as a DBA if you are seeing high latencies for a specific data file(s), you would need to know to which LUN , the block device </span><span style="font-weight: bold;">(containing the data file)</span><span style="font-weight: bold;"> ties back to on the array. SAN admins work with LUNs, not with system block devices.</span><br /><br />In Linux, the block devices are named starting with "sd" with little information as to<br /><br />1. Specific Adaptor/Port(HBA) through which it is mapped.<br /><br />2. Array on which the LUN is carved out.<br /><br />3. The specific LUN number on the Array<br /><br />While it is not impossible to find out the above information, there is no standard interface which can help you determine the above information (I may be mistaken - there may be a tool out there which I am not aware of).<br /><br />On Redhat 4.x and higher versions using the Kernel 2.6.x and higher, the sysfs implementation is standard. The /sys filesystem holds the key to identifying the above information.<br /><br />Unfortunately, the /sys filesystem is not well documented and is full of<span style="font-weight: bold; font-style: italic;"> shortcuts</span> pointing back<br />and forth. In short it is a maze. But it is Linux and so we are not surprised. So we have to devise our own methods to overcome these shortcomings.<br /><br />Let us use an example to illustrate how it can be done. 
Let us assume that our database is showing high latencies for datafiles residing on the block device <span style="color: rgb(0, 0, 0);"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">sdf.</span><br /><br />First you need to identify how many HBAs you have on the system and how many ports/HBA<br />(HBAs come in single port or dual port configurations).<br /><br />You can identify this in multiple ways - I prefer to use lspci<br /></span></span><br /><span style=";font-family:Courier New;font-size:85%;" >$ /sbin/lspci |grep Fibre<br /><br /><br />0f:00.0 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02) <strong><em><span style="color: rgb(255, 0, 0);font-family:Georgia;font-size:100%;" >-------------- > HBA 1 Port 1</span></em></strong><br /></span><span style="font-family:Courier New;"><span style="color: rgb(0, 0, 0);"><br /><span style="font-size:85%;">0f:00.1 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)</span></span></span><span style="color: rgb(0, 0, 0);"><span style="font-family:Georgia;"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;font-size:85%;" >-------------- > HBA 1 Port 2</span><br /></span><br /><span style=";font-family:Courier New;font-size:85%;" >11:00.0 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)</span> </span><span style="color: rgb(0, 0, 0);"><span style="font-family:Georgia;"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;font-size:85%;" >-------------- > HBA 2 Port 1</span><br /></span><br /><span style=";font-family:Courier New;font-size:85%;" >11:00.1 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)</span> </span><span style="color: rgb(0, 0, 0);"><span style="font-family:Georgia;"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;font-size:85%;" >-------------- > HBA 2 Port 2</span><br /></span><br /></span><span style="font-family:Georgia;">As you can see, it shows that there are 2 Dual Port Emulex HBAs installed in the system. LUNs<br />can be visible to Linux through any or all of the 4 available ports.<br /><br />So the block device <span style="color: rgb(0, 0, 0);"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">sdf</span> as seen by the OS is visible via one of these 4 available ports (paths). Next, we need to identify the specific port through which this LUN <span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">sdf</span> is visible to the OS.<br /><br />For that, we change directories to /sys/block and run a find.<br /></span></span><br /><span style=";font-family:Courier New;font-size:85%;" >$ cd /sys/block<br /><br /></span><span style=";font-family:Courier New;font-size:85%;" >$ find . 
-name device -exec ls -l {} \; |grep sdf<br /><br />lrwxrwxrwx 1 root root 0 Jul 21 17:25 ./sdf/device -> ../../devices/pci0000:00/0000:00:04.0/<span style="font-weight: bold;"><u>0000:0f:00.0</u></span>/host1/target1:0:0/1:0:0:4</span><br /><br /><span style="font-family:Georgia;">or another way is to</span><br /><br /><span style=";font-family:Courier New;font-size:85%;" >$ cd /sys/block/sdf<br /><br />$ ls -l device<br />lrwxrwxrwx 1 root root 0 Jul 21 17:25 device -><br />../../devices/pci0000:00/0000:00:04.0/<span style="font-weight: bold;"><u>0000:0f:00.0</u></span>/host1/target1:0:0/1:0:0:4<br /></span><br /><span style="font-family:Georgia;">As you can see from the above, we can identify that sdf </span><span style="font-family:Georgia;"><span style="color: rgb(0, 0, 0);"><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;"> </span><span style="color: rgb(255, 0, 0);"><span style="color: rgb(51, 51, 51);">is visible to the OS via the HBA Port </span></span><span style="font-weight: bold;"><span style="font-style: italic;">0f:00.0</span> </span>to the OS. <span style="font-weight: bold;"><span style="font-style: italic;">0f:00.0 </span></span>as we have seen above is HBA 1 Port 1. This output has several more interesting tid-bits of information.</span> <br /> </span><br /><span style="font-size:85%;"><span style="font-family:Courier New;">lrwxrwxrwx 1 root root 0 Jul 21 17:25 device -> ../../devices/pci0000:00/0000:00:04.0/0000:0f:00.0/host1/<span style="font-weight: bold;"><u>target1:0:0/</u></span><span style="font-weight: bold;"><u>1:0:0:4</u></span></span><br /></span><br /><span style="color: rgb(51, 51, 51);font-family:georgia;font-size:100%;" >The target is <span style="font-weight: bold; color: rgb(255, 0, 0);">0</span> from </span><span style="color: rgb(51, 51, 51);font-family:georgia;font-size:100%;" ><span style="font-size:100%;"><span style="font-family:courier new;">1:0:</span><span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;font-family:courier new;" >0</span><span style="font-family:courier new;">:4</span></span> or from </span><span style="font-size:100%;"><span style="font-family:courier new;">target1:0:<span style="font-weight: bold; color: rgb(255, 0, 0);">0. </span></span></span><a-z><a-z><span style=";font-family:georgia;font-size:100%;" >We can also learn from the above output that this is LUN 4 from <span style="font-weight: bold;font-size:100%;" ><span style="font-family:courier new;">1:0:0:<span style="color: rgb(255, 0, 0); font-style: italic;">4.<br /></span></span></span></span><br /><span style=";font-family:georgia;font-size:100%;" >All the remains now is to identify the array which is presenting LUN 4 to the OS. This<br />information is available from the /proc interface. </span><br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">$ cd /proc/scsi/lpfc </span><br /><br /></span><span style=";font-family:courier new;font-size:85%;" >$ ls</span><span style="font-size:85%;"><br /></span><span style=";font-family:courier new;font-size:85%;" >1 2 3 4</span><span style="font-size:85%;"><br /></span><br /><span style=";font-family:georgia;font-size:100%;" >Here the numbers 1-4 correspond to the specific HBA ports 1 - 4. We have identified that LUN <span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">sdf </span>is on HBA 1 Port 1 and so we look into the contents of "1". 
We also know it is Target 0.</span><br /><br /><span style="font-family:Courier New;"><span style="font-size:85%;">$ more 1<br />lpfc0t01 DID 610413 WWPN 50:06:0e:80:00:c3:bd:40 WWNN 50:06:0e:80:00:c3:bd:40<br />lpfc0t02 DID 610a13 WWPN 50:06:0e:80:10:09:d0:02 WWNN 50:06:0e:80:10:09:d0:02<br /><span style="font-weight: bold; color: rgb(51, 51, 51);"><u>lpfc0t00 DID 612413 WWPN 50:06:0e:80:10:09:d0:07 WWNN 50:06:0e:80:10:09:d0:07</u><br /></span></span></span><br /><span style=";font-family:georgia;font-size:100%;" >We see that Target 0 is mapped to the WWPN ending in 07. This is the array WWPN, from which the Storage Admin can identify the specific array that is presenting LUN 4 <span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">(sdf)</span> to the host.<br /><br />So we have identified that the block device <span style="font-weight: bold; color: rgb(255, 0, 0); font-style: italic;">sdf</span> is<br /><br />1. Visible to the OS via HBA 1 Port 1.<br /><br />2. Target 0 for the HBA.<br /><br />3. LUN 4 as presented to the OS by the array.<br /><br />4. Presented by the array whose WWPN is <span style="font-family:courier new;font-size:85%;">50:06:0e:80:10:09:d0:07</span>.<br /><br /><span style="color: rgb(51, 51, 51);">With the above information, your Storage Admin will get a head start on fixing the problem for you.</span></span><br />
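If you find yourself doing this mapping often, the steps above are easy to script. Here is a rough sketch (assuming the same sysfs layout and Emulex/lpfc driver with its /proc/scsi/lpfc interface as shown above; the device name is only a default argument):<br /><br /><pre style="font-family:Courier New;font-size:85%;">
#!/bin/bash
# Map a block device to its HBA PCI address, SCSI target/LUN and, on lpfc,
# the array WWPNs visible on that HBA port. Assumes the sysfs and
# /proc/scsi/lpfc layout shown in the walkthrough above.
dev=${1:-sdf}

# Full sysfs path - contains the HBA port's PCI address and the H:C:T:L tuple.
devpath=$(readlink -f /sys/block/$dev/device)
echo "sysfs path : $devpath"

# PCI address of the HBA port (e.g. 0000:0f:00.0) - match it against lspci.
hba=$(echo "$devpath" | grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9]' | tail -1)
echo "HBA port   : $hba"

# host:channel:target:lun is the last component of the path.
hctl=$(basename "$devpath")
echo "H:C:T:L    : $hctl (target $(echo $hctl | cut -d: -f3), LUN $(echo $hctl | cut -d: -f4))"

# On Emulex lpfc, each /proc/scsi/lpfc/&lt;n&gt; file lists the target-to-WWPN mappings.
grep -H WWPN /proc/scsi/lpfc/* 2>/dev/null
</pre><br />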
Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1550840387946350137.post-49934908062451928282009-07-06T10:38:00.000-07:002009-07-06T11:43:06.617-07:00Storage Design for DatawarehousingWith all the talk about exadata and netezza going around, I gave a presentation on conventional storage design for Datawarehousing to our team. For a small to medium corporate DW, conventional storage arrays perform adequately well (if properly designed).<br /><br /><a title="View Storage Design for Datawarehousing on Scribd" href="http://www.scribd.com/doc/15824872/Storage-Design-for-Datawarehousing" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">Storage Design for Datawarehousing</a><br />In brief, the deck walks through profiling Datawarehouse read and write I/O, turning that profile into hard numbers for IOPS, IO size, latency and bandwidth, and then sizing the array and drives against those numbers - using a Hitachi AMS1000 and 146GB 15K RPM FC drives as the worked example. The punchline: a capacity-driven design would have stopped at 1 array and 165 drives, while meeting the performance requirements took at least 5 arrays and roughly 410 drives (450 with spares), with plain RAID1 LUNs on a Linux host surprisingly outperforming a host-striped RAID10 layout.<br />
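To make the capacity-versus-performance point concrete, here is the back-of-the-envelope arithmetic behind those numbers. Every input is an assumption lifted from the example requirements in the deck - adjust them for your own environment:<br /><br /><pre style="font-family:Courier New;font-size:85%;">
#!/bin/sh
# Rough spindle-count arithmetic: capacity-only versus performance-aware sizing.
# Assumptions (from the example requirements): 10TB usable, mirrored (RAID1/10),
# 146GB 15K RPM FC drives, and latency targets that hold only while roughly
# a third of each drive's surface is in use.
USABLE_GB=10240; DRIVE_GB=146; MIRROR=2; ACTIVE_PCT=33

awk -v u=$USABLE_GB -v d=$DRIVE_GB -v m=$MIRROR -v a=$ACTIVE_PCT 'BEGIN {
    cap  = int((u * m) / d) + 1                  # drives if you size for capacity alone
    perf = int((u * m) / (d * a / 100)) + 1      # drives if you also respect the ~33% rule
    printf("capacity-only sizing     : ~%d drives\n", cap)
    printf("performance-aware sizing : ~%d drives\n", perf)
}'
</pre>The second number lands in the same neighbourhood as the deck's conclusion of roughly 410 drives (450 with spares) - a long way from the 165 drives a purely capacity-driven design would have bought.<br />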
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-2751304307954857622009-03-08T15:09:00.000-07:002009-03-08T16:15:39.759-07:00Temporary Tablespaces on Oracle - going the extra mileDuring a recent performance debug session on a Linux 10.2.0.3 RAC cluster, I noticed that reads/writes to the temp tablespace were occurring at the rate of 240K/IOP. Even though direct path reads were happening at 1024K/IOP, IOP size to the temp tablespaces did not exceed 240K.<br /><br />I noticed that IO sizes varied between 48K and 240K. A 10046 trace also revealed that Oracle was issuing reads/writes at a maximum of 15 blocks and a minimum of 3 blocks. The db block size was 16K, and thus 15 blocks equals a 240K IOP size and 3 blocks equals 48K.<br /><br />At first, I thought this was a limitation on the raw volumes which we were using for the temporary tablespaces, but a test using dd to the raw volumes showed that we could go up to 4M/IOP without the IO request getting broken up (this was the limit set at the volume level).<br /><br />I opened a TAR with Oracle and it did not get me anywhere. I found it odd that Oracle would use 15-block sized IOPs - 15 is an odd number and not Oracle style. Just by sheer luck, I was going through the hidden parameters list and came across 2 parameters:<br /><br /><span style="font-size:100%;"><span style="font-family:courier new;">_smm_auto_max_io_size 240 Maximum IO size (in KB) used by sort/hash-join in auto mode <br /><br /></span><span style="font-family:courier new;">_smm_auto_min_io_size 48 Minimum IO size (in KB) used by sort/hash-join in auto mode </span></span><br /><br />I had found the problem - Oracle caps the IO size for hash-joins and sorts that spill to disk at 240KB, with a minimum IO size of 48K. It fit in perfectly with the behavior I noticed.<br /><br />Note that the parameter mentions "in auto mode" - so if your work area policy is set to auto, then these settings will take effect. The older settings - _hash_multiblock_io_count/_sort_multiblock_read_count - have no effect if using the auto work area policy and probably still control the hash/sort behaviour in manual mode.<br /><br />To me, 240K and 48K are on the lower end of a suitable size for an IOP. 
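For reference, this is roughly how we checked and then overrode the two settings on a test instance before touching anything real. A minimal sketch - underscore parameters are unsupported territory, so treat this as test-instance-only and involve Oracle Support before changing them anywhere that matters (values are in KB):<br /><br /><pre style="font-family:Courier New;font-size:85%;">
#!/bin/sh
# Check and (for the current session only) override the sort/hash-join IO
# size limits. Test instances only; values are in KB.
sqlplus -s "/ as sysdba" &lt;&lt;'EOF'
-- current values (as SYS; the x$ views are not visible to ordinary users)
select ksppinm name, ksppstvl value
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and ksppinm in ('_smm_auto_max_io_size', '_smm_auto_min_io_size');

-- raise the max/min for this session, then re-run the spill-heavy SQL
alter session set "_smm_auto_max_io_size" = 1024;
alter session set "_smm_auto_min_io_size" = 512;
EOF
</pre><br />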
I could understand why Oracle would keep it at such low values - IO to the temp tablespaces (if shared) is very much random, and with bigger IOP sizes, it can cause a slowdown of your operation. But if your temp tablespaces are well designed, then you could easily use bigger sized IOP requests.<br /><br />With bigger sized requests, you would reduce the number of waits. But there is a catch - you will end up spending more time on each wait. There is a sweet spot where the net result works out in your favour, and you have to find it by testing.<br /><br />Our Linux RAC cluster was using temporary tablespace groups (4 tablespaces) and had a well designed IO subsystem. Each temp tablespace had a single temp file (raw volume) of 62GB and there were 4 such temp tablespaces. Being a 4 node RAC cluster, we had configured 4 tablespaces assuming that each node would prefer a separate tablespace (node affinity).<br /><br />Each of the raw volumes was a Raid 1+0 configuration using 4*143GB, 15K RPM drives. The array was a Hitachi AMS1000 with 16GB of cache. The stripe width was 1M on the Raid 1+0 volume. There was a lot of space wasted, however, we were interested in performance and so space was not a concern. So ~250GB of Temp space was laid out across 16 drives (143GB, 15K RPM FC).<br /><br />In order to test the performance with different sized IOPS, I took one of our "Taste the Thunder" SQLs which does an insert of ~460 million rows after doing some large hash-joins on tables with a billion-plus rows. Normally this SQL takes around 27 minutes to complete. We do parallel DML and the insert is split across all 4 nodes.<br /><br />With the default settings of 240K/48K (click to enlarge):<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjV09nlqGpYN0QM7qqHe_a0gjPgYuheBrQcANB2w2A_Gmb56XLgJ-YLxeN2f0fvUtFb4X9Yce3MRGypVKzlaLEWzMvxBZaX09NLwxXykEhJYY_aABKaRWenQPr6XgeYOINOK7YCULPvru/s1600-h/run1.JPG"><img style="cursor: pointer; width: 400px; height: 112px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjV09nlqGpYN0QM7qqHe_a0gjPgYuheBrQcANB2w2A_Gmb56XLgJ-YLxeN2f0fvUtFb4X9Yce3MRGypVKzlaLEWzMvxBZaX09NLwxXykEhJYY_aABKaRWenQPr6XgeYOINOK7YCULPvru/s400/run1.JPG" alt="" id="BLOGGER_PHOTO_ID_5310953723590343730" border="0" /></a><br />Changing max/min to 1024K/512K (click to enlarge):<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0i8m_a-oxj_zPJIG53FGws7BPBmQfd-gzvVj-mRikkUwQsXjP_417Ll1g1utGlmVAnlfFLggXV9L4mlmrimA8x8x80kp0ROau3BjGNekp8WiA9jfkCHqO80pYNLLkzg6CNzmcgkLstDhh/s1600-h/run2.JPG"><img style="cursor: pointer; width: 400px; height: 112px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0i8m_a-oxj_zPJIG53FGws7BPBmQfd-gzvVj-mRikkUwQsXjP_417Ll1g1utGlmVAnlfFLggXV9L4mlmrimA8x8x80kp0ROau3BjGNekp8WiA9jfkCHqO80pYNLLkzg6CNzmcgkLstDhh/s400/run2.JPG" alt="" id="BLOGGER_PHOTO_ID_5310953978948098866" border="0" /></a><br />Changing max/min to 1024K/1024K (click to enlarge):<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8yCQwdz3j3PzWV3fmoUQBL4VFsTIh8HEpf65drcUfHy8aCXr_Q5QVtImPqXU0cMt3DUdYTJBSOSx9IHVOqVHvtkpmD-YO3E9cky-hIPq0DoYDpj8V2jBYQTTJNWacA9Jb2gB7rXVqpsmI/s1600-h/run3.JPG"><img style="cursor: pointer; width: 400px; height: 112px;"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8yCQwdz3j3PzWV3fmoUQBL4VFsTIh8HEpf65drcUfHy8aCXr_Q5QVtImPqXU0cMt3DUdYTJBSOSx9IHVOqVHvtkpmD-YO3E9cky-hIPq0DoYDpj8V2jBYQTTJNWacA9Jb2gB7rXVqpsmI/s400/run3.JPG" alt="" id="BLOGGER_PHOTO_ID_5310955186908166050" border="0" /></a><br />The below summarizes the findings:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE4gUUN5JNFEE_ofojeAlNL2Rt00lnNLh8I4wXdAwMSIcYyNfvGxMpEng6MLceRNVoz8zTQJGdbkRJjK57LWf0uV6uwBJ0ADGI6IvBqDDjJbQXX6OdwlofOlZgvTS2sst8lPvii-lABh5p/s1600-h/summart.bmp"><img style="cursor: pointer; width: 400px; height: 112px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE4gUUN5JNFEE_ofojeAlNL2Rt00lnNLh8I4wXdAwMSIcYyNfvGxMpEng6MLceRNVoz8zTQJGdbkRJjK57LWf0uV6uwBJ0ADGI6IvBqDDjJbQXX6OdwlofOlZgvTS2sst8lPvii-lABh5p/s400/summart.bmp" alt="" id="BLOGGER_PHOTO_ID_5310955749633230882" border="0" /></a><br />The best results where when we set the max/min to 1024K/512K. We had an improvement of 22% over the default.<br /><br />As expected, with increasing IOP sizes, the number of waits reduce, however the average wait time increases. The sweet spot in our case was with a max/min setting of 1024K/512K.<br /><br />With a max/min setting of 1024K/1024K, even though the reads were very fast, the write times increased exponentially - probably the array did not have enough cache to handle such large sized writes.<br /><br />As always, it is best to test and see how it affects performance.Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-1550840387946350137.post-18151337218024910562009-03-07T11:57:00.000-08:002009-03-07T15:37:50.585-08:00Linux, Oracle RAC and the Bonding conundrumI recently ran into a problem wherein after patching a 4 node Linux RAC cluster, 2 out of 4 instances would not startup. I could not start the instances manually either.<br /><br />Both querying and trying to restart the crs hung.<br /><br />On closer look, the crsd.log had entries like<br /><br />------------------------------------------------------<br /><span style="font-size:85%;"><span style="font-family:courier new;">2009-03-05 23:38:28.600: [ CRSRTI][2541399584]0CSS is not ready. Received status 3 from CSS. Waiting for good status .. </span><br /><br /><span style="font-family:courier new;">2009-03-05 23:38:29.814: [ COMMCRS][1084229984]clsc_connect: (0xb76f00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_tus1dwhdbssex04_CODSS))</span><br /><br /><span style="font-family:courier new;">2009-03-05 23:38:29.814: [ CSSCLNT][2541399584]clsssInitNative: connect failed, rc 9</span></span><br /><br />------------------------------------------------------<br />I checked the OCSS log and it had entries like the below:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"> -------------------------------------------------------</span><br /><span style="font-family:courier new;">CSSD]2009-03-05 23:40:01.566 [1273047392] >ERROR: clssgmSlaveCMSync: reconfig timeout on master 1</span><br /><span style="font-family:courier new;">[ CSSD]2009-03-05 23:40:01.566 [1273047392] >TRACE: clssgmReconfigThread: completed for reconfig(16), with status(0)</span><br /><span style="font-family:courier new;">[ CSSD]2009-03-05 23:44:31.610 [2538397376] >ERROR: clssgmStartNMMon: reconfig incarn 16 failed. 
Retrying.</span><br /><span style="font-family:courier new;">----------------------------------------------------------</span></span><br /><br />Searching Metalink showed no hits, and neither did Google.<br /><br />The CRS alert log showed that the voting disks were online, and the system showed that the interconnects were up. The cluster was set up to use network interfaces bonded together in Active-Passive mode (mode=1).<br /><br />I tried the usual methods - deleting the pipes in /var/tmp/.oracle and restarting the systems a couple of times - however it did not seem to fix the problem. Any attempts to restart crsd from /etc/init.d/init.crsd or crsctl failed. The commands would simply hang, and so the nodes had to be force restarted - either by reboot or by killing the crs/ocss daemons.<br /><br />Finally, I checked the active member of the bond interface for the cluster interconnect and found that on 2 of the nodes, the active interface was different from the other 2 nodes. You can identify the active interface by looking into the /proc/net/bonding/bond&lt;interface&gt; file.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">---------------------</span><br /><span style="font-family:courier new;">[oracodss@tus1dwhdbssex02] /apps/oracle/home/crs10203/log/tus1dwhdbssex02$ cat /proc/net/bonding/bond1</span><br /><span style="font-family:courier new;">Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)</span><br /><br /><span style="font-family:courier new;">Bonding Mode: fault-tolerance (active-backup)</span><br /><span style="font-family:courier new;">Primary Slave: None</span><br /><span style="font-family:courier new;">Currently Active Slave: eth5</span><br /><span style="font-family:courier new;">MII Status: up</span><br /><span style="font-family:courier new;">MII Polling Interval (ms): 100</span><br /><span style="font-family:courier new;">Up Delay (ms): 0</span><br /><span style="font-family:courier new;">Down Delay (ms): 0</span><br /><br /><span style="font-family:courier new;">Slave Interface: eth1</span><br /><span style="font-family:courier new;">MII Status: up</span><br /><span style="font-family:courier new;">Link Failure Count: 5</span><br /><span style="font-family:courier new;">Permanent HW addr: 00:1c:23:bf:6e:73</span><br /><br /><span style="font-family:courier new;">Slave Interface: eth5</span><br /><span style="font-family:courier new;">MII Status: up</span><br /><span style="font-family:courier new;">Link Failure Count: 5</span><br /><span style="font-family:courier new;">Permanent HW addr: 00:15:17:49:75:33</span><br /><span style="font-family:courier new;">-------------------------------------------------</span><br /></span><br />For the 2 nodes which were up, eth5 was the currently active slave, whereas on the 2 down nodes, eth1 was active.<br /><br />So I deleted all the pipes in /var/tmp/.oracle, rebooted the 2 down nodes and changed the active slave on these 2 nodes to eth5. 
The instances came back up immediately.<br /><br />You can change the active slave on the fly using ifenslave, or make the choice deterministic by hard-coding a primary interface when bringing up the bond interface.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># ifenslave -c bond1 eth5 </span><br /><br /><span style="font-family:courier new;">------modprobe.conf-----------</span><br /><span style="font-family:courier new;">install bond1 /sbin/modprobe bonding -o bond1 miimon=100 mode=1 primary=eth5</span><br /><span style="font-family:courier new;">--------------------------------</span><br /></span><br />Active-Passive mode is a common method of bonding in which there is no load-balancing, only failover (in case a link fails). In a typical HA environment, each interface on the bond is set up to be cabled to a different switch, which provides switch-level failover capabilities as well. In Active-Passive mode, the passive link does not even ARP, so you would not see any MAC addresses on the switch port for this interface. There is complete isolation between interface 1 and 2.<br /><br />Imagine a scene wherein Node A uses eth1 as Active and Node B uses eth5 as Active. Even though the nodes are cabled to the same pair of switches and the links are up/active, Node A will not be able to communicate with Node B - Node A is transmitting on the switch its eth1 is plugged into, Node B is listening on the switch its eth5 is plugged into, and since the two switches are isolated from each other for this network, the private interconnect is effectively partitioned even though every link shows as up.<br /><br />Here is the funny part - if you set up the cluster interconnect to point to a non-existent interface, I have seen Oracle use the public interface for cluster communications. You will see entries in the instance alert logs mentioning that the cluster interconnect is non-existent/incorrect and so the public interface is being used.<br /><br />However, if you set the cluster interconnect to an available and up interface and it is not able to reach the other nodes over it, Oracle does not try to use the public interface to check connectivity (in the scenario where the voting disks are online/available). Maybe this is a bug? Not surprising at all.<br /><br />One would assume that a failure of the interconnects while the voting disks are online/available should result in more meaningful error messages rather than cryptic entries that fill up the CRS logs. This is another kind of split brain condition and probably should be well documented.<br /> Maybe Oracle needs to learn from VCS or other HA products which have been in the market longer and are more stable.<br />
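A mismatch like this is also trivial to catch before it costs you a patching window. Something along the lines of the below makes the comparison across nodes obvious at a glance (a rough sketch - the node names and bond device are placeholders for whatever your cluster uses):<br /><br /><pre style="font-family:Courier New;font-size:85%;">
#!/bin/bash
# Print the currently active slave of the interconnect bond on every node,
# so a mismatch (as in the incident above) stands out immediately.
# Node names and the bond device below are placeholders.
BOND=bond1
NODES="node1 node2 node3 node4"

for n in $NODES; do
    echo "== $n =="
    ssh "$n" "grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/$BOND" \
        || echo "   could not read /proc/net/bonding/$BOND"
done
</pre><br />If the output differs between nodes, either fail the odd ones over with ifenslave -c or pin a primary interface in the bonding options as shown above.<br />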
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-461860237422094642009-02-28T20:55:00.000-08:002009-03-09T12:33:13.850-07:00UltraSparc T1 versus Ultrasparc IV+There was an interesting thread in <a href="http://www.freelists.org/post/oracle-l/Oracle-Performance-on-Sunfire-T2000,1">orafaq</a> about poor Oracle performance on a T2000/T5240 versus an older V440. While I did test the T2000 when it was first released (2005?) for one of our Datawarehouse Oracle databases and found it magnitudes slower than the UltraSparc IV+, I never really ventured to find out why it was slow.
While it is well known that the Tx series chips are not designed for long running single threaded processes, I wanted to see why exactly it was not performing as well as other CPUs.<br /><br /><span style="font-weight: bold;font-size:130%;" >The Specs</span><br /><br />The differences in CPU are plenty, but the below would probably be of the most interest:<br /><br /><span style="font-weight: bold;">0. Cores</span> - The UltraSparc IV+ CPU has 2 UltraSparc III CPUs bolted together as a dual core CPU. The T2000 has 8 cores put together in a single CPU. The 8 cores each support 4 HW threads, with only 1 thread/core running at a time.<br /><br />The OS sees each of the HW threads as individual CPUs and schedules processes (LWP) on them. Internally, the T2000 cores switch between the HW threads every clock cycle, and also whenever the running HW thread stalls due to memory latency. To compare with a conventional CPU, it would be fair to say that 8 cores = 8 CPUs (and not 32 CPUs), since only 1 HW thread can be active on a core at a given point in time.<br /><br /><span style="font-weight: bold;">1. Core Speeds</span> - UltraSparc IV+ is at 1.5GHz whereas the T1 is at 1.2GHz.<br /><br /><span style="font-weight: bold;">2. Pipeline</span> - An UltraSparc IV+ core has a 14-stage pipeline and is 4-way superscalar, whereas the T1 core has a 6-stage pipeline and a scalar design. The T1 core supports 4 threads, but executes only 1 thread at a time. It switches threads every cycle (as long as there is more than 1 thread to run) or if a thread is stalled.<br /><br /><span style="font-weight: bold;">3. Cache</span> -<ul><li>L1 Cache/core - 64K/64K (I/D) on UltraSparc IV+ whereas 16K/8K (I/D) on the T1.</li><li>L2 Cache/CPU - 2M shared between 2 cores on UltraSparc IV+ whereas 4M shared between 8 cores (32 threads) on T1.</li><li>L3 Cache/CPU - 32M on UltraSparc IV+ versus none on T1.<br /></li></ul><span style="font-weight: bold;font-size:130%;" >The test</span><br /><br />I did a simple word count (wc -l) of a 2GB file on both the T2000 (1 CPU @ 1.2GHz with 8 cores, 32 threads) and a V490 (4 CPUs @ 1.5GHz with 2 cores each). I also tested on a V240 (2 UltraSparc III CPUs at 1GHz). The UltraSparc IV+ is really 2 UltraSparc III CPUs bolted together, so both UltraSparc IV+ and III have the same pipelines, but with differences in the cache levels.<br /><br />In order to level the playing field, I did the following.<br /><br />0. Created processor sets on all 3 systems.<br /><ul><li>T2000 - Set 1 with 1 HW thread from 1 core. Set 2 with the remaining HW threads from the same core.<br /></li><li>V490 - Set 1 with 1 core from 1 CPU. Set 2 with the 2nd core on the same CPU.</li><li>V240 - Set 1 with 1 CPU.<br /></li></ul>1. Bound the test process to processor Set 1 on all the systems. I did this to eliminate/reduce interprocessor cross calls & thread migrations. The intention was to ensure the L1/L2 cache on the core served only this thread, and to reduce cache misses.<br /><br />2. By creating Processor Set 2, Solaris will not run any other process on this set unless specifically instructed to do so. This is to ensure that this core on the T2000 / CPU on the V490 is completely dedicated to servicing the test process.<br /><br />3. Changed the process to FX priority and increased the time quantum to 1000ms. This is to eliminate involuntary context switching and make sure the process gets sufficient time to complete its activities.<br /><br />4. Disabled interrupts on processor Sets 1 and 2 on all systems. 
This is to prevent the bound core/CPU from processing interrupts during the run.<br /><br /><span style="font-weight: bold;font-size:130%;" >Monitoring</span><br /><br />I captured stats using the below:<br /><br />1. mpstat - To capture CPU stats.<br /><br />2. cputrack - To capture instruction counts, CPU cycles (ticks) and L2 D-cache misses.<br /><ul><li>For T2000 - cputrack -evf -t -T1 -c pic0=L2_dmiss_ld,sys,pic1=Instr_cnt,sys -p &lt;pid&gt;<br /></li><li>For V490 - cputrack -evf -t -T1 -c pic0=L2_rd_miss,sys,pic1=Instr_cnt,sys -p &lt;pid&gt;<br /></li><li>For V240 - cputrack -evf -t -T1 -c pic0=EC_rd_miss,sys,pic1=Instr_cnt,sys -p &lt;pid&gt;<br /></li></ul>3. ptime - To capture the time taken to do the word count (wc -l). The command was<br /><ul><li>ptime wc -l test_file</li></ul><br /><span style="font-weight: bold;font-size:130%;" >The results</span><br /><br />As expected, the UltraSparc IV+ CPU was 3.5 times faster than the T1. The UltraSparc III CPU was 2 times faster than the T1.<br /><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbq4PplkXP6oVUImUvlnr6J8hhyNsxRMkPYnXBnHmeLvvA2Ujblm7qlJuqe14_h4EIHdeoWx02-H2J4putejlXYRJt6s0-plJuKi1VWatQ5onxkuDfqUdp543Tzr-vC2CDLtEkzeITtfal/s1600-h/CPU-compare.GIF"><img style="cursor: pointer; width: 400px; height: 60px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbq4PplkXP6oVUImUvlnr6J8hhyNsxRMkPYnXBnHmeLvvA2Ujblm7qlJuqe14_h4EIHdeoWx02-H2J4putejlXYRJt6s0-plJuKi1VWatQ5onxkuDfqUdp543Tzr-vC2CDLtEkzeITtfal/s400/CPU-compare.GIF" alt="" id="BLOGGER_PHOTO_ID_5308388241589166786" border="0" /></a><br /><br />Notice the instructions processed/cycle - when all other things are equal, the instructions processed per CPU cycle determine how fast your application will run.<br /><br />In the tests, the T2000 lags considerably behind the other systems. This could be due to the fact that it has only a 6-stage scalar pipeline, whereas the others are 14-stage and superscalar.<br /><br />Ideally, a CPU would process instructions at the rate of 1 instruction/cycle (assuming there is no stall due to memory latency). In reality, this does not happen, and CPU designers get around it with various tricks such as deep pipelines, superscalar architectures, big caches etc.<br /><br />Surprisingly, even though the T1 chip has a smaller L2 cache than the UltraSparc III CPU, it has fewer cache misses/cycle. This is because it is processing fewer instructions/second - when you look at the L2 D misses per instruction, it becomes evident that this is indeed the reason.<br /><br />For single threaded, long running processes, a CPU with deep pipelines and a superscalar architecture will be the best fit. For heavily multithreaded processes, wherein a single thread does a small amount of work and exits, the T2000 will scale better than the UltraSparc IV+.<br /><br />In either case, for moderate to medium workloads with any application where latency is important, the UltraSparc IV+ or Intel/AMD CPUs will be the best fit. As you start increasing the workload and expect scalability to be sustained, the T2000 will provide more linear scalability than the UltraSparc IV+ platform.<br /><br />
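If you want to repeat this kind of comparison, the setup above can be scripted. Below is a rough sketch of such a harness (Solaris, run as root; the processor IDs, counter names and file name are placeholders - the counters differ per CPU as the cputrack examples above show, and the processor set IDs printed by psrset are assumed to be 1 and 2):<br /><br /><pre style="font-family:Courier New;font-size:85%;">
#!/bin/sh
# Sketch of the test setup: isolate one core/thread in a processor set,
# fence off the rest of the core, pin the shell there, bump it to the FX
# class with a long quantum, then time the workload.
# Assumptions: run as root, psrset reports the new sets as 1 and 2,
# processor IDs 8-11 belong to one core, test_file is the 2GB input.
TESTFILE=test_file

psrset -c 8              # set 1: a single HW thread (or core)
psrset -c 9 10 11        # set 2: rest of the core, deliberately left idle
psrset -f 1              # stop both sets from handling interrupts
psrset -f 2

psrset -b 1 $$                                   # bind this shell (and children) to set 1
priocntl -s -c FX -m 60 -p 60 -t 1000 -i pid $$  # FX class, 1000ms time quantum

# attach cputrack from another window once wc starts, e.g.
#   cputrack -evf -t -T1 -c pic0=L2_dmiss_ld,sys,pic1=Instr_cnt,sys -p &lt;pid&gt;
ptime wc -l $TESTFILE
</pre><br />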
A T2000 is like a truck - it can carry light, heavy or super heavy loads at 60MPH. Whatever the load, it will only ever run at 60MPH.<br /><br />The UltraSparc IV+ and similar CPUs are like a Porsche or a BMW - with no load or light to medium loads they run at speeds well above 60MPH, but as the load increases the speed drops and can ultimately hit 0 MPH (quite evident with x86).<br /><br />I have used the T2000 for Tibco Business Works with great success. It fits the bill perfectly for that application. However, for datawarehousing, it would be a very poor fit.Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-1550840387946350137.post-11354409636739236042008-12-30T10:34:00.000-08:002009-01-29T20:34:14.874-08:00Host/Lun Queue Depths, Command Tag Queuing from an oracle perspective<p>There was an interesting discussion on LinkedIn as to how a performance/capacity planning engineer can be of value, especially in an economy which is trending downward.</p><p>There was a comment that today's infrastructures are well below capacity and under-utilized. This, unfortunately, is the truth in most environments I have seen. Most environments do not even do any kind of pro-active performance analysis/capacity planning. </p><p>Without any kind of definitive proof, if there is a performance issue, the first comment one can expect to hear is that the hardware needs to be upgraded. And if in datawarehousing, then the storage array is generally the first in line. </p><p>This is because some of the most common waits (apart from CPU Time) from a datawarehousing perspective would be sequential/scattered/direct-path reads, direct path writes and, for logging operations, log file parallel writes.</p><p>In my experience, most I/O issues are configuration related rather than an underperforming array. One needs to look at the entire I/O subsystem and optimize it for Oracle requirements.</p><p>A complete storage subsystem consists of all the layers starting with an Oracle read/write call - </p><p><span style="font-size:85%;"> Oracle Read/Write call</span><span style="font-size:85%;"> -</span><span style="font-size:85%;">-> Filesystem --> Volume Manager --> Multipathing --> scsi driver</span><span style="font-size:85%;">--> HBA </span><span style="font-size:85%;">---> Array Cache on controller --> Processed by Array controller --> Lun</span></p><p>One aspect of storage that is often misunderstood is the Queue Depth and how it can impact Async I/O. </p><p>To start with, to refresh memories, </p><p><strong><span style="color:#cc0000;"><span style="font-size:130%;">Async I/O</span></span></strong> - From the Oracle 10g performance tuning guide</p><p><em>"With synchronous I/O, when an I/O request is submitted to the operating system, the writing process blocks until the write is confirmed as complete. It can then continue processing. </em></p><p><em>With asynchronous I/O, processing continues while the I/O request is submitted and processed. Use asynchronous I/O when possible to avoid bottlenecks."</em></p><p>From the 10g reference guide -</p><p><em>"Parallel server processes can overlap I/O requests (Async I/O) with CPU processing during table scans."<br /></em></p><p>In the simplest of terms, Async I/O is non-blocking I/O. The session can submit I/O requests and wait for confirmation that the request has been received. Once acknowledged, it can continue on with other activities such as CPU processing. </p><p>Async I/O is enabled by using the ODM interface or by setting FILESYSTEMIO_OPTIONS to SETALL. SETALL enables both Async and Direct I/O. 
Direct I/O bypasses the filesystem buffer cache when doing reads/writes. </p><p>I have no experience with using filesystemio_options as all the instances I have been working with have used VRTS ODM. In order to see if you are indeed using ODM interface, a quick and simple check would to verify ODM stats in /dev/odm/stats. I would assume that if using filesystemio_options, a truss or strace would reveal aio_read/write calls.</p><p><strong><span style="color:#cc0000;"><span style="font-size:130%;">Queue Depth</span></span></strong> - <em>It is the number of outstanding I/O requests to a device for which a response has not been received from the device. There are 2 parts to queue depth - host queue depth and lun queue depth.</em></p><ul><li><em>Host queue depth - Number of transactions waiting to be submitted to a device. </em></li><li><em>Lun queue depth - Number of transactions submitted to and being processed by the device.</em></li></ul><p>Queue depth = host queue depth + lun queue depth</p><p>Asynchronous I/O requests are issued by the processes and this in turn is submitted to the lun. Depending on how lun queue depth has been configured, the requests are split into the host queue and lun queue. </p><p>For e.g. as shown in below figure - </p><p>We have 4 parallel processes issuing reads to the same lun. The lun queue depth has been set to 2. This means that out of the 4 reads, 2 reads would be submitted to the lun immediately whereas the other 2 would be in the host queue. The 2 requests in the host queue are waiting to be moved to the lun queue as soon as the lun queue is freed up. In order to track the requests back to the requesting process, command tag queing is employed. </p><p><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Dd4LUM0teUra2HcJg8OcHwCSuDCWkAie0T8UZL6G8aXEkiuvtsvRlsXGneOyIYmyxQmOEIwl-1t5-zXoEp62Kjj6D3CfiyBH1Xa50xXclkrmXo0rKVRN6EKxTj3S_bFutzH6JMPEMY_o/s1600-h/Queue+depth.jpg"><img style="cursor:pointer; cursor:hand;width: 400px; height: 222px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Dd4LUM0teUra2HcJg8OcHwCSuDCWkAie0T8UZL6G8aXEkiuvtsvRlsXGneOyIYmyxQmOEIwl-1t5-zXoEp62Kjj6D3CfiyBH1Xa50xXclkrmXo0rKVRN6EKxTj3S_bFutzH6JMPEMY_o/s400/Queue+depth.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5286874064884094258" /></a><br /><br /><strong><span style="color:#cc0000;">Command Tag Queuing</span></strong> - Tagging a I/O request in the lun queue allows the kernel to associate the specific I/O request with the requestor. This in turn allows the SCSI device to disconnect from the host and process the submitted I/O requests. This allows for better bandwidth utilization on the PCI bus. </p><p>Command Tag Queuing also can specify where exactly in the queue you want the new IO request to be placed - at the tail end, head end or to be executed in a specific order.<br /></p><p>In the above figure, each of the 2 requests submitted to the lun is tagged so that it can be tied back to the original requestor (parallel process). At the array level, these 2 requests to the lun are sorted/merged (re-ordering) to ensure optimal head movement when submitting to the actual physical devices behind the lun. 
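<br /><br />A crude way to reproduce the picture above on a test lun is to fire a few concurrent readers at it and watch the queues build - just a sketch, the device name is made up and this should only be run against a lun you can afford to hammer:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># kick off 4 concurrent large-block readers against the same lun<br />for i in 1 2 3 4<br />do<br />&nbsp;&nbsp;dd if=/dev/rdsk/c1t10d0s2 of=/dev/null bs=1048576 count=1024 &<br />done<br /><br /># watch the host queue (wait) and lun queue (actv) while they run<br />iostat -xn 1 10<br />wait<br /></span></span><br />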
</p><p>You can see the lun queue depth and host queue depth using iostat.</p><p><span style="font-size:85%;"><span style="font-family:courier new;">mkrishna@oradb:> iostat -xnM |more<br /> extended device statistics <br /> r/s w/s Mr/s Mw/s <strong><span style="color:#ff0000;">wait actv</span></strong> wsvc_t asvc_t %w %b device<br /> 0.1 0.6 0.0 0.0 <strong><span style="color:#ff0000;">0.0 0.0</span></strong> 0.0 22.4 0 0 c1t10d0</span></span></p><p><span style="font-size:100%;"><strong>wait</strong> - average number of transactions waiting for service (queue length) or host queue depth.<br /> <br /><strong>actv</strong> - average number of transactions actively being serviced (removed from the queue but not yet completed) or lun queue depth.</span></p><p><span style="font-size:100%;">Also sar -d will show you the host queue.</span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">mkrishna@oradb:> sar -d |more<br /><br />SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 01/02/2009<br /><br />00:00:00 device %busy <strong><span style="color:#ff0000;">avque</span></strong> r+w/s blks/s <strong><span style="color:#ff0000;">avwait</span></strong> <strong><span style="color:#ff0000;">avserv</span></strong><br /><br />00:05:00 sd9 0 <strong><span style="color:#ff0000;">0.0</span></strong> 1 16 <strong><span style="color:#ff0000;">0.0 23.2</span></strong><br /></span></span></p><p><span style="font-size:100%;"><strong>avque</strong> - average number of requests outstanding during that time (host queue)</span></p><p><span style="font-size:100%;"><strong>avwait</strong> - the average time spent on the avque before it can be moved to the lun queue.</span></p><p><span style="font-size:100%;"><strong>avserv</strong> - the average service time in milliseconds for the device.</span></p><p><span style="font-size:100%;">From what I have observed (atleast on Solaris), sar seems to be more accurate in reporting the host queue than iostat. <br /></span></p><p><span style="font-size:100%;">Depending on the lun queue depth that has been configured, it is very much possible that many I/O requests are simply sitting in the host queue waiting to be moved into the lun queue so that they can be serviced. The wait column in iostat or the avque column in sar -d would give you the exact number of requests in the host queue.</span></p><strong><span style="color:#cc0000;">For optimal Async I/O, lun queue depths must be set high enough so that process I/O requests are not waiting in the host queue.</span></strong> It makes sense to push the host queue onto the lun queue because the array can act on these requests and do a sort/merge (as possible) rather than simply sitting in the host queue and doing nothing. Bigger lun queue depths means the array has more requests in the pipe-line which it can act upon aggresively to optimize head movement. The lun queue depth has significant impact on the throughput. <p>But too high lun queue depths and you will start seeing scsi reset error messages on the system. So you need to strike a balance between too high and too low.</p><p><span style="font-size:100%;">Coming back to the problem definition, tradionally Storage vendors and Unix sys admins recommend setting the lun queue depth to ridiculously low values. This is because storage vendors never disclose the total number of outstanding requests that can be serviced by their controllers. 
They take worst case scenarios (maximum hosts/port and all submitting requests at the same time) and make a rule that the maximum outstanding requests/lun can not exceed 8. </span></p><p><span style="font-size:100%;">This is the basis for the sd_max_throttle set to 8 on many sun systems restricting the lun queue depth to 8. The default for the sd_max_throttle is 256 (which should be the max ever set).</span><span style="font-size:100%;"><br /></span></p><p><span style="font-size:100%;">It makes more sense to restrict the queue depths at the HBA level rather than the sd level (keeping sd_max_throttle at the default). </span></p><p><span style="font-size:100%;">For e.g, the emulex lpfc drivers can be configured (lpfc.conf) to have per target and per lun restrictions on the queue depth. You can set both lun and target level queue depths. The values depend on the array, raid group configuration and lun breakdown, number of hosts/port etc. </span></p><p><span style="font-size:100%;">------------CUT--------------</span></p><p><span style="font-size:100%;"># lun-queue-depth [1 to 128] - The default value lpfc will use to<br /># limit the number of outstanding commands per FCP LUN. This value<br /># is global, affecting each LUN recognized by the driver, but may be<br /># overridden on a per-LUN basis (see below). RAID arrays may want<br /># to be configured using the per-LUN tunable throttles.<br />lun-queue-depth=30;<br /><br /># tgt-queue-depth [0 to 10240] - The default value lpfc will use to<br /># limit the number of outstanding commands per FCP target. This value<br /># is global, affecting each target recognized by the driver, but may be<br /># overridden on a per-target basis (see below). RAID arrays may want<br /># to be configured using the per-target tunable throttles. A value<br /># of 0 means don't throttle the target.<br />tgt-queue-depth=256;</span></p><p><span style="font-size:100%;">--------------CUT-------------<br /><br /></span></p><p><span style="font-size:100%;">Every environment is different and so the optimal queue depths would differ. One needs to test, monitor using iostat/sar and see which works best. For our datawarehousing environments, I normally set the lun queue depth to 30 and target queue depth to 256. With these settings, I have not seen many pending requests (5-10 during maximum load) in the host queue for our environments. Datawarehousing mostly consisting of lesser number of large I/O requests rather than OLTP environments, these values(30 and 256) are mildy conservative in nature. </span></p><p><span style="font-size:100%;">The arrays we use are Hitachi Modular storage (AMS1000/9585). Arrays are shared between the datawarehouse instances (not more than 3 instances/array) and each instance is assigned storage on multiple arrays. Hitachi Modular arrays are mid-range and really are not high on the specs (8/16GB Cache, 8 ports, 2 controllers (Active/Passive)). </span></p>Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1550840387946350137.post-65360590755161614102008-12-28T19:30:00.000-08:002008-12-30T17:25:02.944-08:00Analyzing the impact of the Vxfs filesystem block size on Oracle<p>I am usually asked as to what should be the ideal Vxfs filesystem block size for an Oracle DB block size of 16K. I always reply - 8K (maximum on Vxfs). </p><p>All along, my reasoning was that if using say 1K filesystem block size, then a 16K oracle block read would end up as 16/1K IO requests to the filesystem and the same for writes. 
With a filesystem block size of 8K, you would be reduced from 16/1K requests to 16/8K requests - or so I thought..<br /></p><p>I decided to test to see what exactly was happening and it proved that I was wrong – at least with respect to Vxfs.<br /></p><p><strong>Firstly some background about Vxfs –</strong><br /></p><p>Vxfs is an extent based filesystem – meaning it allocates space to files not as blocks, but as extents. Extents are contiguous set of filesystem blocks. Extent sizes vary and also the method of creation of a file greatly influences extent sizing. As a file grows, more extents are added to the file. </p><p>The interesting part about Vxfs and extents is that IO is never split across extents and a request for contiguous set of blocks within an extent is satisfied with a single request. If split across extents, then it will result in multiple IO requests – quite similar to how db file scattered read would split a request between oracle extents. From the Vxfs guide -</p><p><em>"By allocating disk space to files in extents, disk I/O to and from a file can be done in units of multiple blocks. This type of I/O can occur if storage is allocated in units of consecutive blocks. For sequential I/O, multiple block operations are considerably faster than block-at-a-time operations. Almost all disk drives accept I/O operations of multiple blocks."<br /><br /></em></p><p><strong>So coming back to Oracle – some test scenarios</strong><br /></p><p>I decided to test and see for myself.</p><p>The environment is Solaris 9 on a E4900 with Storage Foundation for Oracle Enterprise Edition. Oracle is 10.2.0.3 using VRTS ODM. <br /></p><p>I created 2 tablespaces – one on a 1K filesystem and the other on a 8K filesystem. Each had 1 datafile of size 5g. </p><p><br />Identical tables with ~1000 rows were created on both the tablespaces. Indexes were created on both tables on relevant columns.<br /><br /><strong>On a 1K Filesystem block size and a 16K DB Block size</strong><br /></p><p>First to confirm 1K block size</p><p><span style="font-family:courier new;"><span style="font-size:85%;">root@oracle:> fstyp -v /dev/vx/rdsk/oracledg/test1k</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;">vxfs</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;">magic a501fcf5 version 6 ctime Sat Dec 27 22:52:33 2008</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;">logstart 0 logend 0</span><span style="font-size:85%;"><br /></span></span><span style="font-family:courier new;"><span style="font-size:85%;"><strong><span style="color:#ff0000;">bsize 1024</span></strong></span></span><span style="font-family:courier new;"><span style="font-size:85%;"> size 15728640 dsize 0 ninode 15728640 nau 0</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;">defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;">aufirst 0 emap 0 imap 0 iextop 0 istart 0</span><span style="font-size:85%;"><br /></span><span style="font-size:85%;"><br /></span></span><span style="font-size:100%;">I initiated both sequential and scattered reads on the tables.</span><span style="font-size:100%;"><br /></span></p><p><span style="font-size:100%;">A vxtrace showed that oracle was issuing requests for 16K or bigger sized requests and they were single IOs. They were not broken up into smaller IO requests as one would have normally expected. 
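<br /><br />As an aside, the two test filesystems can be created with an explicit block size as sketched below - the mount points are made up, the volume names are the ones used in this test:<br /><br /><span style="font-family:courier new;"><span style="font-size:85%;"># 1K block size filesystem<br />mkfs -F vxfs -o bsize=1024 /dev/vx/rdsk/oracledg/test1k<br />mount -F vxfs /dev/vx/dsk/oracledg/test1k /u05/test1k<br /><br /># 8K block size filesystem<br />mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/edwrsdg/test8k<br />mount -F vxfs /dev/vx/dsk/edwrsdg/test8k /u05/test8k<br /><br /># confirm the block size<br />fstyp -v /dev/vx/rdsk/oracledg/test1k | grep bsize<br /></span></span><br /><br />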
I could not use truss because IO requests show up as ioctl calls when using ODM. There was no read I/O smaller than 32 blocks (16K) thus confirming that IOs are not split based on filesystem blocks.</span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">------------------------------------------</span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1254 START read vol test1k op 0 block 4326176 len 32 <strong><----- 16K Reads</strong></span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1254 END read vol test1k op 0 block 4326176 len 32 time 0</span><span style="font-family:courier new;"><br /></span></span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">--------CUT---------</span></span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">1260 START read vol test1k op 0 block 4326048 len 128 <------ </span></span><span style="font-size:85%;"><span style="font-family:courier new;"><strong>64K Reads</strong></span></span><span style="font-size:85%;"><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1260 END read vol test1k op 0 block 4326048 len 128 time 0</span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1261 START read vol test1k op 0 block 4326176 len 32</span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1261 END read vol test1k op 0 block 4326176 len 32 time 0</span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1262 START read vol test1k op 0 block 4325792 len 128</span><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;">1262 END read vol test1k op 0 block 4325792 len 128 time 0</span><span style="font-family:courier new;"><br /></span></span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">------------CUT------------------------------</span></span><span style="font-size:85%;"><span style="font-family:courier new;"><br /></span><span style="font-family:courier new;"><br /></span></span></p><p><strong>On a 8K Filesystem block size and a 16K DB Block size</strong></p><p><span style="font-size:100%;">To confirm the block size is indeed 8k</span><span style="font-size:85%;"><br /></span></p><p><span style="font-size:85%;"><span style="font-family:courier new;">root@oracle:> fstyp -v /dev/vx/rdsk/edwrsdg/test8k<br />vxfs<br />magic a501fcf5 version 6 ctime Sat Dec 27 22:52:47 2008<br />logstart 0 logend 0<br /><strong><span style="color:#ff0000;">bsize 8192</span></strong> size 655360 dsize 0 ninode 655360 nau 0<br />defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10<br />aufirst 0 emap 0 imap 0 iextop 0 istart 0<br /><br /></span></span><span style="font-size:100%;">I did the same set of reads as done for the 1k filesystem and it was the same.<br /></span></p><p><span style="font-family:courier new;"><span style="font-size:85%;">------------CUT-----------</span></span></p><p><span style="font-family:courier new;"><span style="font-size:85%;">1265 START read vol test1k op 0 block 4326048 len 128 <------ <strong>64K reads</strong><br />1265 END read vol test1k op 0 block 4326048 len 128 time 0<br />1266 START read vol test1k op 0 block 4326176 len 32 <--------- <strong>16K reads</strong><br />1266 END read vol test1k op 0 block 4326176 len 32 time 0<br />1267 START read vol test1k op 0 block 4325888 
len 32<br />1267 END read vol test1k op 0 block 4325888 len 32 time 0</span></span></p><p><span style="font-size:85%;">------------CUT----------------</span><span style="font-size:100%;"><br /><br />So the reads behave exactly like how it is documented. Oracle will do reads only in multiples of db block sizes. On either a 1K or 8k Vxfs block filesystem, a 16K or multiples of 16K reads would be sequential reads of contigous blocks and hence be satisfied from within a single IO request - as long as the IO request can be met from a single extent. </span></p><p><span style="font-size:100%;">So from an IO perspective, it really does not matter if using 1K or 8K. </span></p><p><span style="font-size:100%;">Now there is other aspect to this - file system overhead, fragmentation, extent sizing and space management. </span></p><p><span style="font-size:100%;">1K filesystem block size would reduce space wastage at a cost of having to manage a lot many blocks (filesytem overhead) whereas 8K filesystem block size would be ideal for an oracle instance using a DB block size of 8K or higher. </span></p><p><span style="font-size:100%;">F</span><span style="font-size:100%;">rom a filesystem management perspective, using 8K filesystem block size makes better sense as Oracle would not ever store data in a size less than the DB Block size. An 8K filesystem block size reduces the number of blocks and correspondingly the filesystem overhead in maintaining these blocks. I do not know if anyone uses a 4K DB Block size any more. All I have seen are 8K and higher. </span></p><p><span style="font-size:100%;">To reduce fragmentation, it is best if the datafile is using a single extent (as will be when created on a database using VRTS ODM). The extent here refers to the Vxfs Extents and not Tablespace extents. To maintain as a single Vxfs extent, datafiles should never be extended and always new datafiles should be added to increase tablespace capacity.</span></p><p><span style="font-size:100%;">You can find out the extents allocated to a file by running vxstorage_stats - it is an invaluable tool. Fragmentation status can be identified by running fsadm. Normally when using ODM, fragmentation should be minimal.<br /><br /></span></p>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1550840387946350137.post-67486333990125378592008-12-24T09:35:00.000-08:002009-01-06T13:29:27.005-08:00_realfree_heap_pagesize_hint - Assessing the impact on Solaris and LinuxThe _realfree_heap_pagesize_hint in 10g provides a mechanism by which the process private memory (PGA) can use bigger memory page sizes and thus reduce TLB/TSB misses. This parameter is set in bytes.<br /><br />This is especially important for Datawarehousing wherein a session can consume significant amount of anonymous memory and in many cases the workarea is bigger than the SGA.<br /><br /><p>I wrote about TLB/TSB misses from an oracle perspective in an earlier blog here.</p><p>http://dsstos.blogspot.com/2008/11/assessing-tlbtsb-misses-and-page-faults.html<br /></p> <br />This parameter is designed to work on the Solaris platform only, however it does work partially on Linux too and probably the same way on other platforms.<br /><br /><p>As per this hint, </p><ol><li>memory extents within the heap would be in <strong>_realfree_heap_pagesize_hint</strong> chunks.</li><li>And these chunks with the memcntl(2) call, be in <strong>_realfree_heap_pagesize_hint</strong> sized OS page (provided the pagesize is a valid choice).<br /></li></ol><p>For e.g. 
- An extent of 16MB would be carved upto into 4MB chunks and each 4M chunk would mapped to an individual 4M OS memory page (if the _realfree_heap_pagesize_hint = 4M).</p><p><span style="font-weight: bold;">Solaris:</span><br /></p><br />Solaris supports four page sizes on the UltraSparc IV+ platform (8K-default, 64K, 512K and 4M). The default setting for the _realfree_heap_pagesize_hint is 65536 or 64K.<br /><br />In order to test this parameter, I did a sort on a un-indexed table with approx 3.8 million rows. The avg row length was ~243 bytes and the table approx 1GB in size. The reason I selected such a big table was also to see how memory utilization changed when using different page sizes.<br /><br /><span style="font-weight: bold;">_realfree_heap_pagesize_hint at 65536 (Default)</span><br /><br />This implies that when a session requests anon memory, oracle will use 64K pages. However this did not seem to be true. With a setting of 65536, only 8K pages were used.<br /><br />I did a truss of the shadow process when doing the sort and this is what I observed.<br /><br />-----------CUT-------------------<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"><span style="font-weight: bold;">19167/1: 5.5795 mmap(0x00000000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, </span>3080192) = 0xFFFFFFFF7A5F0000</span><br /><span style="font-family:courier new;"><span style="font-weight: bold;">19167/1: 5.5796 mmap(0xFFFFFFFF7A5F0000, 65536, PROT_READ|PROT_WRITE,</span> MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A5F0000</span><br /><span style="font-family:courier new;">19167/1: 5.5813 mmap(0xFFFFFFFF7A600000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A600000</span><br /><span style="font-family:courier new;">19167/1: 5.5829 mmap(0xFFFFFFFF7A610000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A610000</span><br /><span style="font-family:courier new;">19167/1: 5.5846 mmap(0xFFFFFFFF7A620000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A620000</span><br /><span style="font-family:courier new;">19167/1: 5.5863 mmap(0xFFFFFFFF7A630000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A630000</span><br /></span><br />------------------CUT-------------------------------------<br /><br />As you can see, the extent of size 2M was requested with MAP_NORESERVE and then into 64K chunks. However there is no accompaning memcntl(2) request to ask the OS to allocate 64K pages for the chunks. 
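<br /><br />For reference, the shadow process trace was captured with something along the following lines (the flags here are illustrative; the PID is that of the dedicated server process for the test session):<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># attach to the shadow process and watch its memory calls<br />truss -d -p 19167 2>&1 | egrep 'mmap|memcntl|munmap'<br /></span></span><br />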
This is also confirmed when using pmap/trapstat.<br /><br />trapstat not showing usage of any 64K pages.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVZQB3LSbg5vI7f6vn6UHM67dAkqq1PQ4k4xIZ7Lbq7crfm4_8Ga9ejwEA-PK3QECAYpHdRA71sodWj1zkAqftOufnm1r8eQFln2ASgspyUq9mczL_TFGSDqJRzX-6oxUViE7mJ7oixg3h/s1600-h/trapstat-64K.jpg"><img style="cursor: pointer; width: 400px; height: 275px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVZQB3LSbg5vI7f6vn6UHM67dAkqq1PQ4k4xIZ7Lbq7crfm4_8Ga9ejwEA-PK3QECAYpHdRA71sodWj1zkAqftOufnm1r8eQFln2ASgspyUq9mczL_TFGSDqJRzX-6oxUViE7mJ7oixg3h/s400/trapstat-64K.jpg" alt="" id="BLOGGER_PHOTO_ID_5283431887127175906" border="0" /></a><br /><br /><br />pmap output showing anon pages using 8k page size.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLpg8oT4IWqV8I6hLdFSqDB_7be3VYIz1UWL8w816CjBXjTx8Llr1yiftGfKmhKFu0UJfKTu08nB7HhhE3IciNwDDeMVSg8Uk_P_hK8Zkvdyy5KsjyZ4yjaje45_FOYbuBeL6Zp-TUjHjO/s1600-h/pmap-64K.JPG"><img style="cursor: pointer; width: 400px; height: 343px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLpg8oT4IWqV8I6hLdFSqDB_7be3VYIz1UWL8w816CjBXjTx8Llr1yiftGfKmhKFu0UJfKTu08nB7HhhE3IciNwDDeMVSg8Uk_P_hK8Zkvdyy5KsjyZ4yjaje45_FOYbuBeL6Zp-TUjHjO/s400/pmap-64K.JPG" alt="" id="BLOGGER_PHOTO_ID_5283433455403115570" border="0" /></a><br /><br /><br /><span style="font-weight: bold;">Changing the _realfree_heap_pagesize_hint to 512K</span><br /><br />Changing the hint to 512K shows that it indeed requests 512K pages from the OS.<br /><br />------------CUT-------------<br /><br /><span style=";font-family:courier new;font-size:85%;"><span style="font-weight: bold;">19277/1: 14.6646 mmap(0x00000000, 4718592, PROT_NONE, </span>MAP_PRIVATE|MAP_NORESERVE, 8, 7864320) = 0xFFFFFFFF79780000<br />19277/1: 14.6647 munmap(0xFFFFFFFF79B80000, 524288) = 0<br /><span style="font-weight: bold;">19277/1: 14.6648 mmap(0xFFFFFFFF79780000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79780000</span><br /><span style="font-weight: bold;">19277/1: 14.6649 memcntl(0xFFFFFFFF79780000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EC0, 0, 0) = 0</span><br />19277/1: 14.6909 mmap(0xFFFFFFFF79800000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79800000<br />19277/1: 14.6910 memcntl(0xFFFFFFFF79800000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7F80, 0, 0) = 0<br /><br /></span>---------------CUT-----------------------<br /><br />As you can see, there is the memcntl(2) call being issued to request the OS to allocate 512K page size. 
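<br /><br />A quick way to see the page size actually backing the anon segments (rather than just the chunk size) is pmap with the -s flag, along these lines:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># the Pgsz column shows the hardware page size behind each anon mapping<br />pmap -xs 19277 | egrep -i 'pgsz|anon'<br /></span></span><br />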
This is also correlated by trapstat and pmap.<br /><br />trapstat output showing TLB/TSB misses for 512K pages.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCAlgRWGe8Ciet7fpQFdB2GA1se0qgNb1Z25qJv-k7g_tLxTXwxf5ksTkF4XyKotp3d1YgVuQh1WjT7hMBX3NlWcHyQl4zWXwYrPH6x6Ri-TpBEkG04mrgwo5vSGk2mzGAhncl3hSV4un9/s1600-h/trapstat-512K.JPG"><img style="cursor: pointer; width: 400px; height: 274px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCAlgRWGe8Ciet7fpQFdB2GA1se0qgNb1Z25qJv-k7g_tLxTXwxf5ksTkF4XyKotp3d1YgVuQh1WjT7hMBX3NlWcHyQl4zWXwYrPH6x6Ri-TpBEkG04mrgwo5vSGk2mzGAhncl3hSV4un9/s400/trapstat-512K.JPG" alt="" id="BLOGGER_PHOTO_ID_5283437001020682290" border="0" /></a><br /><br /><br />pmap output for anon pages showing 512K pages being used.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnQv5Q53SKkrCo6gpbznCUUVm8pJH5SxQyn_mE3ctYEciiDx9sGCTYfvFerXWc-buBxKBnd7rh_7tW59hz8SaphcFLcnRteaUwt-47BwrnGJrwtZJQ_g5R8P-H5wGlJCNsR3UcFPPVZbbv/s1600-h/pmap-512K.JPG"><img style="cursor: pointer; width: 400px; height: 243px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnQv5Q53SKkrCo6gpbznCUUVm8pJH5SxQyn_mE3ctYEciiDx9sGCTYfvFerXWc-buBxKBnd7rh_7tW59hz8SaphcFLcnRteaUwt-47BwrnGJrwtZJQ_g5R8P-H5wGlJCNsR3UcFPPVZbbv/s400/pmap-512K.JPG" alt="" id="BLOGGER_PHOTO_ID_5283437627750184482" border="0" /></a><br /><br /><br /><span style="font-weight: bold;">Changing the _realfree_heap_pagesize_hint to 4M</span><br /><br />Changing the hint to 4M also shows that the pagesize being requested is 4M.<br /><br />Truss output -<br /><br /><p>-------------------CUT-----------------------</p><p><span style="font-size:85%;"><span style="font-family:courier new;"><strong>18995/1: 34.0445 mmap(0x00000000, 20971520, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 390070272) = 0xFFFFFFFF53000000</strong><br />18995/1: 34.0447 munmap(0xFFFFFFFF54000000, 4194304) = 0<br /></span></span><strong><span style="font-size:85%;"><span style="font-family:courier new;">18995/1: 34.0448 mmap(0xFFFFFFFF53000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF53000000<br />18995/1: 34.0449 memcntl(0xFFFFFFFF53000000, 4194304, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EE0, 0, 0) = 0</span></span><br /><br /></strong>-----------------CUT-------------------------<br /></p><br />Trapstat output confirming usage of 4M pages for anon memory<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikqsf9orv4NKdQyvVBy001yrwwBzXFAlnPnHypesq3yrkBRLj84176DsP5FG3bhmP9-iC7ZhIbbL5mrtk_u0JZ1fjdUVukKoqbgUuBkMsEnOfSgugZM_OEsp1KAzkhla4hFadjyE12UnkQ/s1600-h/trapstat-4M.JPG"><img style="cursor: pointer; width: 400px; height: 271px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikqsf9orv4NKdQyvVBy001yrwwBzXFAlnPnHypesq3yrkBRLj84176DsP5FG3bhmP9-iC7ZhIbbL5mrtk_u0JZ1fjdUVukKoqbgUuBkMsEnOfSgugZM_OEsp1KAzkhla4hFadjyE12UnkQ/s400/trapstat-4M.JPG" alt="" id="BLOGGER_PHOTO_ID_5283439025914786130" border="0" /></a><br /><br />And finally pmap output.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghyphenhyphenbHG8Tvj4aI8eRrQzNbj4RHWKaoaInJ1FMNqiNFUtcKza-YTh1jjGsnOdYYFEn52aYgsZXRtYedu8uYx0DmVy1TFp1MZNZQLdjOEnlHqTuSV80WxmiOjBVNIVT0aSe5f4rCF-YesSbS5/s1600-h/pmap-4M.JPG"><img style="cursor: pointer; width: 400px; height: 243px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghyphenhyphenbHG8Tvj4aI8eRrQzNbj4RHWKaoaInJ1FMNqiNFUtcKza-YTh1jjGsnOdYYFEn52aYgsZXRtYedu8uYx0DmVy1TFp1MZNZQLdjOEnlHqTuSV80WxmiOjBVNIVT0aSe5f4rCF-YesSbS5/s400/pmap-4M.JPG" alt="" id="BLOGGER_PHOTO_ID_5283439475362301778" border="0" /></a><br /><br /><br /><span style="font-weight: bold;">So we know now that this does work as expected except for the default setting of 64K. So how does this affect performance?</span><br /><br /><ol style="font-weight: bold;"><li>By using bigger page sizes, we can store more virtual to physical entries in the TLB/TSB and reduce TLB/TSB misses.</li><li>Also if using bigger page sizes, it results in a reduction in the number of mmap requests thus reducing CPU spent on system time. For e.g - a 4M extent would require 512 mmap requests if using the default 8K pages, but only 1 mmap request if using a 4M page size.</li><li>So memory requests can be addressed significantly faster if using bigger page sizes.</li><li>However with bigger pages, one would expect that memory utilization will also go up. The basic denominator for memory requests being page sizes (8K, 512K or 4M), it is possible that there will be memory wastage.<br /></li></ol><br />In order to check for memory wastage, I checked the v$sql_workarea_active along with session pga/uga stats to identify how much memory was consumed with different page size setting. By sizing the PGA and using _smm_max_size appropriately, I ensured that the sort completes optimally and in memory without spilling to disk.<br /><br /><span style="font-weight: bold;">With the default setting of 64K</span><br /><br />Time taken to complete - 30-32 seconds<br />Workarea Memory used - 1085.010 MB<br />session pga memory - 1102.92 MB<br />session uga memory - 1102.3 MB<br /><br /><span style="font-weight: bold;">With 512K</span><br />Time taken to complete - 24-28 seconds<br />Workarea Memory used - 1085.010 MB<br />session pga memory - 1103.73 MB<br />session uga memory - 1102.2 MB<br /><br /><span style="font-weight: bold;">With 4M</span><br /><br />Time taken to complete - 24-27 seconds<br />Workarea Memory used - 1085.010 MB<br />session pga memory - 1112.2 MB<br />session uga memory - 1103.99 MB<br /><br />Looking at the above stats, for the same sort operation requiring 1GB of workarea, the PGA usage is a fraction higher (~1%) with bigger page sizes. This may impact very big sorts or when multiple sessions running simultaneously - especially when doing parallel operations, so there is always the chance that you may end up with ora-4030 errors if you do not configure your instance appropriately.<br /><br />Theoretically the timings should improve because of the lesser number of mmap operations and also reduced TLB/TSB misses. All in all, it probably makes sense to use this feature to enable bigger page sizes for Datawarehousing.<br /><br /><span style="font-weight: bold;">On Linux</span><br /><br />On Solaris the _realfree_heap_pagesize_hint works well since four different Page sizes (8K, 64K, 512K and 4M) are supported and can be allocated dynamically. However on Linux, only two page sizes are supported (4K and 2M). 
The 2M pagesize can be allocated only as huge-pages which is used for the SGA. Huge-pages cannot be used for private process memory.<br /><br />So in Linux, setting the _realfree_heap_pagesize_hint to bigger values only results in _realfree_heap_pagesize_hint sized chunks within extents, however not mapped to physical memory pages of the same size. Since this reduces the number of mmap requests and thus is better than the default.<br /><br /><span style="font-weight: bold;">With the default setting of 64K</span><br /><br />------------CUT-------------<br /><span style="font-size:85%;"><br /><span style="font-weight: bold;font-family:courier new;">mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0xf1) = 0xb70f1000</span><br /><span style="font-family:courier new;"><span style="font-weight: bold;">mmap2(0xb70f1000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = </span>0xb70f1000</span><br /><span style="font-family:courier new;">mmap2(0xb7101000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb7101000</span><br /></span><br />-------CUT-----------<br /><br />As you can see from above, 64K chunks are requested.<br /><br /><span style="font-weight: bold;">Changing to 4M<br /></span><br />----------CUT-----------<br /><span style="font-size:85%;"><span style="font-weight: bold;font-family:courier new;">mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x36f1) = 0xb2af1000</span><br /><span style="font-weight: bold;font-family:courier new;">mmap2(0xb2af1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2af1000</span><br /><span style="font-family:courier new;">mmap2(0xb2ef1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2ef1000</span><br /><span style="font-family:courier new;">mmap2(0xb32f1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb32f1000</span></span><br />---------CUT---------<br />As you can see from the above, with a setting of 4M, the chunks are 4M sized, however there is no request for a 4M page size as this is not feasible in Linux.<br /><span style="font-weight: bold;"><span style="font-weight: bold;"><br />Changing to 8M<br /><br /></span></span>I was curious to see how this would play out when changing to 8M.<br /><br />--------CUT-------------<br /><span style="font-size:85%;"><span style="font-family:courier new;">mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x5af1) = 0xb02f1000</span><br /><span style="font-family:courier new;">mmap2(0xb02f1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb02f1000</span><br /><span style="font-family:courier new;">mmap2(0xb0af1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb0af1000</span></span><br />--------CUT--------------<br /><br />The chunks are now 8M in size. 
I noticed the same behavior in Solaris too (minus memcntl to request an appropriate OS page size).<br /><br /><br /><span style="font-weight: bold;"><span style="font-weight: bold;"><br /></span><br /></span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-59879349186882312702008-11-06T11:23:00.000-08:002008-12-06T13:58:35.620-08:00Assessing TLB/TSB Misses and Page Faults<a title="View TLB Misses and Page Faultsv3 document on Scribd" href="http://www.scribd.com/doc/8701498/TLB-Misses-and-Page-Faultsv3" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">TLB Misses and Page Faultsv3</a> <object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_696101702210070" name="doc_696101702210070" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="600" width="600"> <param name="movie" value="http://documents.scribd.com/ScribdViewer.swf?document_id=8701498&access_key=key-1db40de797dq8bqcg1md&page=1&version=1&viewMode=list"> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <param name="mode" value="list"> <embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=8701498&access_key=key-1db40de797dq8bqcg1md&page=1&version=1&viewMode=list" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_696101702210070_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle" mode="list" height="600" width="600"></embed> </object> <div style="margin: 6px auto 3px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 12px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block;"> <a href="http://www.scribd.com/upload" style="text-decoration: underline;">Publish at Scribd</a> or <a href="http://www.scribd.com/browse" style="text-decoration: underline;">explore</a> others: </div><div style="display:none"> <br />TLB, TSB Misses and Page Faults I was trying to explain to my team mates about TLB misses and Page faults when I realized I was not 100% confident about it myself. I spent some time reading up Solaris Internals on this topic. I also wrote to our Sun contact and got first hand information from Sun. The below is a rather simplified description of TLB/TSB misses and Page faults. Basics Memory is divided into page sized chunks. Supported Page sizes depend on hardware platform and the Operating System. The current UltraSparc platform running Solaris 10 supports 8K, 64K, 512K and 4M pages. The cool threads servers (T2000 and newer versions) running Solaris 10 supports 256M page sizes also (512K is not supported). 
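The page sizes a given Solaris system supports can be listed directly, which is a quick sanity check before going further:

# default page size in bytes
pagesize
# every page size the platform supports
pagesize -a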
The Terminology - TLB, TSB , HPT, HME and TTE When a process requests memory, only virtual memory is allocated. Physical memory is not allocated yet. The first time a process requests access to a page within the allocated virtual memory, a page fault occurs. As a result, a physical page (from free lists) is then mapped to the virtual page of the process. This mapping is created by the virtual memory system in software and stored in the HPT (Hash Page Tables) in the form of HAT Mapping Entries (HME). Also a copy of the entry is inserted into the TLB and the TSB as Translation Table Entries (TTE). The TLB or Translation Lookaside Buffer is a cache of the most recently used virtual to physical memory mappings or Translation Table Entries (TTE) on the CPU. There are multiple TLBs on the CPU. There is the iTLB used to store entries for text/library and the dTLB used to store entries for data (heap/stack). The number of entries in either TLB is limited and dependent on the CPU. For example, on the UltraSparc IV+ CPU, there is the iTLB which can store 512 entries. There are 2 dTLBs, each of which can store 512 entries. Since the number of entries in the TLB is limited, there is a bigger cache of the TTEs in physical RAM called the TSB (Translation Storage Buffer). Each process has its own dedicated TSB. The default and maximum size (up to 1MB/user process) that a user process TSB can grow to, can be changed in Solaris 10. The TSB grows and shrinks as need be and each process has 2 TSBs – one for 8K, 64K and 512K pages and the other for 4M pages. The maximum memory that can be allocated to all the user TSB can also be specified. And finally an entry in the TSB requires 16 bytes. So it is easy to identify the size of the TSB to hold a specified number of entries. Page Faults The CPU first checks the TLB for the TTE and if not found (TLB Miss), checks the TSB. If not present in the TSB (TSB Miss), then it checks the HPT for the HME. If not present in the HPT, it results in a Page Fault. A Minor page fault happens when the HME is not present in the HPT, however the contents of the requested page are in physical memory. The mappings need to be re-established in the HPT and the TSB and TLB reloaded with the entries. A Major page fault happens when the HME is not present in the HPT and the contents of the requested page are paged out to the swap device. The requested page needs to be mapped back into a free page in physical memory and the contents copied from swap into the physical memory page. The entries are stored in the HPT and the TSB and TLB are reloaded again with the entries. Swap and Page in/Page out Each physical memory page has a backing store identified by a file and offset. Page outs occur when the physical page contents are migrated to the backing store and Page-in is the reverse. Anonymous memory (heap and stack) use swap as the backing store. For file caching, Solaris uses the file on disk itself as the backing store. Swap is a combination of the swap device (on disk) and free physical memory. Why and when do I need to worry about TLB/TSB misses and Page Faults? As RAM gets cheaper, it is common place to see entry level systems with 16GB of memory or more as a starting point. This is for both X-86 and proprietary Unix Systems. With more available physical memory, a DBA configures oracle with bigger SGA and PGA sizes to take advantage of the available physical memory. 
While the above discussion is focused entirely on the Sparc platform, the concept of pages, TLB and page tables is present for all systems. If using 8K pages (Solaris) and there is 16GB of memory, then one would require ~ 2 million mappings to address the entire physical memory. If using 4K pages (Linux), then the number of mappings would be ~4 million. For maximum efficiency, relevant entries must be accessible to the CPU with minimal delay – in TLB preferably or at worst in the TSB. However, we know the number of entries the TLB can hold is limited by hardware. The TSB for a single user process (in Solaris 10 only) can be grown to a max of 1MB (65,536 entries), so it is limited too. It would not make sense to search the HPT for every TLB/TSB miss as it costs CPU cycles to search the hash mappings for required entries. And we must avoid page faults as much as possible. From an oracle perspective, if CPU wait is one of your top waits and you have ruled out other issues such as available CPUs, CPU scheduling etc and you are seeing significant increase in page faults then it probably makes sense to look deeper into TLB/TSB misses. As always, it pays to work on improving an area which can potentially deliver the biggest impact to customer experience. From my experience, the impact of TLB/TSB misses on an oracle instance can be over emphasized (Solaris Platforms) at times. So you would be the best judge to identify if this requires further analysis. What do I need to measure? Okay, so we get the idea that more RAM and bigger memory working sizes means more mappings and it is not possible to cache all the entries in TLB/TSB. So it is inevitable that there are going to be TLB/TSB misses and possibly page faults. But how do I put a price to it? How costly is a miss? How much time is spent on servicing these misses? The answer lies in using trapstat to check the % of time spent by the CPU in servicing TLB/TSB misses. Unfortunately the tool does not give an estimate of the time spent on servicing major/minor faults. To identify the number of page faults, one uses vmstat or kstat. How do I measure and analyze the impact? Running trapstat –T will show the TLB/TSB miss with the appropriate page sizes. Trapstat needs to be run as root. As you can see below, it shows the %time spent in user mode (u) and kernel mode (k). It shows both TLB and TSB misses in a page size breakdown. cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim ----------+-------------------------------+-------------------------------+---0u 0u 8k| 64k| 0 0 0 0 0.0 0.0 0.0 0.0 0 0 0 0 0.0 | 0.0 | 0.0 | 0.0 | 1 0 0 0 0.0 0.0 0.0 0.0 0 0 0 0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0 u 512k| 0u 4m| -----+---------------+---------------+-0k 0k 8k| 64k| 0 0 0 0 0.0 0.0 0.0 0.0 0 0 0 0 0.0 | 0.0 | 0.0 | 0.0 | 146 0 0 0 0.0 0.0 0.0 0.0 3 0 0 0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0 k 512k| 0k 4m| ==========+===============================+===============================+==== ttl | 619 0.0 0 0.0 | 4137 0.0 300 0.0 | 0.0 The last line gives the overall statistics for all the CPUs. If you are seeing around 20% or more time (%tim) spent on servicing TLB/TSB misses, then it probably makes sense to revisit your page sizing for your instance. Page Faults can be observed through vmstat (minor), vmstat –s (major and minor) and kstat (major and minor). The stats from vmstat –s and kstat (reports/CPU) are cumulative in nature. 
mkrishna@OCPD:> vmstat 1 kthr rbw memory swap free re page mf disk faults in 559 517 sy 794 767 cpu cs 811 745 us sy id 7 7 1 92 1 92 pi po fr de sr s0 s1 s6 s9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41966192 34216352 0 5063 0 0 0 0 41925744 34175896 0 4995 0 mkrishna@OCPD:> vmstat -s 0 micro (hat) faults 2377024933 minor (as) faults 16504390 major faults mkrishna@OCPD:> kstat |egrep 'as_fault|maj_fault' as_fault maj_fault 142199182 984358 A dTSB miss results in a search for the entry in the HPT for the relevant HME. If not found in the HPT, then it results in a page fault. So perhaps a % of the time spent on dTSB miss can be assumed to be spent on servicing page faults (minor and major)? I do not know for sure and could not find out from Sun either. Since there will always be page faults when a virtual memory page is accessed for the first time, we cannot eliminate it completely. By definition, major page faults are bad, minor page faults are better than major page faults, but still need to be avoided. Ideally minor faults should be far greater than major faults. In well configured environments I have seen the ratio of major/minor faults to be < 0.5%. Major faults can occur when there is a memory shortage and heavy page out/swap outs. I have also seen a higher number of major faults when there is extensive file system data caching or double buffering happening on Oracle databases. How do I reduce TLB/TSB misses and Page Faults from an Oracle perspective? Theoretically, to reduce the incidence of TLB/TSB misses and page faults, one would use bigger sized pages to reduce the number of entries required to map a segment and use an optimally sized TSB to prevent TSB misses (TLB being fixed in size). This is assuming that you have configured the instance correctly to fit within the available physical memory. The below would be a practical way to implement it. 1. Reduce thread migrations (Harden affinity to CPUs) - Thread affinity will ensure a thread is executed on the same CPU as before. This will improve chances that the entries for the running thread are already present in the TLB on the CPU. Thread migrations can be seen using mpstat (migr column). Thread affinity is set as system parameter – rechoose_interval. The default value for rechoose_interval is 3. For a Datawarehouse system, I normally set it to 150. 2. Oracle Shared Memory - Oracle uses shared memory (SGA) and private anonymous memory (PGA). On Solaris, Oracle uses ISM for shared memory. ISM along with other benefits enables use of 4M pages and so already uses biggest possible page size on the UltraSparc IV+ platform running Solaris 10. Also for processes sharing the same segments, TSB is shared. So by default, when using ISM for the SGA, Oracle is already well optimized for minimal TLB/TSB misses. For the cool threads platform (Solaris 10), a mix of 256M and 4M Page sizes is used for ISM segments and so is even better optimized. 3. Oracle PGA - For your PGA or private memory, the page size setting is controlled by the parameter _realfree_heap_pagesize_hint (10g). The default value is set to 64K and so should use a 64K page size. However, it does not seem to be so. I have observed that when set at 64K, it uses 8K pages only. However setting it to 512K or 4M does indeed change the page size for PGA usage to 512K or 4M. 
Setting this parameter results in memory being allocated in realfree_heap_pagesize_hint sized chunks (64K, 512K, 4M) and so can potentially result in wastage of memory and starve other sessions/applications of physical memory. Setting this to 512K/4M also reduces your page faults considerably. 4. TSB Configuration - Increase the size of default startup TSB (Solaris 10) to prevent TSB misses. 1 entry in the TSB requires 16 bytes. So depending on your memory allocation to the SGA and PGA, you can set the default TSB size accordingly. Each process can have up to 2TSB with one of the TSB being dedicated to service 4M Page entries. There are several configuration parameters that can be set in the /etc/system. a. default_tsb_size – The default value is 0 (8KB). 8KB will hold 512 entries. For Oracle, you have to consider both PGA and SGA usage. Let us assume you have configured 12GB for your SGA (using ISM with 4M pages as default) and 6GB PGA (using 4M page size). 12GB of SGA would require 3072 entries or 48KB TSB. 6GB of PGA would result in global memory bound of ~700MB (serial operations 175 pages of 4M each) or ~2100 MB (parallel operations – 525 pages of 4M each). So for this case, a default_tsb_size of 8K would be too small and get resized frequently. A default size of 32KB (default_tsb_size = 2) which can then grow accordingly (to a max of 1M) would be preferable. The problem with having bigger default sizes is that it consumes physical memory, which is however capped by the tsb_alloc_hiwater_factor. b. tsb_alloc_hiwater_factor – Default is 32. This setting ensures that total TSB usage on the system for user processes does not exceed 1/32 of physical memory. So if you have 32GB of memory, then total TSB usage will be capped at 1GB. If you have lots of memory to spare and expect a high number of long lived sessions connecting to the instance, then this can be reduced. c. tsb_rss_factor – Default is 384. Value of tsb_rss_factor/512 is the threshold beyond which the tsb is resized. The default setting is 75% (384/512). It probably makes sense to reduce this to 308 so that at 60% utilization of the TSB, it will get resized. d. tsb_sectsb_threshold – In Solaris 10, each process can have up to 2TSB – one for 8K, 64K and 512K pages and one for 4M pages. This setting controls the number of 4M mappings the process must have before the second TSB for 4M pages is initialized. It varies by the CPU. For a UltraSparc IV, the default is 8 pages. 5. To reduce page faults from user sessions, change _realfree_heap_pagesize_hint from 64K to either 512K or 4M. Also use ODM or Direct i/o. Avoid file system buffering for oracle data files. 6. Also ensure that the memory requirements of oracle can be met entirely within the physical memory. 
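Pulling together the system-level settings from points 1 and 4 above, the corresponding /etc/system entries would look something like the sketch below. The values are only the illustrative ones discussed here (the hiwater factor of 16 is an assumed example) - they need testing against your own workload, and a reboot to take effect.

* point 1 - harden thread affinity to the CPU
set rechoose_interval=150
* point 4a - start each user TSB at 32KB instead of 8KB
set default_tsb_size=2
* point 4b - cap total TSB usage at 1/16 of physical memory (assumed value)
set tsb_alloc_hiwater_factor=16
* point 4c - resize the TSB at ~60% utilization instead of 75%
set tsb_rss_factor=308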
</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1550840387946350137.post-2801599357468505772008-09-30T12:29:00.001-07:002008-09-30T12:29:59.266-07:00Redo configuration for a High Transaction EnvironmentWe resolved a number of LGWR issues and the steps we took are detailed in the presentation.<br /><br /><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_608352153097130" name="doc_608352153097130" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="600" width="600"> <param name="movie" value="http://documents.scribd.com/ScribdViewer.swf?document_id=6320443&access_key=key-1n8e2l4oea8khi6a2lsf&page=1&version=1&viewMode=list"> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <param name="mode" value="list"> <embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=6320443&access_key=key-1n8e2l4oea8khi6a2lsf&page=1&version=1&viewMode=list" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_608352153097130_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle" mode="list" height="600" width="600"></embed> </object><div style="font-size:10px;text-align:center;width:600"><a href="http://www.scribd.com/doc/6320443/Redo-Configuration-in-a-Real-World-Environment">Redo Configuration in a Real World Environment</a> - <a href="http://www.scribd.com/upload">Upload a Document to Scribd</a></div><div style="display:none"> <br />Redo Configuration for Sustained and Scalable Performance in a Real World Environment Krishna Manoharan krishmanoh@gmail.com 1 Background During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted. Among other things, higher log related waits were noticed when conducting the analysis. A study was performed to reduce/eliminate log related waits. 2 The Environment Oracle 10.2.0.3 Enterprise Edition 64bit on Solaris 9 Database Size – 4 TB SunFire E4800 with 12 CPU’s, 48GB RAM and 2 IO Boards. Work Load – Numerous small transactions due to replication and large batch process through ETL. Operations in logging mode. Commits done for every row change (replication) to commits after million+ rows changes (ETL). 3 Work Load Profile (redo) of the instance (Peak Hours) # Session Session Session Session LGWR LGWR LGWR Stat commit wait requested commit wait performed redo entries generated redo size redo synch writes redo writes lgwr throughput Value 72/sec 72/sec 13326 entries/sec 9.45MB/sec 72 sync writes/sec 74 writes/sec 9.8 MB/sec 4 Objectives of the study Deliver predicted scalability by Reduce/Eliminate log related waits. Improve on log related statistics. Eliminate LGWR/ARCH as a bottleneck. Performance as measured by Improved transaction rates. Meeting pre-defined thresholds for waits and stats. Build a standard for an optimal and scalable redo/archive configuration. Performance tuning measure – not capacity planning. 
Limited to configuration only, no code changes. 5 Symptoms from Oracle Top log related wait statistics (Peak Hours) Existing (average wait time) Negligible Negligible 28.6 ms 11.8 ms 7.02 ms 20.82 ms 26.45 ms 0.38 ms Threshold (average wait time) Eliminate Eliminate Eliminate < 2ms < 10ms ? < 5 ms Eliminate Event latch: redo allocation latch: redo writing log buffer space log file parallel write log file sequential read log file switch completion log file sync log file switch (checkpoint incomplete) Waits/sec Negligible Negligible 3.5 waits/sec 74 waits/sec 10 waits/sec 0.5 waits/sec 72 waits/sec 0.5 waits/sec 6 Symptoms from Oracle – contd. Top log related instance statistics (During peak hours) # LGWR LGWR Session Session Session Statistic Existing Threshold redo sync response time 0.27 ms/sync write < 0.1 ms redo write response time 0.12ms/write < 0.1 ms/write redo buffer allocation retries 0.0002 retries/entry, 3.8 retries/second Eliminate redo log space requests 0.000081/entry, 1 request/second Eliminate redo log space wait time 0.12ms/wait for space < 0.1 ms/wait for space 7 Symptoms from the System – contd. Top system related statistics (During peak hours) # CPU CPU Parameter Average Run Queue Average Involuntary context switches for LGWR Before 1.2 30% 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) Threshold No Change Eliminate 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) Storage Average Redo Lun response time Storage Average File response time (redo logs) Storage Average Redo Volume Response Time 8 Existing Log Configuration (Instance) No _ parameters set. log_buffer – Default (Seen as 4MB) Redo log groups – 3 Size of members – 500M log_archive_max_processes – Default (seen as 2) Using VRTS ODM and Vxfs 9 Existing System/Storage Configuration Default scheduling class – TS and Default priorities for Oracle. Thread Affinity set to 150. Storage Foundation 4.1 MP2 for Oracle Enterprise Edition. Maxphys set to 8M (system and Vxvm) Lun Queue Depth – 30 with a max of 256/Target. All luns – Raid 1 using 72GB, 15K RPM FC Drives. Storage – Hitachi AMS1000 Dual Simultaneous Active (2Gbit) Paths to each lun. Load Balancing via vxdmp. 10 Existing Physical Log Configuration (Filesystem) Redo Members LOG01A.dbf - 500M (Primary of Group 1) LOG02A.dbf - 500M (Primary of Group 2) LOG03A.dbf - 500M (Primary of Group 3) Redo Members LOG01B.dbf - 500M (Mirror of Group 1) LOG02B.dbf - 500M (Mirror of Group 2) LOG03B.dbf - 500M (Mirror of Group 3) Filesystem 1 (vxfs) /u05/redo1 Filesystem 2 (vxfs) /u05/redo2 Raid 1 Single Lun of 66.4GB } 72GB, 15K RPM FC 72GB, 15K RPM FC 11 Log Waits - Schematic User Session Event Log File Parallel Write CPU Subsystem Event Log File Switch Completion Log Group 1 Redo Logs Log Group 2 Shared Log Buffer Private Redo Strands Archive Log Files SGA CPU Subsystem Event Log Archive IO ARCH Event Log File Sequential Read } LGWR Event Log File Sync Event Log Buffer Space IO Subsystem 12 Analysis of the symptoms LGWR related Wait - log file parallel write - “Writing redo records to the redo log files from the log buffer. The wait time is the time it takes for the I/Os to complete.” High average wait time (11.8 ms). Correlating Stats High Lun response time for the redo log filesystem. High redo sync time (0.27 ms/sync write). High redo write time (0.12 ms/write). Log redo sync writes (72 writes/sec). Higher buffer allocation retries (3.8 retries/sec). High degree of involuntary context switches for the LGWR process. 
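Figures of this kind can be pulled from v$system_event - a minimal sqlplus sketch; for interval rates you would snapshot twice and take the difference, or lean on statspack/AWR:

select event,
       total_waits,
       round(time_waited_micro/1000/nullif(total_waits,0),2) avg_wait_ms
from   v$system_event
where  event like 'log%'
order  by time_waited_micro desc;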
13 Analysis of the symptoms (contd.) Wait - log file parallel write (contd.) With a high degree of commits, it is more important to review the average response time/wait rather than the number of waits. Every commit write wait results in a increment to the log file parallel write event. The physical layout of the redo log filesystems show a single lun used for all the groups. Since redo log members are relatively small sized, it is common practice for the Storage/System Admin to assign a single lun which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR. A slower LGWR also results in poor commit performance (sync writes) as evidenced by correlating stats. A slower LGWR results in higher buffer allocation retries because the LGWR is unable to write the redo entries and flush the buffer to the disk fast enough to meet the session requirements. The overall run queue on the system was low, however involuntary context switching (~30%) indicated that LGWR was being switched out of the CPU before it could complete it’s task. The high lun response time for the redo log filesystem’s indicated that IO was a bottleneck. 14 Analysis of the symptoms (contd.) LGWR related Wait - log file switch completion – “Waiting for a log switch to complete” Wait - log file switch (checkpoint incomplete) – “Waiting for a log switch because the session cannot wrap into the next log. Wrapping cannot be performed because the checkpoint for that log has not completed” Large number of waits (0.5 waits/sec) with high average wait time (20.82 ms). Correlating Stats redo log space requests – 1 request/sec redo log space wait time – 0.12ms/entry 15 Analysis of the symptoms (contd.) Wait - log file switch completion (contd.) During a log file switch, redo generation is disabled. So this wait directly impacts session performance. The log members were only 500M in size and thus causing frequent log switches (every 1 minute). This will result in higher waits. The log_buffer is 4M in size and during a log switch, the log buffer is flushed to disk. If there is an IO bottleneck to the redo log files, then flushing 4M of log buffer could result in higher response times. Since the redo log groups were on the same set of Filesystems, there could possibly be a IO conflict between the checkpoint and LGWR processes when doing a log switch as shown in the wait log file switch (checkpoint incomplete). However a bigger log file can also cause slower log file switches. The impact of increasing the log member size needs to be studied with respect to the event – log file switch completion. 16 Analysis of the symptoms (contd.) ARCH related Wait - log file sequential read – “Waiting for the read from this logfile to return. This is used to read redo records from the log file – either for recovery or archival. The wait time is the time it takes to complete the physical I/O (read)” High number of waits (10 waits/sec) with high average wait time (7 ms). Correlating Stats High Lun response time for the redo log filesystem. Event – log file parallel write (high average wait time – 11.8 ms) 17 Analysis of the symptoms (contd.) Wait - log file sequential read (contd.) Small sized redo log members cause frequent log switches (1/minute). These logs need to be archived and thus indirectly impacts the event log file sequential read. Members of the redo groups were located on the same filesystems and share the same physical LUNS. 
This results in IO contention because the ARCH process is reading from the previous group as the LGWR is writing to the present group . This in turn impacts LGWR write performance thus resulting in increased response time for the events - log file parallel writes and log file sync waits. Poor archival performance can also indirectly impact log switches as reported in the event – log file switch (archival incomplete) and thus session performance. For 500 M log members, the average response time is on the higher side again indicating an IO contention. Since redo log members are relatively small sized, it is common practice for the Storage/System Admin to assign a single lun which is then used for all the redo filesystems. The nature of the access being sequential, this problem is multiplied in effect – especially if the lun is Raid 5. Increasing log file sizes can also cause this event to report higher wait times. 18 Analysis of the symptoms (contd.) Session related Wait - log buffer space – “Waiting for space in the log buffer because the session is writing data into the log buffer faster than LGWR can write it out” High number of log buffer space waits (3.5 waits/sec) with an average response time of 28.6 ms. Correlating Stats Event – log file parallel write (high average wait time) redo buffer allocation retries (3.8 retries/sec). 19 Analysis of the symptoms (contd.) Wait - log buffer space (contd.) This along with the high response time for the log file parallel writes wait shows a slow LGWR. The presence of higher redo log buffer allocation retries also correlate this wait. It also can mean that the default log buffer (4MB) is too small for the rate of redo generation (9.45 MB/sec). During a log switch, LGWR flushes the log_buffer to disk. So the impact of increasing the size of the log_buffer needs to be analyzed with respect to the event – log file switch completion. 20 Analysis of the symptoms (contd.) Session related Wait - log file sync – “When a user session commits, the session's redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished posting, it will post the user session. The wait time includes the writing of the log buffer and the post.” The average wait time was 26.45 ms. Correlating Stats Event – log file parallel write (high average wait time – 11.8 ms) High degree of involuntary context switches for both user session and LGWR. 21 Analysis of the symptoms (contd.) Wait - log file sync (contd.) Every commit write wait/immediate will result in an increment of the wait counter and a redo write (resulting in an increment to the log file parallel write wait counter). Rather than the number of waits, the average wait time is important for this wait event. Under ideal circumstances, the average wait time for a log file sync event should be the about the same as the average wait time for the wait – log file parallel write. If there is a difference, then it probably indicates a CPU bottleneck for the session. Higher wait times can be a result of slow LGWR as well as CPU bottleneck (evidenced by high involuntary context switches for session processes) 22 Initial Conclusions From the waits and stats, we came to following conclusions LGWR The underlying IO subsystem for the redo logs needed to be improved. The redo log members needed to be resized from 500M to a suitable size. Also increase the groups from 3 to 4. 
Reduce LGWR involuntary switches by addressing OS scheduling issues. 23 Initial Conclusions (contd.) ARCH Separate the redo log groups onto dedicated filesystems to prevent contention between ARCH and LGWR. Session Increase log buffer from the default to a suitable value taking into consideration impact on the event log file switch completion. 24 Final Configuration Details After 30 or so runs, we finally arrived at the below optimal configuration. Redo Filesystem configuration (to address IO issues) Striped filesystems on dedicated Raid 1 luns configured for the redo logs as shown in the next slide. Filesystem is vxfs with 8k block size. Stripe Width = 1M LGWR configuration (to address involuntary context switches) The FX scheduling class was set for the LGWR process. The CPU time quantum was increased to 1000 and the priority set to 59. # priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid <LGWR process> The thread affinity was set to 150 for the entire system, however we decided it to be best if the LGWR was bound to a specific CPU. # psrset –c <CPU> # psrset –b 1 <LGWR process> 25 New Filesystem Layout and redo group placement Redo Members LOG01A.dbf - 1500M (Primary of Group 1) LOG03A.dbf -1 500M (Primary of Group 3) Redo Members LOG02B.dbf - 1500M (Mirror of Group 2) LOG04B.dbf -1 500M (Mirror of Group 4) Redo Members LOG02A.dbf - 1500M (Primary of Group 2) LOG04A.dbf -1 500M (Primary of Group 4) Redo Members LOG01B.dbf - 1500M (Mirror of Group 1) LOG03B.dbf -1 500M (Mirror of Group 3) Filesystem 1 (vxfs) Layout=Stripe STWidth = 1M /u05/redo1 Filesystem 2 (vxfs) Layout=Stripe STWidth = 1M /u05/redo2 Filesystem 3 (vxfs) Layout=Stripe STWidth = 1M /u05/redo3 Filesystem 4 (vxfs) Layout=Stripe STWidth = 1M /u05/redo4 } Raid 1 Single Lun of 66.4GB Raid 1 Single Lun of 66.4GB 72GB, 15K RPM FC } 72GB, 15K RPM FC 72GB, 15K RPM FC } 72GB, 15K RPM FC 26 Final Configuration Details (contd.) Redo groups 4 redo groups configured with 2 members each. The log members were placed in such a manner on the redo filesystems to eliminate LGWR and ARCH IO contention. Each member was 8G in size (8G log members would reduce the log switches from 1 switch per minute to 1 switch every 7 minutes). Reducing log switches improves performance as during a log switch, redo generation is disabled. 8G was an ideal size – log archiving completed within 2 minutes whereas log switches happened every 7 minutes. Increasing the log member size resulted in higher wait times for the events – log file switch completion and log file sequential read. However the overall performance gain was well worth it. 27 Final Configuration Details (contd.) Session The log buffer was set to 72M (after several iterations). A 72M log buffer along with 8G log file members resulted in an higher response time for the event – log file switch completion. However we completely eliminated the wait event – log buffer space (even when simulating 1.5X load). 72M appears to be an ideal size for a redo generation rate up to 14MB/sec. The _log_io_size is set to a maximum of 1M irrespective of the log_buffer size once the log_buffer crosses ~ 6MB. Also since we had a Storage subsystem which was quite capable of handling upto 32M in a single write within acceptable response time, we did not downsize the log_io_size. 28 Final Configuration Details (contd.) Session (contd.) Improving the LGWR write performance however resulted in the redo allocation latch contention. 
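The latch pressure itself can be confirmed from v$latch before reaching for any underscore parameters; a minimal sketch, assuming SELECT access to the v$ views:

sqlplus -s "/ as sysdba" <<'EOF'
select name, gets, misses, sleeps, immediate_gets, immediate_misses
from v$latch
where name like 'redo%';
EOF
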
To reduce the redo allocation latch contention, we increased the parallelism for the shared redo buffer from the default of 2 to 12. _log_parallelism_max = 12 # Default is 2. Max - Limited to CPU count _log_parallelism = 4 # Default is 1 By enabling log_parallelism, the shared log buffer is split into log_parallelism_max sections each assigned a redo allocation latch. As per oracle documentation, the redo allocation latch for the shared log buffer is randomly assigned to the requestor and then does a round-robin allocation. We did notice that this was not an optimal way of assignment. 29 Final Configuration Details (contd.) ARCH The ARCH process reads OS sized blocks as set by the _log_archive_buffer_size parameter. The default and maximum value on Solaris with Oracle 10g is 2048 OS blocks (equates to 1MB reads). So the archive logs filesystem was also created as a stripe filesystem with 1MB stwidth. Performance improved as the redo logs filesystems and the archive filesystems were both stripe filesystems with 1MB stripe width. Average ARCH throughput was around 150MB/sec. However we did notice that the ARCH process reads from the primary group member only. It does not read simultaneously from both the members. We did not change the log_archive_max_processes from default (2). 30 Final Results Peak Work load showed an improvement of 7x. Least improvement was 4x. At 1.5X load, the scalability was near linear. 31 The results – Work Load Profile (redo) # Session Session Session Session LGWR LGWR LGWR Stat Before After commit wait requested 72/sec 520/sec commit wait performed 72/sec 520/sec redo entries generated 13326 entries/sec 14677 entries/sec redo size 9.45MB/sec 10.1MB/sec redo synch writes 72 sync writes/sec 520 sync writes/sec redo writes 74 writes/sec 845 writes/sec lgwr throughput 9.8 MB/sec 10.75 MB/sec 32 The results – Waits Before Event latch: redo allocation latch: redo writing log buffer space log file parallel write log file sequential read log file switch completion log file sync log file switch (checkpoint incomplete) Waits/sec Negligible Negligible 3.5 waits/sec 74 waits/sec 10 waits/sec 0.5 waits/sec 72 waits/sec 0.5 waits/sec Before (average wait time) Negligible Negligible 28.6 ms 11.8 ms 7.02 ms 20.82 ms 26.45 ms 0.38 ms Waits/sec 0.002 waits/sec 0 waits/sec 0 waits/sec 845 waits/sec 10.5 waits/sec 0.02 waits/sec 519 waits/sec 0 waits/sec After Target After Threshold (average wait time) (average wait time) 0.9 ms Eliminate 0 ms Eliminate 0 ms Eliminate 0.55 ms < 2ms 16.62 ms ? 31.5 ms ? 2.13 ms < 5 ms 0 ms Eliminate 33 The results – Stats 34 The results – System # CPU CPU Parameter Average Run Queue Average Involuntary context switches for LGWR Before 1.2 30% 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) 8ms (Reads) 12ms (Write) After 1.2 < 0.1 % 16ms (Reads) < 1ms (Write) 16ms (Reads) < 1ms (Write) 16ms (Reads) < 1ms (Write) Threshold No Change Eliminate 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) 10ms (Reads) 2ms (Write) Storage Average Redo Lun response time Storage Average File response time (redo logs) Storage Average Redo Volume Response Time 35 Final Thoughts In order of biggest impact to performance (in descending order), 1. IO Subsystem (50%) 2. Redo Groups layout and sizing of log file members (20%) 3. CPU Scheduling (15%) 4. Log Buffer (10%) 5. Log Parallelism (5%) The LGWR process in 10g is incredibly efficient requiring minimal tuning, however it would have been ideal if there was dedicated LGWR for each shared strand. 
One can only imagine the performance gain with multiple LGWR each servicing distinct log buffers. 36 </div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-89138270686280090802008-08-07T11:03:00.000-07:002008-09-16T17:08:20.792-07:00Oracle Database Capacity PlanningI do capacity planning for the EBI infrastructure at the place I work and created a presentation to share the methodology with my team.<br /><br /><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_680701326874207" name="doc_680701326874207" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="600" width="600"> <param name="movie" value="http://documents.scribd.com/ScribdViewer.swf?document_id=6063146&access_key=key-239v5id3s4ivevttwxbn&page=1&version=1&viewMode=list"> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <param name="mode" value="list"> <embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=6063146&access_key=key-239v5id3s4ivevttwxbn&page=1&version=1&viewMode=list" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_680701326874207_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle" mode="list" height="600" width="600"></embed> </object><div style="font-size:10px;text-align:center;width:600"><a href="http://www.scribd.com/doc/6063146/Oracle-Database-Capacity-Planningv3">Oracle Database - Capacity Planningv3</a> - <a href="http://www.scribd.com/upload">Upload a Document to Scribd</a></div><div style="display:none"> Read this document on Scribd: <a href="http://www.scribd.com/doc/6063146/Oracle-Database-Capacity-Planningv3">Oracle Database - Capacity Planningv3</a> </div><div style="display:none"> <br />Oracle Database – Capacity Planning Krishna Manoharan krishmanoh@gmail.com 1 Introduction – Capacity Planning Capacity planning is essential to deliver a predetermined optimal/consistent user experience throughout the lifecycle of a solution. Capacity planning refers to identifying the changes (from a capacity perspective only) needed to be done to the environment to maintain this predefined user experience over the lifecycle of the solution. In the simplest of terms, this changes can refer to adding more CPU/Memory/Storage/Network capabilities along with suitable configuration changes to the application (Grid, version upgrades, 64bit vs 32bit etc) as and when identified to be required. 2 Capacity Planning or Performance Tuning ? Capacity planning is pro-active whereas performance tuning is mostly reactive. Capacity planning is anticipating demand ahead of time and recommending suitable changes to the environment. Capacity planning (unlike performance tuning) is not an exact science in the sense it requires some guess work based on prior history and experience with the environment. I would feel, Performance tuning is trying to get the best out of existing infrastructure – for e.g – rewriting sql, creating an index etc. 
When a user complains of poor performance of the application, it is important that you establish if this is a capacity constraint resulting in sub optimal user experience or code/application issues. Capacity planning can help identify performance issues early on. 3 Capacity Planning Model Collect Stats New Requirements Establish Pattern and Behaviour Monitor (Update Profile with Stats on a Regular basis) Create Profile Thresholds Predict Possible Threshold Violations Threshold Violations Establish Thresholds One Time Define Action Plan for threshold violations Performance (Application) Capacity Constraint Resolve Change based on Action Plan Thresholds under control 4 Collecting stats & Profiling The first step is to identify suitable stats and capture them (assuming the application is in steady state). Statistics is from the application as well as the from the infrastructure (CPU/Memory/Storage etc). Then, one needs to start with profiling the application. Profiling the environment will help in Understanding the needs of the environment Correlating statistics from the application with the infrastructure. Charting and predicting growth using the previously established thresholds and As a result - proper capacity planning to meet the growth. 5 Profiling an application (contd.) Profile is basically a snapshot of the application. It enables you to see how it is performing with key statistics and changes over a period of time. Profiling can in-turn help identify performance issues and bottlenecks as an additional benefit during the process of capturing statistics. Once profiling is done, the next step is to establish thresholds. 6 Thresholds Thresholds indicate your comfort level – for e.g. – Redo/Day cannot exceed 50GB/day beyond which I need to revisit my redo configuration. Thresholds need to be defined and set for the key statistics in the profile. Also you identify the course of action to be followed if a threshold is violated. Reviewing the key statistics in the profile on a daily/weekly basis will allow you to plan in advance as to what changes need to be done. 7 Oracle perspective How can I do capacity planning on a pro-active basis for my oracle instance? 8 Oracle – Capacity planning The answer to this lies in reviewing, collating and corroborating Oracle statistics with statistics from various other subsystems such as OS, Storage and Network over a period of time. The key is to know which statistics to look at, how to interpret the numbers and establish thresholds. It is essential to know when to drill down into session level stats and when to stick to the top level as otherwise the stats will become overwhelming. Not to forget, Capacity planning is pro-active whereas performance tuning is mostly reactive. 9 Oracle Stats and Wait events From an oracle perspective, both stats as well as wait events needs to be captured on an ongoing basis. One would capture stats at an instance level and if required at a session level. To begin with, one can start with instance level stats collected every 24 hrs. The finer the interval, the more accurate the results, however it can get very cumbersome. It is best not to use the dba_hist views/AWR, but rather collect the stats from the v$views. The v$views are mostly incremental views and contain data from the instance startup time. 10 Oracle Stats and Wait events – contd. 
Stats can be collected for Work Load User related (Transactions, logons, parses etc) Redo activity Undo activity Temp activity Tablespace and object space usage PGA usage SGA usage Parallel Operations IO Operations File Stats and Temp Stats 11 Oracle Stats and Wait events – contd. Wait events help mostly in performance tuning and identifying steady state behaviour. For wait events, Top 10 waits including CPU time ordered by Time Waited along with Average Wait time Total Waits Wait class Filter out idle and parallel (PX*) waits. 12 Infrastructure Statistics From an Infrastructure perspective, to begin with the following stats can be collected. CPU – Utilization, run queue, context switches (voluntary and involuntary), interrupts, system calls, thread migrations) Storage – Number of IOPS/second, Queue Depth, Size of IOPS, Response time (lun level, volume and file), throughput. Filesystem – Usage, response time and growth. Memory – Physical memory consumed, swap in/out, page faults Network – Throughput and details from netstat –s and kstat. It is important to note that OS stats are generally not event driven and are time sampled. So they need to be correlated with application stats to make sense. 13 Basic Oracle Instance profile These stats allow us to create a simple and basic profile of the instance which can be used for daily reporting (shown next slide). It is important to note that even though many magnitudes of statistics are collected everyday, the profile should present only sufficient information to enable a decision to warrant further investigation if required. 14 A Simple profile for a Datawarehouse instance 15 A Simple profile – contd. DW Event read by other session db file sequential read db file scattered read direct path read temp log file sync log file parallel write direct path write temp db file parallel write control file parallel write os thread startup Class User I/O User I/O User I/O User I/O Commit System I/O User I/O System I/O System I/O Concurrency Average Wait (centiseconds) -90 days ( % Delta) -30 days (% Delta) -7 days (% Delta) 2 2.1 3 0.6 0.5 0.5 0.8 0.76 0.81 1.56 1.8 1.6 Not Present Not Present 0.5 Not Present Not Present 0.12 1.73 1.5 1.32 0.16 0.12 0.15 Not Present Not Present 0.98 18 14 18 Today 2 0.6 0.8 1.6 1 0.2 1.7 0.15 1 18 Threshold 2 0.5 <1 < 1.5 0 0 <1 < 0.1 0 < 12 16 Oracle - Capacity planning (contd.) To summarize Profile the environment Collect and collate initial set of statistics when environment is steady state and user response time is deemed satisfactory – Oracle, OS, Storage, Network . Define and establish thresholds – Oracle, OS, Storage and Network. As before, user response time should be deemed satisfactory. Repeat statistics collection over a defined period of time – Maybe monthly or quarterly. Establish a pattern of change – certain statistics increase over a period of time, whereas others decrease. Based on the pattern of change, plan on adding additional capacity. At any point during this time, bottlenecks can be identified and resolved accordingly. 
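A collection job for these instance-level numbers can be as small as a cron-driven shell script; the sketch below is only illustrative (the output path, delimiter and once-a-day schedule are assumptions, not part of the original method):

#!/bin/sh
# Dump instance-wide statistics and non-idle wait events once a day for trending.
DAY=`date +%Y%m%d`
sqlplus -s "/ as sysdba" <<'EOF' > /var/tmp/ora_profile_$DAY.lst
set pagesize 0 linesize 200 feedback off
select name||'|'||value from v$sysstat;
select event||'|'||total_waits||'|'||time_waited||'|'||average_wait
from v$system_event where wait_class <> 'Idle';
EOF
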
17 Oracle Stats and Waits – v$views Common Views v$sysstat v$sys_time_model v$pgastat v$sgainfo v$filestat v$tempstat dba_free_space v$sesstat v$system_event v$session_event v$segstat Comments Most oracle statistics CPU Wait Statistics PGA Statistics SGA Statistics File IO Statistics Temp file statistics Tablespace space usage Session Statistics Wait Statistics Session Wait Statistics Segment Statistics 18 Oracle Stats Detail (Can be collected on a daily basis) Workload NAME db block changes DB time CPU used by this session Redo NAME redo buffer allocation retries redo log space requests redo log space wait time redo blocks written redo entries redo size redo writes background checkpoints completed redo synch writes redo synch time redo size redo write time redo wastage User related NAME opened cursors cumulative parse count (failures) parse count (hard) parse count (total) parse time cpu execute count logons cumulative user commits user rollbacks Source v$sysstat v$sysstat v$sysstat Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat PGA NAME aggregate PGA target parameter aggregate PGA auto target maximum PGA allocated global memory bound total PGA used for auto workareas over allocation count cache hit percentage sorts (disk) sorts (memory) sorts(rows) workarea executions - multipass workarea executions - onepass workarea executions - optimal workarea memory allocated SGA NAME Buffer Cache Size Shared Pool Size Large Pool Size Maximum SGA Size Free SGA Memory Available prefetched blocks aged out before use Undo NAME consistent gets undo change vector size consistent changes DBWR undo block writes transaction rollbacks Source v$pgastat v$pgastat v$pgastat v$pgastat v$pgastat v$pgastat v$pgastat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat Source v$sgainfo v$sgainfo v$sgainfo v$sgainfo v$sgainfo v$sysstat Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat 19 Oracle Stats – Sample Parallel NAME DDL statements parallelized DFO trees parallelized DML statements parallelized Parallel operations downgraded 1 to 25 pct Parallel operations downgraded 25 to 50 pct Parallel operations downgraded 50 to 75 pct Parallel operations downgraded 75 to 99 pct Parallel operations downgraded to serial Parallel operations not downgraded queries parallelized IO Related NAME physical read total bytes physical read total IO requests physical reads direct physical reads direct temporary tablespace physical read total multi block requests physical write total bytes physical write total IO requests physical write total multi block requests physical writes direct physical writes direct temporary tablespace user I/O wait time Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat (Can be collected on a daily basis) Enqueue NAME enqueue timeouts enqueue waits enqueue deadlocks enqueue requests enqueue conversions enqueue releases Table and Index NAME table scans (short tables) table scans (long tables) table scans (rowid ranges) table scans (direct read) table fetch by rowid table fetch continued row index fast full scans (full) index fast full scans (rowid ranges) index fast full scans (direct read) index fetch by key Source v$sysstat 
v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat Source v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat v$sysstat 20 Oracle Stats Detail – Sample (Can be collected on a daily basis) DATE 01-Aug-08 02-Aug-08 03-Aug-08 04-Aug-08 EVENT db file scattered read db file scattered read db file scattered read db file scattered read TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT 1838346 0 6033377 0.8 1906533 0 6034577 0.75 1754866 0 5965344 0.9 2356571 0 6154334 0.23 WAIT_CLASS User I/O User I/O User I/O User I/O 21 Oracle Stats Detail Used Space in MB Date 01-Aug-08 02-Aug-08 03-Aug-08 04-Aug-08 Avl Space in MB Date 01-Aug-08 02-Aug-08 03-Aug-08 04-Aug-08 Tablespace1 25000 25120 25220 25989 (Can be collected on a daily basis) Tablespace2 Tablespace3 Tablespace n 31000 14000 13210 32001 14990 13210 32150 15010 13210 33000 15201 13210 Tablespace1 10000 9880 9780 9011 Tablespace2 Tablespace3 Tablespace n 4000 21000 21790 2999 20010 21790 2850 19990 21790 2000 19799 21790 22 Oracle Stats Detail Datafile /DW/dat01/file1.dbf /DW/dat01/file2.dbf /DW/dat01/file3.dbf /DW/dat01/file4.dbf .. .. .. /dev/vx/rdsk/dwdg/dwtmp0 (Can be collected on a daily basis) IOPS/Day Avg Response Time/Day Max Response Time/Day 35000 10ms 24ms 120000 12ms 15ms 68461 15ms 18ms 58799 8ms 10ms 130000 30ms 68ms 23 Infrastructure Statistics (Can be collected on a daily basis) Infrastructure Stats cpu user time cpu sys time context switches (inv and vol) system calls Filesystem usage Thread migrations Interrupts Run queue Network Stats Memory Stats File IO Stats (Complements oracle) Volume Stats Lun Stats Queue depth, throughput, response time Storage allocated Comments sar, vmstat, mpstat sar, vmstat, mpstat mpstat sar, vmstat, mpstat df mpstat vmstat, mpstat vmstat, sar, w netstat and kstat vmstat odmstat vxstat vxdmpadm, swat and iostat vxstat, odmstat, swat, iostat vxdg 24 </div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-41847839408609994012008-07-10T17:37:00.001-07:002008-07-18T17:09:32.762-07:00Simple Userland tools on Unix to help analyze application impact as a non-root user - Storage SubsystemNext in the series of presentations to my team - this time to analyze storage performance and impact - specific to oracle.<br /><br /><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_794614972843146" name="doc_794614972843146" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="600" width="600"> <param name="movie" value="http://documents.scribd.com/ScribdViewer.swf?document_id=3996500&access_key=key-9zyi8ihckdku5inyppq&page=1&version=1"> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=3996500&access_key=key-9zyi8ihckdku5inyppq&page=1&version=1" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_794614972843146_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" 
align="middle" height="600" width="600"></embed> </object><div style="font-size:10px;text-align:center;width:600"><a href="http://www.scribd.com/doc/3996500/Analyze-application-impact-Storage-Subsystemv2">Analyze application impact - Storage Subsystemv2</a> - <a href="http://www.scribd.com/upload">Upload a Document to Scribd</a></div><div style="display:none"> Read this document on Scribd: <a href="http://www.scribd.com/doc/3996500/Analyze-application-impact-Storage-Subsystemv2">Analyze application impact - Storage Subsystemv2</a> </div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-51072878456086890442008-05-09T17:59:00.000-07:002008-06-07T18:32:34.351-07:00Simple Userland tools on Unix to help analyze application impact as a non-root user - CPU subsystemI recently gave a presentation to my team on how to analyze the impact of the application on the CPU subsystem.<br /><br /><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_221674792152891" name="doc_221674792152891" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="600" width="600"> <param name="movie" value="http://documents.scribd.com/ScribdViewer.swf?document_id=3271807&access_key=key-1kvpvowf3t2fn2td9njs&page=1&version=1"> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=3271807&access_key=key-1kvpvowf3t2fn2td9njs&page=1&version=1" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_221674792152891_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle" height="600" width="600"></embed> </object><div style="font-size:10px;text-align:center;width:600"><a href="http://www.scribd.com/doc/3271807/Analyze-application-impact-CPU-Subsystem">Analyze application impact - CPU Subsystem</a> - <a href="http://www.scribd.com/upload">Upload a Document to Scribd</a></div><div style="display:none"> Read this document on Scribd: <a href="http://www.scribd.com/doc/3271807/Analyze-application-impact-CPU-Subsystem">Analyze application impact - CPU Subsystem</a> </div><div style="display:none"> <br />Analyzing Application impact using simple userland tools – CPU subsystem Krishna Manoharan krishmanoh@gmail.com 1 Introduction Every application impacts the host Operating system and connected sub-systems in a unique way. In order to profile an application and understand its impact on the environment, there are a number of userland tools provided within the OS. Many of these tools do not require super-user privileges thus enabling ordinary users such as dba’s or application developers, the ability to see and gauge the impact of the application on the system. 2 Subsystems in an environment One needs to analyze the impact of an application on all the major subsystems in an environment. 
CPU Memory Storage Network 3 CPU Subsystem - Metrics The key metrics to look for when measuring CPU utilization is Run queue – The number of running threads or threads waiting to run on a CPU. Utilization – CPU utilization Context switches (Voluntary and Involuntary) - Context switch is when a kernel thread is suspended from execution on the CPU with its current state captured and another thread is scheduled. Thread Migrations – Shows kernel threads which re-awake after a context switch and are running on a different CPU than previous run. Interrupt generation – Interrupts are signals from hardware devices such as network, disk etc or from software which force the CPU to process them immediately. System Calls - System calls allow an application to request the kernel to perform privileged operations such as IO, new threads etc. Translation Lookaside Buffer Miss – TLB is a table maintained in the CPU which contains a subset of mappings of virtual memory addess to actual physical address. When a CPU need to convert a virtual memory address to a physical address, it searches the TLB first and if not found, it is a miss. TLB misses are expensive. 4 CPU Subsystem - Tools To measure the impact of an application on the CPU subsystem, the tools available are sar - Gathers and reports on a variety of statistics. Make sure it is set to 5 minute intervals rather than the default. vmstat – Reports CPU related statics mpstat - Reports CPU related statistics prstat - Reports process related statistics truss - Used in debugging ps – Reports process related statistics trapstat - Reports TLB misses. Of all these, only trapstat needs to be run as root. 5 CPU Subsystem – Run Queue There are many schools of thought regarding the size of the run queue. Normally one would say that a run queue of 2 or more shows a CPU bottleneck. On a SMP system, each CPU has its own run queue and so for a 8 CPU system, would a run queue of 16 would be okay? It all depends on what the user perception is. You may have a run queue of 100 and still be okay with it if you are meeting your SLA. I personally have seen Tibco Business Works running on Solaris and racking up a sustained run queue of +80 on a 4 CPU V440 and still be averagely responsive. Similarly a Oracle 10g RAC installation running on a Dell 2950 with 8 CPU's, a sustained run queue of +12 would leave the system pretty much unresponsive and unusable. I would feel that a sustained run queue of 10 or more coupled with a lot of involuntary context switches means you definitely have a CPU bottleneck. All this assuming that your average CPU utilization is well within limits (average 30-50%). The important thing to note about the run queue when analyzing application impact is that none of these tools discussed above will tell you if the run queue is due to the application or other process. In order to find out what exactly is causing it, you would need to use prstat. 6 CPU Subsystem – Run Queue contd. The run queue can be obtained either using vmstat, w or sar. kthr rbw memory swap free re page disk in faults sy cpu cs us sy id mf pi po fr de sr s9 s3 s3 s3 0 13 0 84634792 97274944 103 573 96 26 24 0 0 0 0 0 0 4377 31853 14630 23 3 74 The run queue can also be obtained using the sar command. This gives a historic perspective. mkrishna@tus1dwhdbspex01:> sar –q SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 00:00:00 runq-sz %runocc swpq-sz %swpocc 00:05:00 1.0 2 0.0 0 06/04/2008 The w command also gives the load average for the last 1, 5, 15 minutes. 
The load average is the run queue. mkrishna@tus1dwhdbspex01:> w 8:30pm User root up 24 day(s), 22:12, tty console login@ 4 users, idle load average: 7.71, 8.87, 8.86 JCPU PCPU what format 10May0825days 7 CPU Subsystem – Utilization Measure of how busy the CPU is. CPU can be busy either processing high complexity (compression, encyption, Math) threads resulting in high cpu utilization or busy processing medium to low complex threads showing lower utilization. You can have a high run queue and low CPU utilization. Overall CPU utilization can be obtained from vmstat sar Process wise CPU utilization can be obtained from prstat ps Individual CPU utilization can be obtained from mpstat 8 CPU Subsystem – Utilization – Contd. Overall CPU utilization - Measure using vmstat and sar vmstat – In an ideal scenario, % system time will be less than 5%. % user time will be the real measure of CPU utilization. % wait is to be ignored completely. kthr rbw memory swap free re page disk faults in sy cs cpu us sy id 23 3 27 2 74 72 mf pi po fr de sr s9 s3 s3 s3 0 13 0 84634792 97274944 103 573 96 26 24 0 0 0 0 0 0 4377 31853 14630 0 19 0 77838000 93835392 0 21 0 0 0 0 0 0 0 0 0 5997 31834 9154 sar - Gives historical reporting. More details on how to get sar reports for previous days. mkrishna@tus1dwhdbspex01:> sar –u SunOS tus1dwhdbspex01 5.9 Generic_122300-16 sun4u 00:00:00 00:05:00 %usr 31 %sys 3 %wio 38 %idle 29 06/04/2008 9 CPU Subsystem – Utilization – Contd. Process CPU utilization prstat – Solaris equivalent to top. Gives a considerable amount of detailed information. In order to see individual process cpu utilization and overall user cpu utilization, run as prstat –a 1. prstat also can give a break up of the run-queue and which thread/process is currently executing. mkrishna@tus1dwhdbspex01:> prstat -a 1 PID USERNAME 29192 oraodsup NPROC USERNAME SIZE 20G SIZE RSS STATE 11G cpu522 RSS MEMORY 77% PRI NICE 10 0 TIME 57:08:28 TIME CPU PROCESS/NLWP 0:12:59 3.1% oracle/1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CPU 26% 185 oraodsup 3579G 1969G The STATE column gives the currently running processes and the CPU column gives the % CPU consumed by that process. The Output at the bottom gives overall CPU consumed by a specific user account. 10 CPU Subsystem – Utilization – Contd. Process CPU utilization over time • The ps command will give the overall CPU time consumed by a process. The time column shows the amount of CPU execution time. This will give an idea as to what process has consumed the most CPU time since system boot. mkrishna@tus1dwhdbspex01:> ps -e -o time,args |more TIME COMMAND 0:12 sched 4:38 /etc/init 0:00 pageout 1-23:03:57 fsflush 0:00 -sh 0:10 /usr/lib/inet/in.mpathd –a 11 CPU Subsystem – Utilization – Contd. Individual CPU Utilization The mpstat command will give the individual CPU utilization on a SMP system. Checking for CPU utilization using mpstat allows you to see if all the CPU’s are being used and if SMP is really happening. mkrishna@tus1dwhdbspex01:> mpstat CPU minf mjf xcal 0 1 36 16 0 0 926 914 intr ithr 313 18 201 1 csw icsw migr smtx 475 480 14 13 34 34 81 71 srw syscl 0 0 1125 929 usr sys 28 21 2 2 wt idl 30 34 40 44 12 CPU Subsystem – Context switching and Thread Migrations Context switch (CS) is when a kernel thread is suspended from execution on the CPU with its current state captured and another thread is scheduled. Thread migrations happen when a thread was running on a particular CPU and was CS out. 
Then when it is ready to run, it is scheduled on a different CPU. Thread migrations are not efficient because the current CPU Cache is not warm with the thread information. Context switches happen as a result of the below: » Time quanta expired for the running thread – Causes Involuntary CS » Higher priority thread needs to run - Causes Involuntary CS » Interrupt is being serviced (not 100% context switch) - Causes Involuntary CS » Thread has completed it’s activity or issued a blocking system call – Causes Voluntary CS An Involuntary CS means that a thread did not get an opportunity to complete it’s work. A system can have 100,000 context switches and still meet SLA’s if most of the CS are voluntary in nature. A high degree of involuntary context switches coupled with thread migrations usually means there is a CPU related bottleneck. 13 CPU Subsystem – Context switching and Thread Migrations – Contd. The tools to measure context switches and thread migrations are • • • • vmstat mpstat prstat sar vmstat mkrishna@tus1dwhdbspex01:> vmstat 1 kthr rbw memory swap free re page disk in faults sy cpu mf pi po fr de sr s9 s3 s3 s3 cs us sy id 0 13 0 84568400 97239720 102 573 95 26 24 0 0 0 0 0 0 4398 31925 14622 23 3 74 0 18 0 76750792 91878616 6 146 0 63 55 0 0 0 0 0 0 6712 45464 11107 33 3 64 mpstat mkrishna@tus1dwhdbspex01:> mpstat CPU minf mjf xcal 0 36 0 925 intr ithr 313 201 csw icsw migr smtx 474 15 34 81 srw syscl 0 1125 usr sys 28 2 wt idl 30 40 14 CPU Subsystem – Context switching and Thread Migrations – Contd. prstat – using the –m switch which shows specific process related statistics. It allows you to drill down as to which process is being context switched. mkrishna@tus1dwhdbspex01:> prstat –n 1000 -m 1 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 5221 oraodsup 100 0.0 - 0.0 0 11 2 0 oracle/1 5400 oraodsup 29 2.4 69 - 235 28 1K 0 oracle/1 sar -w – shows history of context switches mkrishna@inftpd1:> sar -w SunOS inftpd1 5.10 Generic_118833-36 sun4u 06/06/2008 00:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s 00:05:00 0.00 0.0 0.00 0.0 1420 15 CPU Subsystem – Interrupts Interrupts are mechanisms by which a hardware device (HBA, Network adaptor) or application (signals) can get CPU attention. Interrupts are generated to specific CPU’s. When an interrupt is generated, the current running thread is pinned and the interrupt serviced. The pinned thread is then again resumed. Too many interrupts can hurt application performance. 16 CPU Subsystem – Interrupts contd. Interrupts can be monitored using mpstat and vmstat vmstat mkrishna@tus1dwhdbspex01:> vmstat 1 kthr rbw memory swap free re page disk faults cpu cs us sy id mf pi po fr de sr s9 s3 s3 s3 in sy 0 13 0 84568400 97239720 102 573 95 26 24 0 0 0 0 0 0 4398 31925 14622 23 3 74 0 18 0 76750792 91878616 6 146 0 63 55 0 0 0 0 0 0 6712 45464 11107 33 3 64 mpstat mkrishna@tus1dwhdbspex01:> mpstat CPU minf mjf xcal 0 36 0 925 intr ithr 313 201 csw icsw migr smtx 474 15 34 81 srw syscl 0 1125 usr sys 28 2 wt idl 30 40 17 CPU Subsystem – System Calls System calls are API which allow a process to request privileged operations such as IO, memory allocation or new thread creation etc. It is important to identify the time spent servicing system calls. Too many system calls (such as the timed_statistics in oracle) can swamp the system. 18 CPU Subsystem – System Calls contd. 
System calls can be monitored using vmstat, mpstat, prstat and sar (historical trending) vmstat mkrishna@tus1dwhdbspex01:> vmstat 1 kthr rbw memory swap free re page disk in faults cpu cs us sy id mf pi po fr de sr s9 s3 s3 s3 sy 0 13 0 84568400 97239720 102 573 95 26 24 0 0 0 0 0 0 4398 31925 14622 23 3 74 0 18 0 76750792 91878616 6 146 0 63 55 0 0 0 0 0 0 6712 45464 11107 33 3 64 mpstat mkrishna@tus1dwhdbspex01:> mpstat CPU minf mjf xcal 0 36 0 925 intr ithr 313 201 csw icsw migr smtx 474 15 34 81 srw syscl 0 usr sys 28 2 wt idl 30 40 1125 19 CPU Subsystem – System Calls contd. sar -c mkrishna@inftpd1:> sar -c |more SunOS inftpd1 5.10 Generic_118833-36 sun4u 00:00:00 scall/s sread/s swrit/s 00:05:00 fork/s 0.07 06/06/2008 exec/s rchar/s wchar/s 0.07 56915 41169 1400 103 21 20 CPU Subsystem – System Calls contd. The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs. truss is very useful to trace system calls and the time spent on a particular call. 21 CPU Subsystem – System Calls contd. In 126 seconds, 50 seconds were spent on reading data from the cache folder. root@ustu-infapppdin21:> truss -c -p 23050 syscall seconds calls errors read 40.740 639326 write 1.313 29562 time .079 4286 lseek 9.807 608657 fcntl .000 35 lwp_park .197 5170 1 lwp_unpark .142 5168 poll .046 731 yield .001 29 ------------- ---sys totals: 52.328 1292964 1 usr time: 25.691 elapsed: 126.210 22 Historic sar data How do I read historical sar data? By default sar data is saved in /var/adm/sa for 7 days. BI systems have been customized to save data for 30 days. To read sar data from previous days, sar –f /var/adm/sa/sa<01-29> -q|u|w|etc 23 So much data and how do I correlate? If you have kept up with me so far, you will begin to wonder as to how do I take all this data in and make it meaningful? It depends on what you wish to accomplish. – Short Term Reactive – Check on a possible CPU bottleneck right away – Long Term – Do trending and see how usage has changed over a period of time. For long term Sar data is invaluable, however it is not complete. Sar captures run queue, utilization, system calls and context switches. However sar does not capture interrupts, icsw or thread migrations to measure CPU load, so custom scripts needs to be written to capture this information. Long term – it is difficult to capture specific process/user related information unless process accounting is installed. A better option for the above would be tools such as sitescope, 24 Short Term Reactive approach Suspect CPU Bottleneck ? Application confirms (AWR/ASH) or poor performance. System related processes are not the cause. Check using w (load average), vmstat, sar (run queue and utilization) Debug process? e.g. Do a truss –c –p <PID> and see where time is being spent by the process on which system call. From AWR/ASH reports, identify and analyze the sql. Check using prstat –m –a 1 for processes using up CPU. Check for cpu, cs, icsw and system calls for specific process(s). Identify process either from application or through ps Further action (application/sql analysis, add more cpu’s, breakup/stagger load etc) 25 Long Term Pro-Active approach Setup sar for running every 5 minutes. 
Either setup custom scripts to capture other data such as icsw, interrupts etc or install tools such as BMC Patrol/Sitescope/Openv iew Prepare trending reports for CPU run queue, utilization, cs, icsw, thread migrations and interrupts over a period of time (say 3 months). This gives an overall picture of CPU related metrics. Further action 26 From an oracle perspective All this is fine, how do I know oracle sees a CPU bottleneck? The answer is in AWR/ASH reports – AWR/ASH reports will show a CPU wait in the top 5 waits Table dba_hist_sysmetric_summary (Oracle 10G) The “Current OS Load: column will show the run queue as seen by oracle for every AWR snapshot. 27 </div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-13986510548643713812008-03-29T18:59:00.000-07:002008-04-10T17:44:27.076-07:00Implementing Veritas Flashsnap for database backupsSnapshot Technology is a fast and efficient method to take backups of large scale databases. Unlike RMAN which is Oracle based, Snapshots uses mirroring technologies managed by the system/storage. There are many variants of snapshots available and the most commonly used method is one in which an entire copy of the database is made. This permit off-host backups and can be incrementally refreshed on a daily basis.<br /><br />Veritas Flashsnap is a Point in Time Data Mirroring solution using Veritas technologies. Veritas Flashsnap is tightly integrated into Veritas Volume Manager (VxVM) and Veritas Filesystem (VxFS). It requires a separate license to be purchased and enabled (at least for versions 4.x). It is important to note that Flashsnap is VxVM Volume based rather than LUN based (as Shadow Image or Timefinder).<br /><br />This article is an attempt to create a reference configuration for implementing Flashsnap to backup an Oracle database using Storage Foundation for Oracle 4.0.<br /><br />Flashsnap has many options available as to the kind of Point in Time Mirror to be deployed. This article focuses on the Full sized Instant Snapshot option by which a mirror is taken of the entire database on dedicated LUNs which can then be imported onto a Media Server for backup to tape.<br /><br />Flashsnap has inherent advantages over Shadow Image (HDS) or Timefinder (EMC) technologies -<br /><br /><ul><li>Since it is VxVM volume based, the LUNs used for the mirror can be on different storage array (e.g. a low cost array using SATA drives) than your primary storage. If you were to use Shadow Image etc, then you would need to dedicate storage on the primary array as used by the host.</li><li>The rate of synchronization between source (database) volumes and target (dedicated backup) volumes can be controlled with fine granularity.</li><li>Unlike Array based solutions, VxVM allows you to split and deport the target volumes as a different disk group from the parent. No need for complicated scripts to change the disk group name when you wish to import the backup LUNs on the source system to do a recovery.</li><li>In most big shops, the primary storage will be an Enterprise Array shared between many hosts. When using Array Based technologies, during the synchronization of large number of LUNs, a noticeable impact will be observed on the Primary Array thus affecting every attached host on the array. However when doing Flashsnap, the impact would be only to the host running the flashsnap operation rather than the primary array. </li><li>The source and target volumes need not be of the same configuration and layout. 
Unlike Array based technologies, one can have different LUN configurations and also different sizing.</li><li>One can deploy a Flashsnap solution in less than 30 minutes. It is easy to implement and works well if configured and managed correctly (like all good solutions done right the first time).</li></ul><span style="font-size:130%;"><span style="font-weight: bold;">System Configuration </span></span><br /><br />The system configuration and specifications as was used in this article is documented below.<br /><ul><li>Platform – SunFire E4900 with 8 CPU’s and 32GB of RAM</li><li>OS Version – Solaris 9 </li><li>Database – Oracle 9204</li><li>Veritas Volume Manager – Storage Foundation for Oracle Enterprise Edition 4.0 (MP2)</li><li>HBA’s – 4 Dual Port Emulex LP10000 (64 Bit/66 MHz), Driver version 6.01f installed on 66Mhz slots installed in the 2 I/O boards on the E4900.</li><li>Number of volumes requiring backup - 18</li><li>Number of LUNs – 48 </li><li>Total size – 3.2 TB</li><li>Storage Array for Source LUNs – Hitachi 9585 using 72GB 10K RPM FCAL drives in a Raid 1 configuration.</li><li>Storage Array for SNAP LUNs – Hitachi 9585 using 143GB 10K RPM FCAL drives in a Raid 5 Configuration (4D+1).</li></ul><span style="font-size:130%;"><span style="font-weight: bold;">License Check<br /><br /></span></span>The first task would be to check the License for Flashsnap. Both Flashsnap and FastResync need to be enabled. The Flashsnap license also allows the vxdg split/join command to be performed allowing volumes to be split onto a different disk group.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# /opt/VRTSvlic/bin/vxlicrep</span><br /><br /><span style="font-family:courier new;">VERITAS License Manager vxlicrep utility version 3.02.005</span><br /><span style="font-family:courier new;">Copyright (C) 1996-2004 VERITAS Software Corp. All Rights reserved.</span><br /><br /><span style="font-family:courier new;">Creating a report on all VERITAS products installed on this system</span><br /><span style="font-family:courier new;">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span><br /><span style="font-family:courier new;">-----------------***********************-----------------</span><br /><br /><span style="font-family:courier new;"> License Key = 4251216366621396656112014869701</span><br /><span style="font-family:courier new;"> Product Name = Warning!!! 
Key not associated with a valid product</span><br /><br /><span style="font-family:courier new;"> Key = Valid</span><br /><span style="font-family:courier new;"> License Type = PERMANENT_NODE_LOCK</span><br /><span style="font-family:courier new;"> Node Lock Type = (Hostid and Architecture ID)</span><br /><span style="font-family:courier new;"> Editions Product = YES</span><br /><br /><span style="font-family:courier new;">Features :=</span><br /></span> <span style="color: rgb(255, 0, 0);font-family:courier new;font-size:85%;" >POINT_IN_TIME = Enabled</span><span style="font-size:85%;"><br /><br /><span style="font-family:courier new;">-----------------***********************-----------------</span><br /><br /><span style="font-family:courier new;"> License Key = 65879700989905626340292027039</span><br /><span style="font-family:courier new;"> Product Name = VERITAS Volume Manager</span><br /><span style="font-family:courier new;"> Key = Valid</span><br /><span style="font-family:courier new;"> License Type = PERMANENT_NODE_LOCK</span><br /><span style="font-family:courier new;"> Node Lock Type = (Hostid and Architecture ID)</span><br /><br /><span style="font-family:courier new;">Features :=</span><br /></span> <span style="color: rgb(255, 0, 0);font-family:courier new;font-size:85%;" >FASTRESYNC = Enabled</span><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">VxVM Tunables</span></span><br /><br />In order to ensure that Veritas Volume Manager is adequately tuned to meet enterprise demands, the below variables were set.<br /><br />/etc/system changes from default:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">set maxphys=8388608</span><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;"></span><br /><span style="font-family:courier new;">set vxio:vol_maxio=16384</span><br /><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;">set vxio:vol_maxioctl=131072</span><br /><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;">set vxio:vol_maxspecialio=16384</span><br /><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;">set vxio:vol_default_iodelay=10</span><br /><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;">set vxio:voliomem_chunk_size=131072</span><br /><br /><span style="font-family:courier new;"></span><span style="font-family:courier new;">set vxio:voliomem_maxpool_sz=134217728</span><br /></span><br />Tthe volpagemod_max_memsz variable is set to 1024m in a startup script.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">/usr/sbin/vxtune volpagemod_max_memsz 1024m</span></span><br /><br />The vxiod parameter is set to 32 by editing the vxvx-startup2 init script.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Emulex Configuration changes</span></span><br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">lun-queue-depth=30;</span><br /><span style="font-family:courier new;">tgt-queue-depth=256;</span><br /><span style="font-family:courier new;">num-iocbs=2048;</span><br /><span style="font-family:courier new;">num-bufs=1024;</span></span><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Storage Connectivity<br /><br /></span></span>The SAN was rated at 2Gbit/sec and the connectivity is as shown below.<br /><br />Primary Storage - 
lpfc0 and lpfc1 were on IO Board 0 and lpfc2 and lpfc3 were on IO Board 1.<br /><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUF5o602HxwsPYDQyZWMAfuITh38nllAPq4Di7wGrFXeIv3v_mjGOCfO2JJMfNHDQcaYAzI2OSUQal9mpgsR4m4gpJBvQ1hslHA4z7YtH9Dhcpx332G-MuREShIABagTxe4TS8c1koy4Eg/s1600-h/untitled.GIF"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUF5o602HxwsPYDQyZWMAfuITh38nllAPq4Di7wGrFXeIv3v_mjGOCfO2JJMfNHDQcaYAzI2OSUQal9mpgsR4m4gpJBvQ1hslHA4z7YtH9Dhcpx332G-MuREShIABagTxe4TS8c1koy4Eg/s400/untitled.GIF" alt="" id="BLOGGER_PHOTO_ID_5183567250938047218" border="0" /></a>Backup Storage is connected in the same manner, except that the dedicated HBAs are numbered lpfc4 to lpfc7.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Flashsnap Configuration</span></span><br /><br />Henceforth, the volumes to be backed up are called Primary Volumes and the mirror copies of those volumes are called Backup Volumes. The same convention applies to Primary and Backup LUNs.<br />The basic configuration steps are:<br /><br /><ul><li>Ensure that the host sees the backup LUNs as presented by the Storage.</li><li>Configure the backup LUNs and add them to the same diskgroup as the Primary Volumes.</li><li>Create Backup Volumes of the same size as the Primary Volumes using the backup LUNs.</li><li>Prepare the Primary and Backup Volumes for Flashsnap.</li><li>Perform the initial sync.</li><li>Split the Backup Volumes off into a different diskgroup and import it on the media server for backup.</li><li>Going forward, do an incremental refresh as required. This entails importing the Backup Diskgroup and joining it back to the Primary Diskgroup to perform the refresh.</li></ul><span style="font-weight: bold;">Step 1 and Step 2 –</span><br /><br />It is assumed that Step 1 (the host sees the backup LUNs) and Step 2 (the backup LUNs are initialized and added to the same diskgroup as the Primary Volumes) are completed successfully. 
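<br /><br />As a rough sketch only (the device name shown is hypothetical, and the exact path to vxdisksetup depends on the installation), initializing one backup LUN and adding it to the existing diskgroup under the SN naming convention might look like this:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># Initialize the LUN for VxVM use</span><br /><span style="font-family:courier new;">bash-2.03# /etc/vx/bin/vxdisksetup -i c8t40d22</span><br /><br /><span style="font-family:courier new;"># Add it to the Primary Diskgroup under an SN-prefixed disk name</span><br /><span style="font-family:courier new;">bash-2.03# vxdg -g testdg adddisk SNtestdg22=c8t40d22</span></span><br /><br />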
For the sake of clarity, the Backup LUNs are initialized and added to the Primary Diskgroup using ‘SN’ as the prefix for the disk name.<br /><br />For example:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">root@snaptest:> vxdisk list |more<br /><br /></span><span style="font-family:courier new;">DEVICE TYPE DISK GROUP STATUS</span><br /><span style="font-family:courier new;">c2t18d0s2 auto:cdsdisk testdg00 testdg online</span><br /><span style="font-family:courier new;">c2t18d1s2 auto:cdsdisk testdg01 testdg online</span><br /><span style="font-family:courier new;">c2t18d2s2 auto:cdsdisk testdg02 testdg online</span><br /><span style="font-family:courier new;">c2t18d3s2 auto:cdsdisk testdg03 testdg online</span><br /><span style="font-family:courier new;">c2t18d4s2 auto:cdsdisk testdg04 testdg online</span><br /><span style="font-family:courier new;">c2t18d5s2 auto:cdsdisk testdg05 testdg online</span><br /><span style="font-family:courier new;">c2t18d6s2 auto:cdsdisk testdg06 testdg online</span><br /><span style="font-family:courier new;">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span><br /><span style="font-family:courier new;">c8t40d71s2 auto:cdsdisk SNtestdg85 testdg online</span><br /><span style="font-family:courier new;">c8t40d72s2 auto:cdsdisk SNtestdg72 testdg online</span><br /><span style="font-family:courier new;">c8t40d73s2 auto:cdsdisk SNtestdg73 testdg online</span><br /><span style="font-family:courier new;">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span></span><br /><br />Here disks starting with testdg refer to the Primary LUNs and the disks starting with SN refer to the Backup LUNs.<br /><br /><span style="font-weight: bold;">Step 3:</span><br /><br />Step 3 is to create the Backup Volumes with the same size as the Primary Volumes.<br />In the example shown below, one of the Primary Volumes is a 4-column stripe:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">v test05 - ENABLED ACTIVE 556785664 SELECT test05-01 fsgen</span><br /><span style="font-family:courier new;">pl test05-01 test05 ENABLED ACTIVE 556785664 STRIPE 4/2048 RW</span><br /><span style="font-family:courier new;">sd testtdg22-01 test05-01 testtdg22 0 139196416 0/0 c5t16d22 ENA</span><br /><span style="font-family:courier new;">sd testtdg23-01 test05-01 testtdg23 0 139196416 1/0 c5t16d23 ENA</span><br /><span style="font-family:courier new;">sd testtdg24-01 test05-01 testtdg24 0 139196416 2/0 c5t16d24 ENA</span><br /><span style="font-family:courier new;">sd testtdg25-01 test05-01 testtdg25 0 139196416 3/0 c5t16d25 ENA</span><br /></span><br />While the backup volume can be created with any layout, it is ideal to keep the same layout. It is important to note that the Backup Volumes are created on the Backup LUNs using the -U fsgen option. The source volumes would also ideally be created with the -U fsgen option.<br /><br />The volume SNtest05 shown below will be the Backup Volume for Primary Volume test05. 
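<br /><br />One possible way to create such a Backup Volume is sketched below. This is illustrative only: the length and stripe parameters are taken from the example above, and the SN disk names are placeholders that should match the actual Backup LUN disk names.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># Create the Backup Volume with the same size and layout as test05, using -U fsgen</span><br /><span style="font-family:courier new;">bash-2.03# vxassist -g testdg -U fsgen make SNtest05 556785664 \</span><br /><span style="font-family:courier new;">layout=stripe ncol=4 stripeunit=2048 \</span><br /><span style="font-family:courier new;">SNtestdg22 SNtestdg23 SNtestdg24 SNtestdg25</span></span><br /><br />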
All Primary Volumes which need to be backed up using Flashsnap need to have Backup Volumes created in the same way.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">v SNtest05 - ENABLED ACTIVE 556785664 SELECT SNtest05-01 fsgen</span><br /><span style="font-family:courier new;">pl SNtest05-01 SNtest05 ENABLED ACTIVE 556785664 STRIPE 4/2048 RW</span><br /><span style="font-family:courier new;">sd SNtesttdg22-01 SNtest05-01 SNtesttdg22 0 139196416 0/0 c8t40d22 ENA</span><br /><span style="font-family:courier new;">sd SNtesttdg23-01 SNtest05-01 SNtesttdg23 0 139196416 1/0 c8t40d23 ENA</span><br /><span style="font-family:courier new;">sd SNtesttdg24-01 SNtest05-01 SNtesttdg24 0 139196416 2/0 c8t40d24 ENA</span><br /><span style="font-family:courier new;">sd SNtesttdg25-01 SNtest05-01 SNtesttdg25 0 139196416 3/0 c8t40d25 ENA</span><br /></span><br /><span style="font-weight: bold;">Step 4:</span><br /><br />Step 4 is to prepare the Primary and Backup Volumes for Flashsnap. This basically involves creating DCO logs for the volumes. Primary Volumes need to place their DCO logs on a disk belonging to the Primary disks, and similarly Backup Volumes need to place their DCO logs on a disk belonging to the Backup disks. To meet performance requirements, it is ideal to dedicate disks solely to DCO logs. These disks would be reserved and not used in any volume creation.<br />In the example given below, testdg06 is the disk used for the DCO logs of all the Primary Volumes and SNtestdg87 is used for the Backup Volumes.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">root@snaptest:> vxdisk -g testdg list |more</span><br /><span style="font-family:courier new;">DEVICE TYPE DISK GROUP STATUS</span><br /><span style="font-family:courier new;">c2t18d0s2 auto:cdsdisk testdg00 testdg online</span><br /><span style="font-family:courier new;">c2t18d1s2 auto:cdsdisk testdg01 testdg online</span><br /><span style="font-family:courier new;">c2t18d6s2 auto:cdsdisk testdg06 testdg online reserved</span><br /><span style="font-family:courier new;">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span><br /><span style="font-family:courier new;">c8t40d71s2 auto:cdsdisk SNtestdg71 testdg online reserved</span><br /><span style="font-family:courier new;">c8t40d72s2 auto:cdsdisk SNtestdg72 testdg online</span><br /><span style="font-family:courier new;">c8t40d73s2 auto:cdsdisk SNtestdg73 testdg online</span><br /><span style="font-family:courier new;">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span><br /></span>In order to prepare a volume for snap, we run the vxsnap prepare command as shown below:<br /><br /><span style="font-style: italic;">Primary Volume</span><br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxsnap -g testdg prepare test05 regionsize=32k alloc=testdg06</span><br /><br /><span style="font-family:courier new;">root@snaptest:> vxprint -hrt test05</span><br /><span style="font-family:courier new;">Disk group: testdg</span><br /><span style="font-family:courier new;"></span><br /><span style="font-family:courier new;">dm testdg22 c2t18d22s2 auto 2048 139197696 -</span><br /><span style="font-family:courier new;">dm testdg23 c2t18d23s2 auto 2048 139197696 -</span><br /><span style="font-family:courier new;">dm testdg24 c2t18d24s2 auto 2048 139197696 
-</span><br /><span style="font-family:courier new;">dm testdg25 c2t18d25s2 auto 2048 139197696 -</span><br /><br /><span style="font-family:courier new;">v test05 - ENABLED ACTIVE 556785664 SELECT test05-01 fsgen</span><br /><span style="font-family:courier new;">pl test05-01 test05 ENABLED ACTIVE 556785664 STRIPE 4/2048 RW</span><br /><span style="font-family:courier new;">sd testdg22-01 test05-01 testdg22 0 139196416 0/0 c2t18d22 ENA</span><br /><span style="font-family:courier new;">sd testdg23-01 test05-01 testdg23 0 139196416 1/0 c2t18d23 ENA</span><br /><span style="font-family:courier new;">sd testdg24-01 test05-01 testdg24 0 139196416 2/0 c2t18d24 ENA</span><br /><span style="font-family:courier new;">sd testdg25-01 test05-01 testdg25 0 139196416 3/0 c2t18d25 ENA</span><br /><span style="font-weight: bold;font-family:courier new;" >dc test05_dco test05 test05_dcl</span><br /><span style="font-family:courier new;">v test05_dcl - ENABLED ACTIVE 76704 SELECT - gen</span><br /><span style="font-family:courier new;">pl test05_dcl-01 test05_dcl ENABLED ACTIVE 76704 CONCAT - RW</span><br /><span style="font-family:courier new;">sd testdg06-08 test05_dcl-01 testdg06 360944 76704 0 c2t18d7 ENA</span><br /><br /></span><br />Note the DCL plex test05_dcl created after running the vxsnap prepare command.<br /><br /><span style="font-style: italic;">Backup Volume</span><br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxsnap -g testdg prepare SNtest05 regionsize=32k alloc=SNtestdg85</span><br /><br /><span style="font-family:courier new;">root@snaptest:> vxprint -hrt SNtest05</span><br /><span style="font-family:courier new;">Disk group: SNtestdg</span><br /><span style="font-family:courier new;"></span><br /><span style="font-family:courier new;">dm SNtestdg22 c8t40d22s2 auto 2048 139197696 -</span><br /><span style="font-family:courier new;">dm SNtestdg23 c8t40d23s2 auto 2048 139197696 -</span><br /><span style="font-family:courier new;">dm SNtestdg24 c8t40d24s2 auto 2048 139197696 -</span><br /><span style="font-family:courier new;">dm SNtestdg25 c8t40d25s2 auto 2048 139197696 -</span><br /><br /><span style="font-family:courier new;">v SNtest05 - ENABLED ACTIVE 556785664 SELECT SNtest05-01 fsgen</span><br /><span style="font-family:courier new;">pl SNtest05-01 SNtest05 ENABLED ACTIVE 556785664 STRIPE 4/2048 RW</span><br /><span style="font-family:courier new;">sd SNtestdg22-01 SNtest05-01 SNtestdg22 0 139196416 0/0 c8t40d22 ENA</span><br /><span style="font-family:courier new;">sd SNtestdg23-01 SNtest05-01 SNtestdg23 0 139196416 1/0 c8t40d23 ENA</span><br /><span style="font-family:courier new;">sd SNtestdg24-01 SNtest05-01 SNtestdg24 0 139196416 2/0 c8t40d24 ENA</span><br /><span style="font-family:courier new;">sd SNtestdg25-01 SNtest05-01 SNtestdg25 0 139196416 3/0 c8t40d25 ENA</span><br /><span style="font-family:courier new;">dc SNtest05_dco SNtest05 SNtest05_dcl</span><br /><span style="font-family:courier new;">v SNtest05_dcl - ENABLED ACTIVE 76704 SELECT - gen</span><br /><span style="font-family:courier new;">pl SNtest05_dcl-01 SNtest05_dcl ENABLED ACTIVE 76704 CONCAT - RW</span><br /><span style="font-family:courier new;">sd SNtestdg71-01 SNtest05_dcl-01 SNtestdg71 360944 76704 0 c8t40d71 ENA</span></span><br /><br />Using a regionsize of 32k allows for faster resynchronizations though the initial sync may take more time.<br /><br />In order to verify that Fast Resync and Instant Snapshot has been enabled successfully for the volume, the 
below tests can be done.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxprint -g testdg -F%fastresync test05</span><br /><span style="font-family:courier new;">on</span><br /><br /><span style="font-family:courier new;">bash-2.03# vxprint -g testdg -F%instant test05</span><br /><span style="font-family:courier new;">on</span></span><br /><br /><span style="font-weight: bold;">Step 5:</span><br /><br />Step 5 is to run the initial sync, which associates the Backup Volumes with the Primary Volumes and does a mirror copy of the Primary Volume onto the Backup Volume. It is performed using the vxsnap command.<br />In this example, the iosize parameter has been set to 12M, which is aggressive. The default is 1M. Running at 12M will impose a significant overhead on the system while the sync operation runs – especially if synchronizing a large number of volumes at the same time.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxsnap -g testdg -o iosize=12m make source=test05/snapvol=SNtest05</span></span><br /><br />The above command will create a snap record and associate the Primary with the Backup Volume. It will automatically run in the background. Progress can be monitored using vxtask list or vxsnap syncwait.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxsnap -g testdg syncwait SNtest05</span></span><br /><br />The syncwait command will return to the prompt only when the snap process has completed.<br />A vxprint output of the Primary Volume or the Backup Volume will now show the sp record.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">v test05 - ENABLED ACTIVE 556785664 SELECT test05-01 fsgen</span><br /><span style="font-family:courier new;">pl test05-01 test05 ENABLED ACTIVE 556785664 STRIPE 4/2048 RW</span><br /><span style="font-family:courier new;">sd testtdg22-01 test05-01 testdg22 0 139196416 0/0 c2t18d22 ENA</span><br /><span style="font-family:courier new;">sd testtdg23-01 test05-01 testdg23 0 139196416 1/0 c2t18d23 ENA</span><br /><span style="font-family:courier new;">sd testtdg24-01 test05-01 testdg24 0 139196416 2/0 c2t18d24 ENA</span><br /><span style="font-family:courier new;">sd testtdg25-01 test05-01 testdg25 0 139196416 3/0 c2t18d25 ENA</span><br /><span style="font-family:courier new;">dc test05_dco test05 test05_dcl</span><br /><span style="font-family:courier new;">v test05_dcl - ENABLED ACTIVE 76704 SELECT - gen</span><br /><span style="font-family:courier new;">pl test05_dcl-01 test05_dcl ENABLED ACTIVE 76704 CONCAT - RW</span><br /><span style="font-family:courier new;">sd testtdg07-08 test05_dcl-01 testtdg07 360944 76704 0 c2t18d7 ENA</span><br /><span style="font-weight: bold;font-family:courier new;" >sp SNtest05_snp test05 test05_dco</span><br /></span><br /><span style="font-weight: bold;">Step 6:</span><br /><br />Step 6 is to split the Primary diskgroup and separate the Backup Volumes into a different diskgroup. Doing so allows the Backup Diskgroup (Volumes) to be deported and then imported on the Backup/Media Server.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxdg split testdg SNtestdg SNtest05</span></span><br /><br />This command will move the volume SNtest05 to a new diskgroup SNtestdg. 
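<br /><br />As a quick sanity check (illustrative only), the new diskgroup and its contents can be inspected before it is deported:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># Confirm the new diskgroup exists and holds the Backup Volume</span><br /><span style="font-family:courier new;">bash-2.03# vxdg list</span><br /><span style="font-family:courier new;">bash-2.03# vxprint -g SNtestdg -hrt</span></span><br /><br />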
There are multiple ways to do a split: either specify all the volumes in the command, or use the -o expand option.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxdg split testdg SNtestdg SNtest05 SNvol2 SNvol3 etc</span></span><br /><br />or<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxdg -o expand split testdg SNtestdg SNtestdg71<br /><br /></span></span><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxdg deport SNtestdg</span></span><br /><br />The -o expand option will split the Primary Diskgroup and move all volumes that use the snap DCO disk into the SNtestdg diskgroup – in this case the disk is SNtestdg71.<br /><br />Once this is complete, SNtestdg can be deported and then imported on the Media Server for backup to tape.<br /><br /><span style="font-weight: bold;">Step 7</span><br /><br />To perform an incremental refresh of the Backup Volumes from the Primary Volumes, the Backup Diskgroup SNtestdg needs to be imported on the Primary Host, joined to testdg, and the refresh performed.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">bash-2.03# vxdg import SNtestdg<br /><br /></span><span style="font-family:courier new;">bash-2.03# vxdg join SNtestdg testdg</span><br /><span style="font-family:courier new;"><br />bash-2.03# vxvol -g testdg startall</span><br /><span style="font-family:courier new;"><br />bash-2.03# vxsnap -g testdg refresh SNtest05 source=test05</span><br /><span style="font-family:courier new;"><br />bash-2.03# vxsnap -g testdg syncwait SNtest05</span><br /><br /></span><span style="font-weight: bold;">Proper Backup of an Oracle Database – sequence</span><br /><br />In order to take a proper and valid backup of an Oracle database, the sequence of operations below needs to be performed; a condensed scripted sketch is shown first, followed by the step-by-step list. 
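<br /><br />The sketch below is illustrative only: the tablespace names, the archive log volume name (arch01/SNarch01), and the DCO disk name are placeholders, and on a 9i database each tablespace has to be placed in backup mode individually.<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># Import and join the Backup Diskgroup to the Primary Diskgroup</span><br /><span style="font-family:courier new;">vxdg import SNtestdg</span><br /><span style="font-family:courier new;">vxdg join SNtestdg testdg</span><br /><span style="font-family:courier new;">vxvol -g testdg startall</span><br /><br /><span style="font-family:courier new;"># Put the tablespaces (except temporary) into backup mode</span><br /><span style="font-family:courier new;">sqlplus -s "/ as sysdba" &lt;&lt;EOF</span><br /><span style="font-family:courier new;">alter tablespace system begin backup;</span><br /><span style="font-family:courier new;">alter tablespace users begin backup;</span><br /><span style="font-family:courier new;">EOF</span><br /><br /><span style="font-family:courier new;"># Refresh all Backup Volumes, including the archive log volume, and wait</span><br /><span style="font-family:courier new;">vxsnap -g testdg refresh SNtest05 source=test05</span><br /><span style="font-family:courier new;">vxsnap -g testdg refresh SNarch01 source=arch01</span><br /><span style="font-family:courier new;">vxsnap -g testdg syncwait SNtest05</span><br /><span style="font-family:courier new;">vxsnap -g testdg syncwait SNarch01</span><br /><br /><span style="font-family:courier new;"># Take the tablespaces out of backup mode</span><br /><span style="font-family:courier new;">sqlplus -s "/ as sysdba" &lt;&lt;EOF</span><br /><span style="font-family:courier new;">alter tablespace system end backup;</span><br /><span style="font-family:courier new;">alter tablespace users end backup;</span><br /><span style="font-family:courier new;">EOF</span><br /><br /><span style="font-family:courier new;"># Refresh the archive log volume once more to capture the logs written</span><br /><span style="font-family:courier new;"># while the database was in backup mode</span><br /><span style="font-family:courier new;">vxsnap -g testdg refresh SNarch01 source=arch01</span><br /><span style="font-family:courier new;">vxsnap -g testdg syncwait SNarch01</span><br /><br /><span style="font-family:courier new;"># Split the Backup Volumes off and deport the diskgroup for the media server</span><br /><span style="font-family:courier new;">vxdg -o expand split testdg SNtestdg SNtestdg71</span><br /><span style="font-family:courier new;">vxdg deport SNtestdg</span></span><br /><br />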
In order for this to be successful, the archive logs must be placed on a dedicated mount point.<br /><br /><ul><li>Import and join the Backup Diskgroup to the Primary Diskgroup.</li><li>Alter all tablespaces except the temporary tablespace into begin backup.</li><li>Run a refresh of all the volumes including archive logs and wait till it completes.</li><li>Alter all tablespaces except the temporary tablespace into end backup.</li><li>Sync the archive logs (only) mount point again and wait till it completes.</li><li>Split the Backup Volumes into the Backup Diskgroup and deport the Backup Diskgroup.</li><li>Import on the Media server and run the backup to tape.</li></ul><br />The above sequence can be done every day as part of the backup cycle.<br /><br /><span style="font-weight: bold;">Performance Impact to the host during the initial sync and further refreshes</span><br /><br />The entire refresh of 3.2TB took around 6 hours, which works out to an average write throughput of roughly 150MB/sec with a peak of around 250MB/sec. The system was showing an aggregate throughput of 300-500MB/sec (read+write). However, there was considerable load on the system, with a run queue of around 20 for the duration of the sync. Further refreshes for a database of 1.2TB (used size) with a 50% read-write ratio (150GB of redo/day) take around 30 minutes on average.<br /><br /><br /><br /><br />Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-1550840387946350137.post-49694394425588294092008-02-16T10:52:00.000-08:002008-02-16T20:31:47.391-08:00Oracle - OFA and relevance to RAC and modern timesOracle Real Application Clusters (RAC) is gaining huge popularity due to its promise of high performance and availability at a reasonable cost.<br /><br />However, deploying and supporting a RAC environment is no easy task. To support RAC technologies, today’s DBA has to be knowledgeable not just about the database but also about storage, cluster technology and networking. It is not practical to expect that all support staff have the same degree of skill and knowledge.<br /><br />RAC being a Shared Storage solution, great emphasis is placed on suitable Shared Storage and the corresponding file system layout (if using a Cluster File system). The IO attributes of the Shared Storage solution can more or less make or break the solution – especially when deploying RAC for data warehousing.<br /><br />To this effect, I have been working on a document to create a reference configuration of the file system layout in a RAC environment. 
The objectives are a solution which is<br /><ul><li>Easy to implement and support</li><li>High performance</li><li>High Availability</li><li>Easy to back up and restore</li></ul>Oracle’s OFA may not be suitable for an Enterprise environment wherein the database sizes are bigger and support roles are highly compartmentalized. Also, OFA seems to be overly complicated to deploy - for example, why would you ever have multiple versions of Oracle binaries in a production environment?<br /><br />So any proposed solution should be easy to support even for a beginner in RAC technologies. I have more or less completed the document and hope to post it by the end of March.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1550840387946350137.post-38290255523269988512008-01-08T20:47:00.000-08:002008-01-08T22:01:54.377-08:00NFS and Oracle - Mount options - noac, actimeo, forcedirectio, et al.While doing my first install of Oracle RAC on Solaris using NFS as the Clustered Filesystem, I took the opportunity to test the various mount options specified by Oracle (Doc ID: Note:359515.1).<br /><br />Let us take a look at the options specified.<br /><br /><span style="font-weight: bold;">1. Oracle binaries - </span><br /><br />rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,noac,vers=3,suid -<br /><br />I could understand rw,bg,hard,nointr, tcp,suid,vers=3. But noac, rsize and wsize did not make much sense to me.<br /><br />rsize and wsize should be considerably higher than 32K. A Sun NFS server (while not supported by Oracle) supports up to 1MB, and Oracle does not complain if you set them higher. However, noac seems pretty much backward. From the manpage of mount_nfs on Solaris 10,<br /><br /><span style="font-style: italic;">noac</span><br /><br /><span style="font-style: italic;"> Suppress data and attribute caching. The data cach-</span><br /><span style="font-style: italic;"> ing that is suppressed is the write-behind. The</span><br /><span style="font-style: italic;"> local page cache is still maintained, but data</span><br /><span style="font-style: italic;"> copied into it is immediately written to the server.</span><br /><br /><span style="font-style: italic;">Setting the noac option also disables attribute caching, but has the further effect of disabling client write caching. While this guarantees that data written by an application is written directly to a server, where it can be viewed immediately by other clients, it has a significant adverse effect on client write performance. Data written into memory-mapped file pages (mmap(2)) are not written directly to this server.</span><br /><br />noac makes sense if you have a single Oracle home for all your RAC instances. I wonder how many DBAs would install RAC that way. It takes away high-availability as an option completely.<br /><br />The performance impact of noac is significant - install/patch operations take forever as every write has to be synced to disk before the next can proceed.<br /><br />I started installing RAC with noac and found it incredibly slow going. I did some testing, and this is what I found when unzipping the 1.2GB 10.2.0.3 patchset for the DB (Solaris) using the various options specified. Unzipping tests both the read and write operations. 
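<br /><br />For reference, each timing below can be reproduced with a simple remount-and-unzip test; the sketch assumes a hypothetical NFS server name, export path, mount point and patchset file name:<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># Remount the staging file system with the option set under test</span><br /><span style="font-family:courier new;">umount /orastage</span><br /><span style="font-family:courier new;">mount -F nfs -o rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288,noac \</span><br /><span style="font-family:courier new;">nfssrv:/export/orastage /orastage</span><br /><br /><span style="font-family:courier new;"># Time the unzip of the patchset</span><br /><span style="font-family:courier new;">cd /orastage</span><br /><span style="font-family:courier new;">time unzip -q 10203_patchset.zip</span></span><br /><br />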
While the system I was testing is a 280R (a relic), it is still a good system for testing purposes.<br /><br /><span style="font-size:100%;"><span style="font-weight: bold;">Using noac (oracle recommendation)</span><br /></span><span style="font-size:100%;"><br />rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288,noac, - 33 minutes, 34 secs<br /></span><span style="font-size:100%;"><br /><span style="font-weight: bold;">Using defaults (without noac, forcedirectio or actimeo)</span><br /><br />rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288 - 7 minutes,44 secs<br /></span><br />I also tested with forcedirectio and actimeo. From the manpage of mount_nfs,<br /><br /><span style="font-style: italic;">forcedirectio | noforcedirectio</span><br /><br /><span style="font-style: italic;"> If forcedirectio is specified, then for the duration</span><br /><span style="font-style: italic;"> of the mount, forced direct I/O is used. If the</span><br /><span style="font-style: italic;"> filesystem is mounted using forcedirectio, data is</span><br /><span style="font-style: italic;"> transferred directly between client and server, with</span><br /><span style="font-style: italic;"> no buffering on the client. If the filesystem is</span><br /><span style="font-style: italic;"> mounted using noforcedirectio, data is buffered on</span><br /><span style="font-style: italic;"> the client. forcedirectio is a performance option</span><br /><span style="font-style: italic;"> that is of benefit only in large sequential data</span><br /><span style="font-style: italic;"> transfers. The default behavior is noforcedirectio.</span><br /><br /><span style="font-style: italic;">actimeo=n</span><br /><br /><span style="font-style: italic;"> Set min and max times for regular files and direc-</span><br /><span style="font-style: italic;"> tories to n seconds. 
See "File Attributes," below,</span><br /><span style="font-style: italic;"> for a description of the effect of setting this</span><br /><span style="font-style: italic;"> option to 0.</span><br /><br /><span style="font-style: italic;"> Setting actimeo=0 disables attribute caching on the client.</span><br /><span style="font-style: italic;"> This means that every reference to attributes is satisfied</span><br /><span style="font-style: italic;"> directly from the server though file data is still cached.</span><br /><span style="font-style: italic;"> While this guarantees that the client always has the latest</span><br /><span style="font-style: italic;"> file attributes from the server, it has an adverse effect on</span><br /><span style="font-style: italic;"> performance through additional latency, network load, and</span><br /><span style="font-style: italic;"> server load.</span><br /><br /><span style="font-weight: bold;">Using actimeo=0</span><br /><div style="text-align: left;"><span style="font-size:85%;"><br /><span style="font-size:100%;">rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288,actimeo=0 - 12 minutes,28 secs</span></span><br /><br /><span style="font-weight: bold;">Using forcedirectio and actimeo=0</span><br /><span style="font-size:85%;"><br /><span style="font-size:100%;">rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288, forcedirectio, actimeo=0 - 18 minutes,12 secs</span></span><span style="font-size:100%;"><br /></span></div><br /><span style="font-weight: bold;">Using forcedirectio and noac</span><br /><span style="font-size:85%;"><br /><span style="font-size:100%;">rw,bg,vers=4,proto=tcp,hard,intr,rsize=524288,wsize=524288, forcedirectio, noac - 19 minutes,10 sec<br /></span></span><br />Obviously the winner is to enable write caching on the client (the default options). If you are sharing Oracle homes, then you would need to enable noac to ensure that writes are consistent and all attributes are referred back to the NFS server.<br /><br />However, if the Oracle homes are independent, I would assume it is perfectly safe to use the default options (albeit with higher rsize and wsize values).<br /><br />Now moving on to the data files, the options specified are:<br /><br /><span style="font-weight: bold;">2. Oracle Data Files</span><br /><br />rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,noac,forcedirectio, vers=3,suid<br /><br />Again, except for noac and forcedirectio, the other options are okay. What is odd is that forcedirectio and noac are doing the same thing (direct writes to the files bypassing the buffer cache).<br /><br />noac also disables attribute caching.<br /><br />A more sensible combination would be forcedirectio and actimeo=0: forcedirectio enables direct I/O and actimeo=0 disables file attribute caching. noac is backward, and I wonder why Oracle still insists on it. The problem is that if you do not enable noac, then your instance will not start (it complains about the NFS options). 
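<br /><br />In practice this means the data-file mounts end up carrying the Oracle-mandated options, for example as an /etc/vfstab entry along these lines (the NFS server name and export path are hypothetical; the mount point matches the error messages shown below):<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;"># device_to_mount device_to_fsck mount_point FS_type fsck_pass mount_at_boot options</span><br /><span style="font-family:courier new;">nfssrv:/export/RACS/und1 - /RACS/und1 nfs - yes rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,forcedirectio,vers=3,suid</span></span><br /><br />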
While you can do all the installs, dbca will fail to start up the instance, with errors like the ones below.<br /><br /><span style="font-style: italic;">WARNING:NFS file system /RACS/und1 mounted with incorrect options</span><br /><span style="font-style: italic;">WARNING:Expected NFS mount options:rsize=32768,wsize=32768,hard,noac</span><br /><span style="font-style: italic;">Thu Dec 27 12:36:17 2007</span><br /><span style="font-style: italic;">Errors in file /RACS/orabase/racshome/admin/bdump/racs1_dbw0_17195.trc:</span><br /><span style="font-style: italic;">ORA-01157: cannot identify/lock data file 2 - see DBWR trace file</span><br /><span style="font-style: italic;">ORA-01110: data file 2: '/RACS/und1/undotbs01.dbf'</span><br /><span style="font-style: italic;">ORA-27054: NFS file system where the file is created or resides is not mounted with correct options</span><br /><br />Now for the options for CRS and voting disks:<br /><br /><span style="font-weight: bold;">3. CRS and Voting disks</span><br /><br />rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,vers=3,noac, forcedirectio<br /><br />Same argument as before - noac and forcedirectio are doing the same thing. It would be better to use forcedirectio and actimeo=0.<br /><br />Oracle seems to have gotten it right on Linux with actimeo=0; however, I do not see a forcedirectio option for NFS there (neither v3 nor v4).Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-1550840387946350137.post-42960455686317113262007-11-17T10:45:00.001-08:002007-11-18T17:55:40.233-08:00NAS devices for Oracle - Netapp versus HP Enterprise File Services Clustered GatewayContinuing my earlier topic about Oracle and NAS, the ideal NAS device for Oracle would support just about the following (from a tactical perspective).<br /><ul><li>Supported by Oracle<br /></li><li>Symmetric active controllers so that load-balancing, linear scaling and HA are possible.</li><li>10G Ethernet capabilities with jumbo frames.<br /></li><li>Supports both Unix and Linux clients.<br /></li><li>Support for Direct and Async I/O.</li><li>Ability to map LUNs as an initiator (leverage existing SAN infrastructure).<br /></li><li>Direct I/O to LUNs bypassing the NAS cache as an option, so the NAS basically functions as a gateway device doing protocol conversion only.</li><li>Snapshots at the NAS level and not at the SAN level. This gives the ability to use different storage for backups and provides better control.</li><li>Ability to carve volumes from the LUNs in any manner requested by the end user (Raid 0, Raid 10, Raid 5, etc.), with end-user-specified stripe widths.<br /></li><li>Set the pre-fetch and write-back policies for the NFS filesystems on the NAS head.</li><li>Fine-tune the I/O aspects of both NAS and SAN components.<br /></li><li>Last but not least - easy to support and maintain.</li></ul>It goes without saying that any NAS should have a decent I/O scheduler sitting in front of the LUNs which aggressively sorts/merges I/O to reduce disk head movements.<br /><br />While Netapp cannot do much of the above, there is a new player in the market - the <a href="http://h18006.www1.hp.com/products/storageworks/efs/">HP Enterprise File Services Clustered Gateway</a>.<br /><br />The HP solution for Unix is built on the Suse Linux platform using the Polyserve Cluster Filesystem. It is built to serve as a NAS head utilizing HP branded Storage Arrays (I assume they would support other Arrays - as long as drivers exist for Suse Linux). 
I personally feel this is a brilliant idea - marrying a cluster filesystem/volume manager with an OS and making a NAS head which can scale horizontally. The OS gives manageability plus SAN capabilities out of the box, and the cluster filesystem/volume manager gives horizontal scalability and custom volume management. While Linux is not the best performer when it comes to networking (compared to Netapp or Solaris), with all the other goodness thrown in, it is a mighty force indeed.<br /><br />It meets most of the requirements shown above; however, I have not yet had the opportunity to test it in a live environment. I think this would be a great product to watch out for.Unknownnoreply@blogger.com0