Monday, September 21, 2009

Download Link for the "Storage Design for Datawarehousing" presentation

It appears that Scribd does not allow anonymous downloads anymore. Hence, on request, the "Storage Design for Datawarehousing" presentation can be downloaded in PDF format here.

Monday, July 20, 2009

Map Disk Block devices on a Linux host back to the SAN LUN

There was a question posted on Oracle-L - how can one map disk block devices (on Linux) back to the SAN LUNs? This is a no-brainer on Solaris by virtue of the persistent binding capabilities on the HBA along with hard-coding the LUNs in sd.conf, but Linux is a different ball game.

Most of you would wonder - why do I care about this mapping? You need to, because as a DBA, if you are seeing high latencies for specific data file(s), you need to know which LUN on the array the block device (containing the data file) ties back to. SAN admins work with LUNs, not with system block devices.

In Linux, the block devices are named starting with "sd" with little information as to

1. Specific Adaptor/Port(HBA) through which it is mapped.

2. Array on which the LUN is carved out.

3. The specific LUN number on the Array

While it is not impossible to find out the above, there is no standard interface which reports this information (I may be mistaken - there may be a tool out there which I am not aware of).

On Red Hat 4.x and higher versions, running kernel 2.6.x and higher, the sysfs implementation is standard. The /sys filesystem holds the key to identifying the above information.

Unfortunately, the /sys filesystem is not well documented and is full of symlinks pointing back and forth. In short, it is a maze. But it is Linux, so we are not surprised - we have to devise our own methods to overcome these shortcomings.

Let us use an example to illustrate how it can be done. Let us assume that our database is showing high latencies for datafiles residing on the block device sdf.

First you need to identify how many HBAs you have on the system and how many ports/HBA
(HBAs come in single port or dual port configurations).

You can identify this in multiple ways - I prefer to use lspci

$ /sbin/lspci |grep Fibre


0f:00.0 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02) -------------- > HBA 1 Port 1

0f:00.1 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)
-------------- > HBA 1 Port 2

11:00.0 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)
-------------- > HBA 2 Port 1

11:00.1 Fibre Channel: Emulex Corporation Zephyr-X LightPulse Fibre Channel Host Adapter (rev 02)
-------------- > HBA 2 Port 2

As you can see, it shows that there are 2 Dual Port Emulex HBAs installed in the system. LUNs
can be visible to Linux through any or all of the 4 available ports.

So the block device sdf is visible to the OS via one of these 4 available ports (paths). Next, we need to identify the specific port through which the LUN behind sdf is presented to the OS.

For that, we change directories to /sys/block and run a find.

$ cd /sys/block

$ find . -name device -exec ls -l {} \; |grep sdf

lrwxrwxrwx 1 root root 0 Jul 21 17:25 ./sdf/device -> ../../devices/pci0000:00/0000:00:04.0/0000:0f:00.0/host1/target1:0:0/1:0:0:4


or another way is to

$ cd /sys/block/sdf

$ ls -l device
lrwxrwxrwx 1 root root 0 Jul 21 17:25 device ->
../../devices/pci0000:00/0000:00:04.0/0000:0f:00.0/host1/target1:0:0/1:0:0:4

As you can see from the above, sdf is visible to the OS via the HBA port 0f:00.0, which, as we saw earlier, is HBA 1 Port 1. This output has several more interesting tidbits of information.

lrwxrwxrwx 1 root root 0 Jul 21 17:25 device -> ../../devices/pci0000:00/0000:00:04.0/0000:0f:00.0/host1/target1:0:0/1:0:0:4

The trailing 1:0:0:4 is the SCSI address in host:channel:target:lun form - so the target is 0 (also visible in target1:0:0) and the LUN number is 4.
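If you need to do this for more than one device, the walk through /sys can be scripted. Below is a minimal sketch (a throwaway script, not a polished tool) that assumes the same symlink layout shown above; the awk field position may need adjusting if your device paths have a different depth.

$ cat map_sd_devices.sh
#!/bin/bash
# Sketch: print each sd block device with the PCI address of the HBA port and
# the SCSI host:channel:target:lun it maps to, by following /sys/block/sdX/device.
for dev in /sys/block/sd*; do
  link=$(readlink "$dev/device")    # e.g. ../../devices/pci0000:00/0000:00:04.0/0000:0f:00.0/host1/target1:0:0/1:0:0:4
  hctl=$(basename "$link")          # host:channel:target:lun, e.g. 1:0:0:4
  pci=$(echo "$link" | awk -F/ '{print $(NF-3)}')   # HBA port PCI address, e.g. 0000:0f:00.0
  echo "$(basename "$dev")  HBA port $pci  H:C:T:L $hctl"
done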

All that remains now is to identify the array which is presenting LUN 4 to the OS. This information is available from the /proc interface.


$ cd /proc/scsi/lpfc

$ ls
1 2 3 4

Here the numbers 1-4 correspond to the 4 HBA ports. We have identified that the LUN behind sdf is seen through HBA 1 Port 1, so we look into the contents of "1". We also know it is target 0.

$ more 1
lpfc0t01 DID 610413 WWPN 50:06:0e:80:00:c3:bd:40 WWNN 50:06:0e:80:00:c3:bd:40
lpfc0t02 DID 610a13 WWPN 50:06:0e:80:10:09:d0:02 WWNN 50:06:0e:80:10:09:d0:02
lpfc0t00 DID 612413 WWPN 50:06:0e:80:10:09:d0:07 WWNN 50:06:0e:80:10:09:d0:07

We see that Target 0 is mapped to WWPN ending in 07. This is the Array WWPN from which the Storage Admin can identify the specific Array which is presenting LUN 4 (sdf) to the host.

So we have identified that the block device sdf is

1. Visible to the OS via HBA 1 Port 1.

2. Target 0 for the HBA.

3. LUN 4 as presented to the OS by the Array.

4. The array that is presenting LUN 4 to the OS is 50:06:0e:80:10:09:d0:07.

With this above information, your Storage Admin will get a headstart on fixing the problem for
you.
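Putting the whole chain together, here is a rough sketch of how the lookup could be automated for a given device. It assumes an Emulex lpfc driver, single-digit target numbers, and that the /proc/scsi/lpfc file number matches the SCSI host number as it did on this system - verify that before trusting the output.

$ cat whereis_lun.sh
#!/bin/bash
# Sketch: map a block device (e.g. sdf) to its HBA port, target and LUN and,
# for Emulex lpfc HBAs, the WWPN of the array port presenting the LUN.
DEV=${1:?usage: whereis_lun.sh sdX}
LINK=$(readlink /sys/block/$DEV/device)
HCTL=$(basename "$LINK")                  # host:channel:target:lun
HOST=$(echo "$HCTL" | cut -d: -f1)
TARGET=$(echo "$HCTL" | cut -d: -f3)
LUN=$(echo "$HCTL" | cut -d: -f4)
PCI=$(echo "$LINK" | awk -F/ '{print $(NF-3)}')
echo "$DEV -> HBA port $PCI, SCSI host $HOST, target $TARGET, LUN $LUN"
# lpfc lists its targets as lpfc<instance>t<NN> along with the WWPN/WWNN
grep -i "t0$TARGET " "/proc/scsi/lpfc/$HOST" 2>/dev/null

On the example system above, the output would look something like:

$ ./whereis_lun.sh sdf
sdf -> HBA port 0000:0f:00.0, SCSI host 1, target 0, LUN 4
lpfc0t00 DID 612413 WWPN 50:06:0e:80:10:09:d0:07 WWNN 50:06:0e:80:10:09:d0:07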

Monday, July 6, 2009

Storage Design for Datawarehousing

With all the talk about Exadata and Netezza going around, I gave a presentation to our team on conventional storage design for data warehousing. For a small to medium corporate DW, conventional storage arrays perform adequately well (if properly designed).

The slides are embedded via Scribd and do not reproduce well as plain text, so only an outline is kept here; the full deck ("Storage Subsystem Design for Datawarehousing - Array, Drive and RAID Selection") can be downloaded as a PDF via the link in the September 21 post above. It covers:

  • IO fundamentals and the key IO metrics: IOPS, IO request size, latency and bandwidth.
  • Profiling read and write IO in a data warehouse, and why capacity-centric (rather than application-centric) storage design falls short.
  • A requirements-driven design process, with example numbers for a ~5TB corporate DW (peaks of roughly 8375 IOPS, 1.8 GB/sec bandwidth, 5-30ms latency and a 1024K maximum IO size).
  • Evaluating a Hitachi AMS1000 modular array against those requirements: front-end ports, cache, internal bandwidth and viable RAID levels.
  • Drive selection (FC vs SATA/SAS, capacity vs IOPS/GB, performance vs active surface usage) and the RAID 1 / RAID 5 / RAID 10 trade-offs.
  • VxBench test results for various drive and RAID combinations, where RAID 1 LUNs concatenated on the Linux host outperformed RAID 10 (host striping over RAID 1 LUNs).
  • The conclusion: roughly 5 AMS1000 arrays and 410 x 146GB 15K RPM drives (450 including spares) were needed to meet the performance requirements, whereas a capacity-only design would have stopped at a single array with 165 drives.

Sunday, March 8, 2009

Temporary Tablespaces on Oracle - going the extra mile

During a recent performance debug session on a Linux 10.2.0.3 RAC cluster, I noticed that reads/writes to the temp tablespace were occurring at the rate of 240K/IOP. Even though direct path reads were happening at 1024K/IOP, IOP size to temp tablespaces did not exceed 240K.

I noticed that IO sizes varied between 48K and 240K. A 10046 trace also revealed that Oracle was issuing reads/writes at a maximum of 15 blocks and a minimum of 3 blocks. The db block size was 16K, and thus 15 blocks equals a 240K IOP and 3 blocks equals 48K.

At first, I thought this was a limitation of the raw volumes which we were using for the temporary tablespaces, but a test using dd against the raw volumes showed that we could go up to 4M/IOP without the IO request getting broken up (4M was the limit set at the volume level).
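The dd check would look something like the below (a sketch - the raw device and underlying sd device names are placeholders, not the actual ones from our cluster). iostat's avgrq-sz column is in 512-byte sectors, so an unsplit 4M request shows up as 8192.

$ dd if=/dev/raw/raw1 of=/dev/null bs=4096k count=256 &
$ iostat -x 5 3 | grep sdf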

I opened a TAR with Oracle and it did not get me anywhere. I found it odd that Oracle would use 15-block-sized IOPs - 15 is an odd number and not Oracle's style. By sheer luck, I was going through the hidden parameters list and came across 2 parameters:

_smm_auto_max_io_size 240 Maximum IO size (in KB) used by sort/hash-join in auto mode

_smm_auto_min_io_size 48 Minimum IO size (in KB) used by sort/hash-join in auto mode

I had found the problem - Oracle caps the IO size for hash-joins and sorts that spill to disk at 240KB, with a minimum of 48K. It fit perfectly with the behavior I noticed.

Note that the parameter description mentions "in auto mode" - so if your work area policy is set to auto, these settings take effect. The older parameters - _hash_multiblock_io_count and _sort_multiblock_read_count - have no effect when using the auto work area policy, and probably still control the hash/sort behaviour in manual mode.
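If you want to confirm what your instance is using, the usual hidden-parameter query against x$ksppi/x$ksppcv (run as SYS) shows the current values - a sketch:

$ sqlplus -s "/ as sysdba" <<'EOF'
set linesize 120
col name format a25
col value format a8
col description format a60
select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm in ('_smm_auto_max_io_size', '_smm_auto_min_io_size');
EOF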

To me, 240K and 48K are on the lower end of a suitable IOP size. I can understand why Oracle keeps the values this low - IO to the temp tablespaces (if shared) is very much random, and bigger IOPs could slow the operation down. But if your temp tablespaces are well designed, then you could easily use bigger IOP requests.

With bigger requests, you reduce the number of waits. But there is a catch - you end up spending more time on each wait. There will be a sweet spot where the net result works in your favour.

Our Linux RAC cluster was using temporary tablespace groups (4 tablespaces) and had a well designed IO subsystem. Each temp tablespace had a single temp file (raw volume) of 62GB and there were 4 such temp tablespaces. Being a 4 node RAC cluster, we had configured 4 tablespaces assuming that each node would prefer a separate tablespace (node affinity).

Each of the raw volumes was a Raid 1+0 configuration using 4*143GB, 15K RPM drives. The array was a Hitachi AMS1000 with 16GB of cache, and the stripe width was 1M on the Raid 1+0 volume. A lot of space was wasted; however, we were interested in performance, so space was not a concern. So ~250GB of temp space was laid out across 16 drives (143GB, 15K RPM FC).

In order to test the performance with different IOP sizes, I took one of our "Taste the Thunder" SQLs, which inserts ~460 million rows after doing some large hash-joins on tables with a billion-plus rows. Normally this SQL takes around 27 minutes to complete. We do parallel DML and the insert is split across all 4 nodes.
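Changing the values between runs looks something like the below (a sketch - these are underscore parameters, so test systems only, and check with Oracle Support before touching them in production; setting them in the spfile and bouncing the instances is the conservative route rather than relying on them being dynamically changeable):

$ sqlplus -s "/ as sysdba" <<'EOF'
-- Example: the 1024K/512K combination described below
alter system set "_smm_auto_max_io_size"=1024 scope=spfile sid='*';
alter system set "_smm_auto_min_io_size"=512  scope=spfile sid='*';
EOF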

I ran the SQL with the default settings of 240K/48K, then with the max/min changed to 1024K/512K, and finally with the max/min set to 1024K/1024K.


The below summarizes the findings:

The best results were when we set the max/min to 1024K/512K - an improvement of 22% over the default.

As expected, with increasing IOP sizes the number of waits reduces, but the average wait time increases. The sweet spot in our case was a max/min setting of 1024K/512K.

With a max/min setting of 1024K/1024K, even though the reads were very fast, the write times increased dramatically - probably because the array did not have enough cache to handle such large writes.

As always, it is best to test and see how it affects performance.

Saturday, March 7, 2009

Linux, Oracle RAC and the Bonding conundrum

I recently ran into a problem wherein, after patching a 4 node Linux RAC cluster, 2 out of 4 instances would not start up. I could not start the instances manually either.

Both querying CRS and trying to restart it hung.

On closer look, the crsd.log had entries like

------------------------------------------------------
2009-03-05 23:38:28.600: [ CRSRTI][2541399584]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2009-03-05 23:38:29.814: [ COMMCRS][1084229984]clsc_connect: (0xb76f00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_tus1dwhdbssex04_CODSS))

2009-03-05 23:38:29.814: [ CSSCLNT][2541399584]clsssInitNative: connect failed, rc 9


------------------------------------------------------
I checked the OCSS log and it had entries like the below:

-------------------------------------------------------
[ CSSD]2009-03-05 23:40:01.566 [1273047392] >ERROR: clssgmSlaveCMSync: reconfig timeout on master 1
[ CSSD]2009-03-05 23:40:01.566 [1273047392] >TRACE: clssgmReconfigThread: completed for reconfig(16), with status(0)
[ CSSD]2009-03-05 23:44:31.610 [2538397376] >ERROR: clssgmStartNMMon: reconfig incarn 16 failed. Retrying.
----------------------------------------------------------


Searching Metalink showed no hits, and neither did Google.

The CRS alert log showed that the voting disks were online, and the system showed that the interconnects were up. The cluster was set up to use network interfaces which were bonded together in active-passive mode (mode=1).

I tried the usual methods - deleting the pipes in /var/tmp/.oracle and restarting the systems a couple of times - but it did not fix the problem. Any attempt to restart crsd from /etc/init.d/init.crsd or crsctl failed; the commands would simply hang, and so the nodes had to be force restarted, either by reboot or by killing the crs/ocss daemons.

Finally, I checked the active member of the bond interface used for the cluster interconnect and found that on 2 of the nodes, the active interface was different from the other 2. You can identify the active interface by looking at the /proc/net/bonding/bond<N> file (bond1 in our case).

---------------------
[oracodss@tus1dwhdbssex02] /apps/oracle/home/crs10203/log/tus1dwhdbssex02$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth5
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:1c:23:bf:6e:73

Slave Interface: eth5
MII Status: up
Link Failure Count: 5
Permanent HW addr: 00:15:17:49:75:33
-------------------------------------------------

For the 2 nodes which were up, eth5 was the currently active slave, whereas on the 2 down nodes, eth1 was active.

So I deleted all the pipes in /var/tmp/.oracle, rebooted the 2 down nodes and changed the active slave on these 2 nodes to eth5. The instances came back up immediately.

You can change the active slave on a running bond using ifenslave, or make the choice persistent by hard-coding the primary interface when the bonding driver is loaded.

# ifenslave -c bond1 eth5

------modprobe.conf-----------
install bond1 /sbin/modprobe bonding -o bond1 miimon=100 mode=1 primary=eth5
--------------------------------

Active-passive mode is a common bonding method in which there is no load balancing, only failover (in case a link fails). In a typical HA environment, each interface in the bond is cabled to a different switch, which provides switch-level failover capabilities as well. In active-passive mode, the passive link does not even ARP, so you will not see any MAC address on the switch port for that interface. There is complete isolation between interface 1 and interface 2.

Imagine a scenario wherein Node A uses eth1 as its active slave and Node B uses eth5. Even though the nodes are cabled to the same switches and the links are up/active, Node A will not be able to communicate with Node B over the interconnect.
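A quick way to catch this before it bites is to compare the active slave across all nodes before starting CRS - a sketch (node names are illustrative):

$ for node in node1 node2 node3 node4; do echo -n "$node: "; ssh $node "grep 'Currently Active Slave' /proc/net/bonding/bond1"; done
node1: Currently Active Slave: eth5
node2: Currently Active Slave: eth5
node3: Currently Active Slave: eth1
node4: Currently Active Slave: eth1

A mismatch like the above (which is what our cluster looked like) means the interconnect is effectively partitioned even though every link shows as up.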

Here is the funny part - if you set the cluster interconnect to point to a non-existent interface, I have seen Oracle use the public interface for cluster communications. You will see entries in the instance alert logs mentioning that the cluster interconnect is non-existent/incorrect and that the public interface is being used instead.

However, if you point the cluster interconnect at an interface that is available and up but cannot reach the other nodes, Oracle does not fall back to the public interface (even when the voting disks are online/available). Maybe this is a bug? Not surprising at all.

One would expect that a failure of the interconnect while the voting disks are online/available would produce more meaningful errors than the cryptic messages that fill up the CRS logs. This is another kind of split-brain condition and should probably be well documented.
Maybe Oracle needs to learn from VCS and other HA products which have been in the market longer and are more stable.

Saturday, February 28, 2009

UltraSparc T1 versus Ultrasparc IV+

There was an interesting thread on orafaq about poor Oracle performance on a T2000/T5240 versus an older V440. I did test the T2000 when it was first released (2005?) for one of our data warehouse Oracle databases and found it to be orders of magnitude slower than the UltraSparc IV+, but I never really ventured to find out why it was slow.

The thread on orafaq piqued my interest and I decided to do some simple tests. I did not use Oracle or any compute-intensive apps, since the T2000 is handicapped with respect to floating point capabilities. While it is well known that the Tx series chips are not designed for long-running single-threaded processes, I wanted to see exactly why they do not perform as well as other CPUs.

The Specs

The differences in CPU are plenty, but the below would probably be of interest:

0. Cores - The UltraSparc IV+ CPU is essentially 2 UltraSparc III cores bolted together as a dual-core CPU. The T1 in the T2000 has 8 cores in a single CPU; each core supports 4 HW threads, with only 1 thread per core running at a time.

The OS sees each HW thread as an individual CPU and schedules processes (LWPs) on them. Internally, each T1 core switches between its HW threads every clock cycle, and also whenever a thread stalls due to memory latency. Compared with a conventional CPU, it would be fair to say that 8 cores = 8 CPUs (and not 32 CPUs), since only 1 HW thread can be active on a core at any given point in time.

1. Core Speeds - The UltraSparc IV+ runs at 1.5GHz whereas the T1 runs at 1.2GHz.

2. Pipeline - An UltraSparc IV+ core has a 14-stage pipeline and is 4-way superscalar, whereas the T1 core has a 6-stage pipeline and a scalar design. The T1 core supports 4 threads but executes only 1 at a time, switching threads every cycle (as long as there is more than 1 runnable thread) or whenever a thread stalls.

3. Cache -
  • L1 Cache/core- 64K/64K (I/D) on UltraSparc IV+ whereas 16K/8K (I/D) on the T1.
  • L2 Cache/CPU - 2M shared between 2 cores on UltraSparc IV+ whereas 4M shared between 8 cores (32 threads) on T1.
  • L3 Cache/CPU - 32M on UltraSparc IV+ versus none on T1.
The test

I did a simple word count (wc -l) of a 2GB file on both a T2000 (1 CPU @ 1.2GHz with 8 cores, 32 threads) and a V490 (4 UltraSparc IV+ CPUs @ 1.5GHz with 2 cores each). I also tested on a V240 (2 UltraSparc III CPUs at 1GHz). The UltraSparc IV+ is really 2 UltraSparc III CPUs bolted together, so both the UltraSparc IV+ and III have the same pipeline, with differences in the cache levels.

In order to level the playing field, I did the following.

0. Created processor sets on all the 3 systems.
  • T2000 - Set 1 with 1HW thread from 1 core. Set 2 with the remaining HW threads from the same core.
  • V490 - Set 1 with 1 core from 1 CPU. Set 2 with the 2nd core on the same CPU.
  • V240 - Set 1 with 1 CPU.
1. Bound the test process to processor set 1 on all the systems. I did this to eliminate/reduce inter-processor cross calls and thread migrations; the intention was to ensure the L1/L2 cache on the core served only this thread and to reduce cache misses.

2. By creating processor set 2, Solaris will not run any other process on this set unless specifically instructed to do so. This ensures that the core on the T2000 (and the CPU on the V490) is completely dedicated to servicing the test process.

3. Changed the process to the FX class and increased the time quantum to 1000ms. This is to eliminate involuntary context switching and make sure the process gets sufficient time to complete its activities.

4. Disabled interrupts on processor sets 1 and 2 on all systems. This prevents the bound core/CPU from servicing interrupts while the test process is running. (A sketch of the corresponding Solaris commands is below.)
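For reference, the Solaris commands behind these steps look roughly like the below (run as root; the CPU IDs, set IDs and PID are illustrative for the T2000 - the V490/V240 sets used core/CPU IDs instead, and psrset prints the actual id of each set it creates):

# step 0: create set 1 with HW thread 1, and set 2 with the remaining threads of the same core
psrset -c 1
psrset -c 2 3 4
# step 1: bind the test process (PID 12345 is illustrative) to set 1
psrset -b 1 12345
# step 3: move the process to the FX class with a 1000ms time quantum
priocntl -s -c FX -m 0 -p 0 -t 1000 -i pid 12345
# step 4: disable interrupts on the CPUs in both sets
psradm -i 1 2 3 4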

Monitoring

I captured stats using the below:

1. mpstat - To capture CPU stats

2. cputrack - To capture instruction counts, CPU ticks and L2 data cache misses.
  • For T2000 - cputrack -evf -t -T1 -c pic0=L2_dmiss_ld,sys,pic1=Instr_cnt,sys -p
  • For V490 - cputrack -evf -t -T1 -c pic0=L2_rd_miss,sys,pic1=Instr_cnt,sys -p
  • For V240 - cputrack -evf -t -T1 -c pic0=EC_rd_miss,sys,pic1=Instr_cnt,sys -p
3. ptime - To capture the time taken to do the word count (wc -l). The command was
  • ptime wc -l test_file

The results

As expected, the UltraSparc IV+ CPU was 3.5 times faster than the T1. The UltraSparc III CPU was 2 times faster than the T1.




Notice the instructions processed per cycle - all else being equal, the instructions processed per CPU cycle determine how fast your application will run.

In the tests, the T2000 lags considerably behind the other systems. This could be due to the fact that it has only a 6-stage scalar pipeline, whereas the others have 14-stage superscalar pipelines.

Ideally, a CPU would process instructions at the rate of 1 instruction per cycle (assuming there is no stall due to memory latency). In reality this does not happen, and CPU designers get around it with various tricks such as deep pipelines, superscalar architectures, big caches etc.

Surprisingly, even though the T1 chip has a smaller L2 cache than the UltraSparc III, it shows fewer cache misses per cycle. This is likely because it is processing fewer instructions per second - when you look at the L2 data misses per instruction, it becomes evident that this is indeed the reason.

For single-threaded, long-running processes, a CPU with a deep pipeline and superscalar architecture will be the best fit. For heavily multithreaded workloads, wherein each thread does a small amount of work and exits, the T2000 will scale better than the UltraSparc IV+.

In either case, for moderate workloads with any application where latency is important, the UltraSparc IV+ or Intel/AMD CPUs will be the best fit. As you increase the workload and expect scalability to be sustained, the T2000 will provide more linear scalability than the UltraSparc IV+ platform.

A T2000 is like a truck - it can carry light, heavy or super-heavy loads at 60MPH. Whatever the load, it will run only at 60MPH.

An UltraSparc IV+ or similar CPU is like a Porsche or a BMW - no load to medium loads at speeds well above 60MPH, but as the load increases, the speed drops and ultimately hits 0MPH (evident with x86).

I have used the T2000 for Tibco Business Works with great success. It fits the bill perfectly for this application. However, for datawarehousing, it would be a very poor fit.