Tuesday, September 18, 2007

RAID 5 Versus RAID 10 for Data Warehousing

With the considerable pressure these days to cut costs and meet Wall Street demands, RAID 5 has grown in popularity. On the surface, RAID 5 offers redundancy and reasonable performance, depending on the profile of the application. There is an increasing push to use RAID 5 for warehousing too.

However, as someone famous once said:

"Money shows us the price of everything but never the real cost" - so befitting RAID 5.

Cost is measured over the lifetime of the solution and can be equated to the TCO. Cost is expecting a report to run in 5 minutes but having it take 30. Cost is the time lost waiting those extra 25 minutes. Unfortunately, today's Wall Street gurus are short-sighted: cheap today = money saved today = 10 times that money lost over 3 years. Penny wise and pound foolish!

RAID 5 is ideal in a scenario with small writes and small reads, where all activity on the array is more or less random. Small here means I/O sizes of roughly 4K to 32K; the number of IOPS you can sustain depends on the controller's capability. The theory is that the array caches the writes and then de-stages them to the disks. Similarly, small reads (if perceived to be sequential) are pre-fetched by the controllers, and since the reads are small, there is no big penalty for slower reads.

The above profile can be classified as that of a classic OLTP database.

Take the case of a DW - typical load patterns are:
  1. Large random reads - I/O sizes vary from 128K to 1M, at anywhere between 500 and 10,000 IOPS. It is fairly common to see throughput ranging from 350MB/sec to 600MB/sec during periods of heavy activity.
  2. Large random writes - same as above.
The general impression is that a DW consists of large sequential I/O, and from an Oracle perspective that is true.

I/O from an Oracle perspective -

Oracle --> Segments --> Extents --> DB Blocks --> Filesystem Blocks --> Volume --> LUNs

While the access profile from segment to extent to DB block may be sequential, at the filesystem, volume, and LUN level it may not be. If Oracle has 500 data files spread across 10 volumes on 50 LUNs, and Oracle is doing I/O to 300 of those files, then from a volume/LUN perspective the I/O will be mostly random.

It is up to the various I/O schedulers (volume, HBA, array) to merge and sort the I/O requests to minimize head movement. All of this has to happen within a short window, otherwise you see I/O starvation. To sum up: access from the array's perspective is mostly random.
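The file-to-LUN scattering above can be sketched with a toy layout. The round-robin file placement and the counts (500 files, 50 LUNs, 300 active files) are illustrative assumptions, not how any particular volume manager lays data out:

```python
# Hypothetical layout: Oracle data files placed round-robin onto LUNs.
NUM_FILES, NUM_LUNS = 500, 50
ACTIVE_FILES = 300

lun_of = {f: f % NUM_LUNS for f in range(NUM_FILES)}

# Even if I/O within each active file is sequential, interleaving the
# streams of 300 concurrently active files touches every LUN in a
# scattered order - random from the array's point of view.
luns_touched = {lun_of[f] for f in range(ACTIVE_FILES)}
print(len(luns_touched))  # all 50 LUNs are hit
```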

So how does RAID 5 handle random reads and writes?

RAID 5 on a storage array is typically denoted as a 4D+1P or 6D+1P RAID group, meaning 4 or 6 data drives plus one drive's worth of parity capacity (in RAID 5 the parity is rotated across all the drives). Data is written in stripes (the stripe width on a single disk ranging from 64K to 384K or higher), with a parity block generated for every stripe and written to one of the drives. Arrays do not give you the option of setting very large stripe widths (>1MB) because of the cost of the parity calculation.

Parity generation is a CPU-intensive task and can be quite taxing when activity on the array is high. The bigger the stripe width, the more time is spent calculating parity.

During a random read, prefetch does not help because the access pattern is random. There will be cache misses, and since the data is striped, it can be retrieved only in stripe-width units. So if the database requests 2MB of records, the array has to divide that 2MB by the stripe width and perform that many operations to retrieve the data.

2MB of data = 32 reads (64K stripe width, worst case).
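The arithmetic behind that figure is just a ceiling division of the request size by the stripe width; a minimal sketch (the function name is mine):

```python
def reads_needed(request_bytes, stripe_width_bytes):
    """Worst case: each stripe-width chunk becomes a separate back-end read."""
    # Ceiling division without floats.
    return -(-request_bytes // stripe_width_bytes)

print(reads_needed(2 * 1024**2, 64 * 1024))  # 2MB / 64K -> 32
```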

Worse still are random writes - Oracle issues a write to a block, and the array has to retrieve the entire stripe holding the data.

E.g. -

  • RAID group = 4D+1P
  • Size of LUN = 10G
  • Stripe width = 64K (on a single disk)
  • Number of stripes = 10G / (64K * 4) = ~40,000
If the data is in stripe 10,001, the array has to retrieve all of stripe 10,001, make the change, calculate the new parity, and then update the parity block for that stripe.
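The stripe count in the example works out like this (a sketch of the same arithmetic, using the post's 4D+1P / 64K / 10G numbers):

```python
KB, GB = 1024, 1024**3

lun_size = 10 * GB
per_disk_width = 64 * KB
data_disks = 4                         # the "4D" in 4D+1P

full_stripe = per_disk_width * data_disks   # 256K of data per stripe
num_stripes = lun_size // full_stripe
print(num_stripes)  # 40960, i.e. roughly 40,000 stripes on the LUN
```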

Under normal circumstances, the array has sufficient cache to hold these writes and destage them to disk later. The array receives the change in cache and issues a write-complete to the host.

However, cache is limited, and when it fills up the host volumes will suddenly show a huge spike in write latency (>1000ms).

The ideal storage for a DW is RAID 1 on the array, with RAID 0 across those LUNs at the system level. In my experience, modular arrays such as the Hitachi 9585 or AMS1000 are ideal for DW requirements - provided they are configured correctly. In tests I performed, RAID 1+0 simply ran circles around RAID 5, especially on large random reads.

Oracle's suggestion of SAME (Stripe And Mirror Everything) is very good for data warehousing.

Veritas Volume Manager does a better job than the array at handling reads and writes to striped volumes. The idea is to let the volume manager handle the prefetch and the write staging.

So how does an array handle RAID 1 requests?

Random reads - no prefetch, as the access is random in nature; however, the array reads simultaneously from both drives in chunks of 500K (or per the array's specs). The volume manager does any prefetch required.

Random writes - uses cache to destage. With no parity calculation, writes are much faster. Writes cost more than reads because we write to 2 disks, but that is nowhere near as expensive as a RAID 5 write, which also has to read and rewrite the parity.
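A commonly cited model of the small-write cost makes the comparison concrete: RAID 1/10 does 2 back-end writes per host write, while RAID 5 does 4 back-end I/Os (read old data, read old parity, write new data, write new parity). The function and counts below are that textbook model, not measurements from the arrays discussed here:

```python
# Back-end disk operations per small host write (standard model):
#   RAID 1/10: write the block to both mirror sides      -> 2 writes
#   RAID 5:    read-modify-write of data and parity      -> 4 I/Os
def backend_ops(raid_level, host_writes):
    per_write = {"raid10": 2, "raid5": 4}[raid_level]
    return host_writes * per_write

print(backend_ops("raid10", 1000), backend_ops("raid5", 1000))  # 2000 4000
```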

RAID 0 with a 1MB stripe width on the system helps with large reads. Writes are efficient when they occur in multiples of 1MB chunks.
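The alignment point can be sketched as a simple check: a write that is an exact multiple of the 1MB stripe unit maps to whole chunks, while anything else leaves a trailing partial chunk that the stack handles less efficiently. The helper name is mine, for illustration only:

```python
STRIPE = 1024 * 1024  # 1MB stripe unit at the volume-manager level

def chunk_layout(write_bytes):
    """Return (full 1MB chunks, whether a partial trailing chunk remains)."""
    full, remainder = divmod(write_bytes, STRIPE)
    return full, remainder != 0

print(chunk_layout(4 * STRIPE))        # (4, False): aligned, efficient
print(chunk_layout(4 * STRIPE + 512))  # (4, True): trailing partial chunk
```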

All in all, RAID 1+0 is the winner. It is best to use modular arrays, as these are cheaper and can be dedicated to warehousing.
