Monday, September 21, 2009

Download Link for the "Storage Design for Datawarehousing" presentation

It appears that scribd does not allow anonymous downloads any more. Hence on request, the "Storage Design for Datawarehousing" presentation can be downloaded in PDF format here.

2 comments:

karlarao said...

Hi Krishna,

Before I ask you a lot of questions.. I would say that this post is really good stuff.. :) putting into a scenario how to build a balanced system for a Data Warehouse. What I like most are slides #26-27, the questions for the array vendor.. and the graphics you put into it that make the complicated topics easy to grasp..

However, I've got some questions.. out of curiosity..


1) On slide #15... you have this statement on requirements gathering.. "if not available, then document requirements as best as you can"

mmm... I have a couple of problems:

- In the usual case it is a new system or application, and although you'll have an idea of how many people will be using the application, you won't be able to build enough statistics or numbers to get the requirements you need to match it to the storage capacity/performance... this is also the same with CPU...

- And also you may not have a reference environment for the new application.. (in your post, where did you get these numbers?)
So given this predicament it would be safer to put an assumption in the proposal, something like,

"With the information available, there is a significant risk the CPU/storage subsystem may not provide enough capacity to meet the application requirements."

But you still have to recommend the CPUs/storage.. & you have no choice.. :) What would you do in this kind of scenario?
I even posted a question on Oracle-l.. and did not get any reply.. http://www.freelists.org/post/oracle-l/Experiencesthoughts-about-hardware-recommendations


2) On slide #13... you have this statement on one bullet.. "Small (KB) to large sized IOPS (> 1MB) – for both reads and writes." ...

- how did you measure the io size on the system and oracle level?
- And from that you concluded that the storage subsystem needs to handle a 1024K IO request size to prevent IO fragmentation.. what is the evidence behind that?
- Did you also consider the filesystem block size against the LUN stripe size and array stripe size? Normally on Linux we have a 4KB filesystem block size.. but I'm also curious how this will affect performance if I size it the same as my Oracle block size.. hmm...


3) On slide #34..

- what is the computation behind the IOPS/GB?
- and what picture do the IOPS/GB stats reveal?


4) Also on slides #44 and #45

- how were you able to get the single and multiblock IO sizes on the 400GB data volume?
- what measurement tool did you use to get the Active Surface Area/Drive?


5) Can you recommend some reading materials on Data Warehouse performance? Also on storage and building a balanced system?



Thanks in advance!

Happy Holidays :)


- Karl Arao

Krishna Manoharan said...

Hi Karl,

Thank you for stopping by.

Question 1 - How can I document requirements for a new environment? I do not have an existing baseline.

There is no easy answer for this. It would need to be a calculated guess, and I would lean towards over-engineering rather than under-engineering (to be on the safe side).

The requirements in the presentation are from a moderate-to-medium sized Enterprise DW at a medium sized software company. Enterprise DW here means internal reporting (Peoplesoft, ERP, CRM, etc.). I gathered these from an existing DW.

Question 2 - how did you measure the io size on the system and oracle level?

System side - You can see the IO sizes using iostat (the average request size), or vxdmpadm / vxstat on Veritas volumes.

Oracle side - For reads, MBRC (db_file_multiblock_read_count) determines the maximum read size; Oracle caps a single multiblock read at 1MB, and it cannot exceed this.
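As a rough arithmetic sketch of how MBRC and the 1MB ceiling interact (the 8KB block size and MBRC of 128 below are illustrative assumptions, not values from the presentation):

```python
# Sketch: how MBRC caps Oracle's maximum multiblock read size (assumed values).
ONE_MB = 1024 * 1024          # Oracle's hard ceiling per multiblock read
block_size = 8 * 1024         # db_block_size: 8KB (assumed)
mbrc = 128                    # db_file_multiblock_read_count (assumed)

# The effective maximum read is block_size * MBRC, clamped at 1MB.
max_read = min(block_size * mbrc, ONE_MB)
print(max_read)               # 1048576 bytes -> the 1MB cap is reached
```

With these values the product lands exactly on the 1MB cap; a larger MBRC would not increase the read size further.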

For writes, I have seen redo writes easily exceed 4M (with big commit intervals and heavy DML activity).

- And from that you concluded that the storage subsystem needs to handle a 1024K IO request size to prevent IO fragmentation.. what is the evidence behind that?

I do not understand this question. Can you please re-phrase?

- Did you also consider the filesystem block size against the LUN stripe size and array stripe size? Normally on Linux we have a 4KB filesystem block size.. but I'm also curious how this will affect performance if I size it the same as my Oracle block size.. hmm...

As regards filesystem block sizes, my experience has always been with Vxfs which is an extent based filesystem. Block size is not really relevant as long as you read contiguous blocks and not cross the filesystem extents. You can read about it here.

http://dsstos.blogspot.com/2008/12/analyzing-impact-of-vxfs-filesystem.html

I am not familiar with Ext3.

As regards stripe widths (Array or Volume), it is kind of a mixed bag. Traditionally, one would use the widest stripe width available (here, width means the size of the chunk on one disk). However, my experience has been that Array or Volume striping is not as efficient as ASM-style striping, or even simple Oracle extent striping.

There is very little chance that an Oracle block (or multiples) will align with the Volume stripe.

Question 3 - what is the computation behind the IOPS/GB?
- and what picture do the IOPS/GB stats reveal?


It is dividing the number of rated IOPS by the size of the drive.

It gives you an indication of how much capacity of a drive is really usable to achieve your performance needs.
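As a sketch of that division (the rated-IOPS figure below is an assumed ballpark for a single 15K RPM drive, not a number from the slides):

```python
# IOPS/GB = rated IOPS of the drive / drive capacity (illustrative numbers).
rated_iops = 180       # assumed ballpark for one 15K RPM drive
capacity_gb = 146      # nominal drive size

iops_per_gb = rated_iops / capacity_gb
print(round(iops_per_gb, 2))   # ~1.23 IOPS available per GB stored
```

The point of the metric: if your workload needs more IOPS per GB of data than the drive delivers, you must spread the data over more spindles than capacity alone would suggest.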

Question 4 - How were you able to get the single and multiblock IO sizes on the 400GB data volume?

Single block IO size is simply the DB Block Size.

Multiblock IO size would be multiples of the DB Block Size, up to a max of 1MB. Since it can vary between the DB Block Size and 1MB, I estimated the average size to be 768K.

If you have an existing DW, you can see the Tablespace/File IO stats in AWR reports and see the average number of Blocks read per IO. This is another way to measure your IO sizes from an Oracle perspective.
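For example, the AWR blocks-per-read figure converts to an IO size like this (the 8KB block size and the blocks-per-read value are assumed, hypothetical inputs):

```python
# Turn AWR's average blocks-per-read into an average IO size (assumed values).
block_size = 8 * 1024        # db_block_size: 8KB (assumed)
avg_blocks_per_read = 96     # from AWR Tablespace/File IO stats (hypothetical)

avg_io_bytes = block_size * avg_blocks_per_read
print(avg_io_bytes // 1024)  # 768 -> a 768K average multiblock IO
```

With those inputs the result matches the 768K average estimated above.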

- what measurement tool did you use to get the Active Surface Area/Drive?

The Active Surface Area/Drive is the usable capacity of a drive: for a 146GB drive, it is typically around 133GB. Based on my results, I concluded that for optimal performance, only 33% of a 146GB drive is really usable; the rest can be discarded.

Question 5 - Can you recommend some reading materials on Data Warehouse performance, and on storage and building a balanced system?

Actually, I am not aware of any such books.. it is mostly real-world, hands-on experience.

From an Oracle perspective, my favorite authors are Jonathan Lewis and Steve Adams. I have learned much from their books and posts, and I would recommend their books to anyone (even though they cover 8i and 9i).

Please do let me know if you have any further questions.

Thanks
Krishna