Sunday, December 28, 2008

Analyzing the impact of the Vxfs filesystem block size on Oracle

I am often asked what the ideal Vxfs filesystem block size is for an Oracle DB block size of 16K. My answer has always been 8K (the maximum on Vxfs).

All along, my reasoning was that with, say, a 1K filesystem block size, a 16K Oracle block read would end up as sixteen 1K IO requests to the filesystem, and the same for writes. With a filesystem block size of 8K, those sixteen 1K requests would shrink to just two 8K requests - or so I thought.

I decided to test and see what exactly was happening, and the test proved me wrong - at least with respect to Vxfs.

First, some background about Vxfs -

Vxfs is an extent-based filesystem - meaning it allocates space to files not as individual blocks, but as extents. An extent is a contiguous set of filesystem blocks. Extent sizes vary, and the way a file is created greatly influences extent sizing. As a file grows, more extents are added to it.

The interesting part about Vxfs and extents is that a single IO never crosses an extent boundary: a request for a contiguous set of blocks within one extent is satisfied with a single IO, while a request that spans extents results in multiple IO requests - quite similar to how a db file scattered read is split across Oracle extents. For example, a 16K read that lies entirely within one extent goes out as a single 16K IO, whereas the same 16K range straddling an extent boundary (say 12K in one extent and 4K in the next) becomes two IOs. From the Vxfs guide -

"By allocating disk space to files in extents, disk I/O to and from a file can be done in units of multiple blocks. This type of I/O can occur if storage is allocated in units of consecutive blocks. For sequential I/O, multiple block operations are considerably faster than block-at-a-time operations. Almost all disk drives accept I/O operations of multiple blocks."

So coming back to Oracle – some test scenarios

I decided to test and see for myself.

The environment is Solaris 9 on an E4900 with Storage Foundation for Oracle Enterprise Edition. Oracle is 10.2.0.3 using VRTS ODM.

I created 2 tablespaces - one on a 1K block filesystem and the other on an 8K block filesystem. Each had 1 datafile of size 5g.
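For reference, a minimal sketch of how such a setup can be created. The volume names match the fstyp output further down, but the mount points, tablespace and datafile names here are illustrative assumptions, not necessarily the exact ones used in this test:

# create the two filesystems with different block sizes (bsize in bytes)
mkfs -F vxfs -o bsize=1024 /dev/vx/rdsk/oracledg/test1k
mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/edwrsdg/test8k
mount -F vxfs /dev/vx/dsk/oracledg/test1k /oradata/test1k
mount -F vxfs /dev/vx/dsk/edwrsdg/test8k /oradata/test8k

-- one tablespace on each filesystem, each with a single 5g datafile
SQL> CREATE TABLESPACE test1k DATAFILE '/oradata/test1k/test1k_01.dbf' SIZE 5G;
SQL> CREATE TABLESPACE test8k DATAFILE '/oradata/test8k/test8k_01.dbf' SIZE 5G;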


Identical tables with ~1000 rows were created in both tablespaces. Indexes were created on the relevant columns of both tables.

On a 1K Filesystem block size and a 16K DB Block size

First to confirm 1K block size

root@oracle:> fstyp -v /dev/vx/rdsk/oracledg/test1k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:33 2008
logstart 0 logend 0
bsize 1024 size 15728640 dsize 0 ninode 15728640 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I initiated both sequential and scattered reads on the tables.
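A sketch of the kind of queries that exercise both patterns - the table, index and column names here are my illustrative assumptions. A full table scan produces db file scattered read (multiblock) requests, while an index-driven lookup produces db file sequential read (single-block) requests:

-- scattered reads: force a full table scan (multiblock IO)
SQL> SELECT /*+ FULL(t) */ COUNT(*) FROM test_tab_1k t;

-- sequential reads: index-driven single-block IO
SQL> SELECT /*+ INDEX(t test_idx_1k) */ * FROM test_tab_1k t WHERE id = 42;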

A vxtrace showed that Oracle was issuing single IO requests of 16K or larger; they were not broken up into smaller IO requests as I had expected. (In the vxtrace output below, len appears to be in 512-byte sectors, so len 32 corresponds to 16K and len 128 to 64K.) I could not use truss because IO requests show up as ioctl calls when using ODM. There was no read IO smaller than 16K (len 32), confirming that IOs are not split along filesystem block boundaries.
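For reference, the trace below is the kind of output vxtrace produces when pointed at the underlying volume; a sketch of the invocation, assuming the volume lives in the oracledg disk group shown by fstyp (the exact options used may have differed):

vxtrace -g oracledg test1k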

------------------------------------------
1254 START read vol test1k op 0 block 4326176 len 32 <----- 16K Reads
1254 END read vol test1k op 0 block 4326176 len 32 time 0

--------CUT---------

1260 START read vol test1k op 0 block 4326048 len 128 <------ 64K Reads
1260 END read vol test1k op 0 block 4326048 len 128 time 0
1261 START read vol test1k op 0 block 4326176 len 32
1261 END read vol test1k op 0 block 4326176 len 32 time 0
1262 START read vol test1k op 0 block 4325792 len 128
1262 END read vol test1k op 0 block 4325792 len 128 time 0

------------CUT------------------------------

On an 8K Filesystem block size and a 16K DB Block size

To confirm the block size is indeed 8k

root@oracle:> fstyp -v /dev/vx/rdsk/edwrsdg/test8k
vxfs
magic a501fcf5 version 6 ctime Sat Dec 27 22:52:47 2008
logstart 0 logend 0
bsize 8192 size 655360 dsize 0 ninode 655360 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0

I did the same set of reads as on the 1K filesystem, and the behaviour was identical.

------------CUT-----------

1265 START read vol test1k op 0 block 4326048 len 128 <------ 64K reads
1265 END read vol test1k op 0 block 4326048 len 128 time 0
1266 START read vol test1k op 0 block 4326176 len 32 <--------- 16K reads
1266 END read vol test1k op 0 block 4326176 len 32 time 0
1267 START read vol test1k op 0 block 4325888 len 32
1267 END read vol test1k op 0 block 4325888 len 32 time 0

------------CUT----------------

So the reads behave exactly as documented. Oracle only issues reads in multiples of the DB block size. On either a 1K or an 8K Vxfs filesystem, a 16K read (or a multiple of 16K) is a sequential read of contiguous blocks and is therefore satisfied by a single IO request - as long as the request can be met from a single extent.

So from an IO perspective, it really does not matter whether you use 1K or 8K.

Now there is another aspect to this - filesystem overhead, fragmentation, extent sizing and space management.

A 1K filesystem block size would reduce space wastage at the cost of having to manage many more blocks (filesystem overhead), whereas an 8K filesystem block size is ideal for an Oracle instance using a DB block size of 8K or higher.
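A rough calculation makes the overhead difference concrete. For the 5g of datafile space alone:

5 GB at a 1K block size = 5,242,880 filesystem blocks to track
5 GB at an 8K block size = 655,360 filesystem blocks to track

(The 8K figure happens to match the size 655360 reported by fstyp above, since that filesystem is 5 GB.) And since Oracle never writes anything smaller than a whole 16K DB block into the datafile, the finer 1K granularity buys no space savings there.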

From a filesystem management perspective, an 8K filesystem block size makes better sense, as Oracle never stores data in units smaller than the DB block size. An 8K filesystem block size reduces the number of blocks and correspondingly the filesystem overhead of maintaining them. I do not know if anyone uses a 4K DB block size any more; all I have seen are 8K and higher.

To reduce fragmentation, it is best if the datafile occupies a single Vxfs extent (as will be the case when it is created on a database using VRTS ODM). The extent here refers to Vxfs extents, not tablespace extents. To keep a datafile in a single Vxfs extent, it should never be extended; instead, new datafiles should be added to increase tablespace capacity.
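In practice that means avoiding autoextend and resize on existing datafiles and growing the tablespace by adding files - a sketch, with the same illustrative names as above:

SQL> ALTER TABLESPACE test8k ADD DATAFILE '/oradata/test8k/test8k_02.dbf' SIZE 5G;
-- rather than growing the existing file, e.g.:
-- SQL> ALTER DATABASE DATAFILE '/oradata/test8k/test8k_01.dbf' RESIZE 10G;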

You can find out the extents allocated to a file by running vxstorage_stats - it is an invaluable tool. Fragmentation status can be checked by running fsadm. Normally, when using ODM, fragmentation should be minimal.
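As a rough sketch - the exact paths and options depend on the Storage Foundation version installed, and the datafile and mount point names here are the same assumptions as before:

# list the Vxfs extents backing a datafile
vxstorage_stats -f /oradata/test8k/test8k_01.dbf

# report extent fragmentation for the filesystem
fsadm -F vxfs -E /oradata/test8k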

2 comments:

Anonymous said...

Hi, very interesting post, but I have a question (maybe a stupid one): when using ODM, does the bsize matter? Shouldn't Oracle interface with the fs as a raw device?
Thanks cheers Marco

SSK said...

Hi Marco,

Thanks for stopping by. When ODM is enabled, Oracle does not bypass the filesystem or do I/O directly to the raw device.

Along with other features, ODM enables direct I/O, with async or sync behaviour depending on the nature of the request. Direct I/O bypasses the filesystem buffer cache.

ODM enables raw device like performance, but through a filesystem interface.

To get a better understanding of ODM, I would refer you to an excellent document titled Oracle Disk Manager written by Nitin Vengulerkar.

Thanks
Krishna