Wednesday, December 15, 2010

A tool for analyzing Storage Performance - SWAT from Oracle

Storage is probably the most common bottleneck found in the majority of databases.

Most Database Administrators are not Storage Engineers, and vice versa. For a Database Administrator, translating a database requirement or problem into storage terminology is not easy - even with all the stats available from the Oracle Database.

Generally, there are two common categories of storage issues: performance problems and capacity planning.


As regards performance, rather than saying "I am seeing a lot of storage related waits, so storage is slow, can you please fix it", it is preferable to talk to your Storage Admin in a language he can understand.

Remember, Storage Engineers look at storage performance from a LUN perspective and not from an application stats perspective. They would be the least concerned about log file parallel write wait times being 20ms. On the other hand, if you tell them that LUNs 10 and 11 (being used for redo) are write intensive with a lot of small block IOPS and that you are seeing high response times, it makes perfect sense to them.

From a capacity planning and performance trending perspective, how are we to actually track changes in storage performance over a period of time? Capacity planning is not just about free space, it is an overall look at the health of your storage subsystem with respect to increasing application demands.

Since it is the Database Administrator making the request, it would probably make sense for the DBA to learn Storage terminology. Knowledge is power and knowing your requirements well would make all the difference and possibly earn the respect of your Storage Engineer.

Most Storage Engineers generally look for the following when addressing performance issues and capacity planning (today and in the future):

  1. Size of the working Data Set
  2. Number of IOPS being generated
  3. Size of an IOP
  4. % Reads versus % Writes
  5. Expected Response Time versus actuals
  6. Data Rate or Bandwidth consumed today
  7. And so on.

We can get into a lot more detail on each of the above line items, but to keep it simple and begin a conversation, this list would suffice.

How are we to obtain this information from an Oracle Database?


Yes, it is possible to get this information from the various v$ views and stats; however, what we need is a simple tool which can monitor the system and collect, collate and trend all this information in a meaningful manner.
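For example, a rough picture of IOPS, bandwidth and the read/write mix can be pulled from v$sysstat. Below is a minimal sketch (the statistic names are the cumulative-since-startup counters available in 10g and later, so you would sample twice and take the difference; the connection string is a placeholder):

#!/bin/sh
# Sketch: dump the cumulative I/O counters from v$sysstat.
# Sample twice a few minutes apart and difference the values to derive
# IOPS, bandwidth and the read/write mix over that interval.
sqlplus -s "/ as sysdba" <<'EOF'
set linesize 120 pagesize 100
col name format a40
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('physical read total IO requests',
                'physical write total IO requests',
                'physical read total bytes',
                'physical write total bytes');
EOF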



There are commercial tools available; however, we need something which is lightweight, flexible and potentially available at no added cost.

SWAT is a tool from Oracle and, for the features and capabilities it offers, it is very lightweight and efficient.
SWAT stands for Sun StorageTek Workload Analysis Tool. There is no equivalent to SWAT that I have seen. Incredibly enough, it is not very well known.


It takes a few minutes to set up and it can run as a non-privileged user. As long as Java is installed, you are good to go. It is supported on Solaris, Linux and Windows systems.

It can be set up to run continuously in the background, which makes it invaluable for capacity planning.

To get a copy of SWAT, I would suggest that you approach your Oracle Support contact. The author of SWAT is Henk Vandenbergh, who is also the author of vdbench, an excellent tool for testing IO subsystems.

Some screen shots from SWAT.



Tuesday, November 23, 2010

SATA versus SAS - Realistic Expectations

As a Performance Engineer, I do benchmarks all the time, and I frequently get asked about SAS versus SATA drives. While it is a no-brainer as to which performs better, SATA drives are priced at about 50% of SAS drives and offer a lot more capacity.

Price and capacity are attractive enough, but under what circumstances can a SATA drive perform comparably to a SAS drive?  As I do benchmarks on Oracle databases, I did a few simple tests to ascertain the performance characteristics of SATA versus SAS.

The drives I used for the tests are from Seagate. The SATA Drive is the Seagate Constellation ES 7200 RPM 2TB Drive with SAS interfaces. The SAS Drive is the Seagate Cheetah 15K RPM 600GB Drive with SAS interfaces. 

Both are 3.5inch drives.

The below table summarizes some key specs of these drives.


The newer generation of SATA drives is equipped with SAS interfaces. Basically, the two drives look the same except for the internals. Of particular interest is the areal density, which shows that SATA platters are far more densely packed than SAS.


For the test, I created a 64GB partition from the outermost sectors of each drive. In my experience, short stroking is a must for reasonable performance from a hard drive. The degree to which you short stroke a drive depends on how much capacity you are willing to sacrifice for performance. To give an example, for optimal performance, a 146GB SAS drive must be short stroked to ~50GB (about a third of its size).

I used the partition as-is with no filesystem - basically a raw device.
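If you want to confirm where a slice or partition actually sits on the disk, something along these lines works (device names are placeholders; on most drives, the lower sector numbers map to the outer, faster tracks):

# Solaris: print the slice layout with first/last sectors
prtvtoc /dev/rdsk/c1t1d0s2
# Linux: print the partition table with start/end sectors
fdisk -lu /dev/sdb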


I performed three tests on the drives using vdbench. Vdbench is a tool from Oracle, authored by Henk Vandenbergh, and is highly flexible in terms of testing options (a sketch of a parameter file follows the list below).

  1. Transaction System - small reads/writes of 16K IO size simulating single block activity - 70% Reads + 30% Writes.
  2. DW System - large reads/writes of 1024K IO size simulating multiblock activity - 70% Reads + 30% Writes.
  3. Hybrid System - a combination of 16K and 1024K IO sizes - 70% Reads + 30% Writes.
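As promised, a minimal sketch of what the parameter file for the transaction-style test could look like (the device path and output directory are placeholders, and the keywords should be double-checked against the vdbench documentation for your version):

# Transaction-style test: 16K transfers, 70% reads, fully random,
# driven through an IOPS curve so the response-time knee is visible.
cat > txn.parm <<'EOF'
sd=sd1,lun=/dev/rdsk/c1t1d0s0
wd=wd_txn,sd=sd1,xfersize=16k,rdpct=70,seekpct=100
rd=rd_txn,wd=wd_txn,iorate=curve,elapsed=600,interval=5
EOF
./vdbench -f txn.parm -o txn_out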

I generally only focus on response time as this is what an application such as Oracle will report on via the wait events. Ideally a single block IO request would complete in < 5ms (peak) and a large block IO request in < 20ms (peak).

This is hard to meet with spinning media under high concurrency, so I would settle for a peak sustainable response time of < 15-20 ms.

The number of IOPS is dependent on Use Case Scenarios and cannot be generalized.

The graphs below correlate IOPS with response time. As the number of requests (IOPS) increases, the response time for an IO will start increasing.

We would be more interested in where exactly the hard drive fails - in the sense that it cannot deliver a predictable response time or cannot satisfy the number of IO requests.

Transaction System Test


If you are considering SATA for a Transaction System, then depending on the number of single block requests and your threshold for response times, it may be cost effective to use SATA. Looking at the graph below, for a 16K block, it would appear that 100 IOPS/drive is about the maximum a SATA drive can sustain without falling off a cliff. If using an 8K block size, it would be even better.


DW System Test

A DW system is typically characterized by a smaller number of large block IOs when compared to a transaction system. As you can see below, a SATA drive cannot sustain more than 30-40 IOPS before experiencing a failure in response time, and it cannot deliver more than 68 IOPS at all.

A SAS drive is a lot more scalable and predictable than a SATA drive. Even doubling the number of SATA drives cannot equal the performance of a single SAS drive. So SAS would be a much better fit for DW than SATA.


Hybrid System Test

For a Hybrid System, again, it may not make sense to use SATA. You can squeeze a few more IOPS out of the drive (a maximum of 105) before it fails, but the response times are high.


To summarize, for low to mid volume transaction systems, SATA drives may be quite affordable and also deliver reasonable performance. For DW and Hybrid systems, it depends on the use case, and SATA is potentially neither effective nor efficient.

Saturday, June 19, 2010

Comparing Data Warehouse Appliances - Evaluation Criteria from an End user perspective

It is so easy to lose track of the big picture when comparing different DW Appliance vendors. Different technologies bring different strengths to the table. With that in mind, in a recent project, I worked on a list of criteria to make it easier to evaluate vendors.

There is no "One size fits all" and there is no perfect Appliance. Every appliance has its perceived strengths and weaknesses. It is up to the end user to evaluate and identify as to which Appliance will fit the needs adequately.

From what I have seen, a successful and stable deployment is the result of a combination of many factors.

I hope these selection criteria will be useful for others too. The presentation can be downloaded from here.

DW Appliance - Evaluation Criteria

Tuesday, February 2, 2010

Joins - ANSI SQL versus Traditional Oracle (+) syntax

Some time back (with 9i), Oracle introduced support for the ANSI join syntax - inner join, full outer join, left outer join and right outer join. It can be used in place of the (+) operator.
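To make the two styles concrete, here is a small sketch of the same left outer join written both ways (the classic EMP/DEPT demo tables and the scott account are used purely for illustration):

sqlplus -s scott/tiger <<'EOF'
-- ANSI style
SELECT e.ename, d.dname
FROM   emp e
       LEFT OUTER JOIN dept d ON (e.deptno = d.deptno);

-- Traditional Oracle (+) style - logically equivalent to the above
SELECT e.ename, d.dname
FROM   emp e, dept d
WHERE  e.deptno = d.deptno(+);
EOF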

I came across a curious problem on Oracle 10.2.0.4. The explain plans were markedly different (with different execution times as well) when the SQL was written using the ANSI syntax versus the traditional Oracle (+) syntax. The traditional syntax was a lot faster, and even when passing hints with the ANSI style, I could not force a similar execution plan.

ANSI Style


Traditional Oracle Syntax




As to why the CBO was behaving this way, I have no idea. But if you do run into problems with joins, it may make sense to rewrite the SQL in the traditional Oracle syntax.

Wednesday, January 27, 2010

Understanding CPU Time as an Oracle Wait event

We have all seen "CPU Time" as a Top 5 wait event at one point or another when troubleshooting performance issues. From an administrator (DBA/SA) perspective, conventional thinking would be that CPU is a bottleneck.

But what if the stats from the system show that CPU utilization (% util and run queue) is well within thresholds and shows plenty of available capacity, yet Oracle continues to report CPU time as a Top 5 wait event?

We are also seeing a high degree of involuntary context switching (using mpstat in Solaris) or context switches (using vmstat in Linux). Obviously, something is not computing right.
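For reference, a quick sketch of where to look for these counters (the pid is a placeholder, and the per-process counters in /proc can vary slightly by kernel release):

# Solaris: per-CPU involuntary context switches show up in the icsw column
mpstat 5
# Solaris: per-LWP microstate view; ICX is involuntary context switches
prstat -mLc -p <oracle_pid> 5
# Linux: system-wide context switches are the cs column in vmstat;
# per-process voluntary/nonvoluntary counts live in /proc/<pid>/status
vmstat 5
grep ctxt_switches /proc/<pid>/status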


CPU Time could mean that the process is either
  • On a CPU run queue waiting to be scheduled
  • Or currently running on a CPU.
Obviously, we are interested in
  • Minimizing the wait time on the run queue so that the session can run on the CPU as soon as possible. This is determined by the priority of the process.
  • And once running on the CPU, be allowed to run on the CPU to complete its tasks. The amount of time available for the process on the CPU is defined as the Time Quanta.
Both of the above aspects are controlled by the OS Scheduler/Dispatcher. From the wiki page for scheduling

"Scheduling is a key concept in computer multitasking, multiprocessing operating system and real-time operating system designs. In modern operating systems, there are typically many more processes running than there are CPUs available to run them. Scheduling refers to the way processes are assigned to run on the available CPUs. This assignment is carried out by software known as a scheduler or is sometimes referred to as a dispatcher"

Understanding how the scheduler shares CPU resources is key to understanding and influencing the wait event "CPU Time".

In any Unix platform, there are processes which take higher priority than others. Labeling a process as higher priority can be done through the implementation of Scheduling classes and with the nice command. Both can have different effects on the process.

An easy method to identify the scheduling class and current priority for a process is to use the ps command. Used with the "-flycae" arguments, it shows both the scheduling class and the current priority. However, it does not show the CPU time quanta associated with a process.


dbrac{714}$ ps -flycae |egrep "ora_" |more

S   UID    PID   PPID   CLS   PRI  CMD
S   oracle 931   1      TS    24  ora_p000_DW
S   oracle 933   1      TS    24  ora_p001_DW

In the above example, you would be interested in the CLS and PRI columns. It shows that the Oracle background processes are running under the TS scheduling class with a priority of 24. The higher the number reported in the PRI column, the higher the priority.

The default Scheduler for user processes is TS or Time Share and is common across Solaris and Linux. The TS scheduler changes priorities and CPU time quantas for processes based on recent processor usage.


Since we appear to have plenty of CPU resources, we could draw the conclusion that the default (TS) scheduling class is not good enough for us. Either the scheduler is not allocating a sufficient CPU time quanta (resulting in involuntary context switching), or it is not giving the process a sufficiently high priority so that it can be scheduled earlier than other processes.

So how do we change it? Obviously we would want to
  • set a fixed priority for Oracle processes so that they are able to run on the CPU ahead of other competing processes.
  • set a fixed time quanta for Oracle processes so that they can run to completion on the CPU.
With either Solaris or Linux, the easiest way to implement this is to change the scheduling class for the Oracle processes. Both operating systems offer a scheduling class with fixed priorities and fixed CPU time quantas - fixed in the sense that they are fixed throughout the lifetime of the process, but can be changed to suit your requirements at any time.

In Linux, it is the RR class and in Solaris it is the FX class. The simplest way to change the scheduling class is to use the priocntl tool. While it is a native binary on Solaris, it is available on Linux through the Heirloom Project.

On Linux, you would need to use the renice command to change the CPU time quantas and on Solaris, priocntl does both - scheduling class and time quanta.

Let us look at a few examples -


On Linux - Let us try and change the Scheduling Class and Time Quanta for the Log Writer.


[root@dbrac root]# ./priocntl -l
CONFIGURED CLASSES
==================

TS (Time Sharing)
Configured TS User Priority Range: -19 through 20

RT (Real Time Round Robin)
Maximum Configured RT Priority: 99

FF (Real Time First In-First Out)
Maximum Configured FI Priority: 99

We see that there are 3 Scheduling classes available for use.

[root@dbrac root]# ps -flycae |grep ora_lgwr

S   UID    PID   PPID CLS   PRI   CMD

S   oracle 30318 1    TS    23    ora_lgwr_DWRAC

It shows that LGWR is running in TS class with a Priority of 23.

[root@dbrac root]# ./priocntl -d 30318
TIME SHARING PROCESSES
PID      TSUPRI

30318     0

Let us change LGWR to the RT class with an RT priority of 50.

[root@dbrac root]# ./priocntl -s -c RT -p 50 -i pid 30318

[root@dbrac root]# ./priocntl -d 30318
REAL TIME PROCESSES
PID     RTPRI   TQNTM
30318   50      99


It shows that the RT priority is 50 and the Time Quanta is 99.

[root@dbrac root]# ps -flycae |grep ora_lgwr

S  UID    PID    PPID  CLS  PRI  CMD
S  oracle  30318 1     RR   90   ora_lgwr_DWRAC


Note that even though the RT priority is 50, ps shows the PRI as 90.

Let us change the time quanta for the Log writer.

[root@dbrac root]# renice +2 30318
30318: old priority 0, new priority 2

[root@dbrac root]# ps -flycae |grep ora_lgwr

S   UID    PID   PPID CLS PRI CMD
S   oracle 30318 1    RR  90 ora_lgwr_DWRAC


No change in the PRI after renicing an RT process (expected).

[root@dbrac root]# ./priocntl -d 30318
REAL TIME PROCESSES
PID      RTPRI  TQNTM
30318    50      89


But when checking with priocntl, we see that the time quanta is now 89 (Previous was 99).

Let us see if we can increase the time quanta.


[root@dbrac root]# renice -3 30318
30318: old priority 2, new priority -3


[root@dbrac root]# ./priocntl -d 30318
REAL TIME PROCESSES
PID     RTPRI    TQNTM
30318    50       459


Now the time quanta is 459. The higher the time quanta, the more time the process can spend on the CPU before being context-switched out.

For Solaris, priocntl can be used to set the scheduling class and the time quanta in a single step. The flow is the same as the Linux example above, just with the FX class in place of RR.
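As a rough sketch, the FX equivalent might look like the following (the pid, priority and quanta values are placeholders; check priocntl(1) on your release for the exact options):

# Move a process into the FX class with a fixed user priority of 60
# (upper limit also 60) and a time quantum of 200 milliseconds
priocntl -s -c FX -m 60 -p 60 -t 200 -i pid 30318

# Verify the class, priority and quantum
priocntl -d 30318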

Now, as to which processes (background/shadow) need to have a higher priority than others, that is a decision which requires a significant amount of testing. I have seen 30% improvements in load timings when changing scheduling properties; however, it has the potential to completely break the environment if not done correctly.

Interestingly enough, when running Oracle RAC on Linux, you would notice that the lms processes are running under the RR scheduling class.


dbrac{720}$ ps -flycae |grep ora_lms

S    UID    PID   PPID CLS   PRI   CMD

S    oracle 9074  1    RR    41    ora_lms0_DWRAC
S    oracle 9078  1    RR    41    ora_lms1_DWRAC


[root@dbrac root]# ./priocntl -d 30306
REAL TIME PROCESSES
PID     RTPRI    TQNTM
30306   50       99

Tuesday, January 26, 2010

ora-4030 errors and swap sizing on DataWarehousing platforms

"ora-04030: out of process memory when trying to allocate %s bytes (%s,%s)"

Now, what would ORA-4030 errors have to do with swap sizing?

My perspective is limited to Data Warehouse systems, so this would probably be relevant to such environments only. This is also on the Solaris platform, though a similar analogy could work on Linux too.

In DataWarehousing, PGA sizes can get fairly large and overall PGA consumption can vary dramatically depending on the type and nature of extracts/queries and the degree of parallelism.

I have seen PGA consumption as high as 2.5x the PGA_AGGREGATE_TARGET (PAT).

Since PGA is anonymous memory (private to the process), it needs to have swap reservations in place. In other words, an equivalent amount of the PGA would be allocated in swap to serve as a backing store. This backing store would be used in case of actual physical memory shortages. If using DISM, there would also be swap reservations in place for the SGA. I have used the words reservation and allocation, and for the sake of this discussion they mean the same thing.

The important point to keep in mind is that regardless of whether you have actual memory shortages resulting in swap usage, swap allocation/reservation would always occur.
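On Solaris, you can watch both sides of this with the swap command (the output fields are summarized in the comments; the numbers are of course system specific):

# Virtual swap accounting: "allocated + reserved = used, available".
# The reserved figure grows as processes map anonymous memory (PGA),
# even if nothing has actually been paged out to the swap device.
swap -s

# Physical swap devices and the free blocks remaining on each
swap -l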

Consider this scenario - (Assuming /tmp is empty)

Operating System - Solaris
Memory on system = 64GB
Swap Device on Disk = 25GB

Memory Requirements
OS + other apps = 2GB
SGA = 12GB (ISM and so no swap requirements)
PAT = 24GB

If all your processes are consuming 24GB of aggregated PGA, then the swap backing store would need to be 24GB in size. Including the OS requirements, the total swap backing store would need to be 26GB in size.

In Solaris, this swap reservation/allocation can be met by a combination of free physical memory and/or physical swap device.

In this case, it is quite possible that the 26GB of swap backing store can be met entirely from free physical memory. If not, then sufficient space from the physical swap device (25GB in our case) would be used.

Now what happens when we start exceeding 24GB? Let us look at peak PGA usage.

Peak PGA usage (2X PAT) = 48GB
Total Peak Memory usage = 48GB + 12GB + 2GB = 62GB
Available free physical memory = 64GB - 62GB = 2GB

48GB of PGA would need 48GB of swap reservation. 2GB of OS requirements would also require swap reservations for a total of 50GB of swap reservations.

Since the peak total memory requirement is 62GB, only 2GB of free physical memory is available. The physical swap device is 25GB in size, thus making for only 27GB of possible available swap space. Obviously, the system cannot handle 48GB of PGA consumption. Swap is short by 23GB (50GB - 27GB).

Even though there is still 2GB free memory, you will definitely encounter ORA-4030 errors. Along with the 4030 errors, you would also see in the /var/adm/messages file

messages:Jan 20 20:20:56 oradb genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9489 (oracle)
messages:Jan 20 20:23:15 oradb genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9327 (oracle)



So how much can the PGA grow with a 25GB Swap Device (in the above scenario)? Somewhere around 36GB would be about right.

PGA = 36GB
SGA = 12GB
OS = 2GB

Total consumed Physical memory = 50GB (36GB + 12GB + 2GB)
Available free physical memory = 14GB (64GB - 50GB)
Available Swap = (64-50) + 25GB = 39GB

So the recommendation would be to have physical swap devices of at least 3x PAT (assuming your peak PGA utilization is 2.5x PAT). This way you would not run into ORA-4030 errors due to insufficient swap space.
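A back-of-the-envelope way of expressing that rule of thumb (the numbers plugged in are from the scenario above; adjust PAT and the peak factor to your own environment):

# Rough swap sizing for a DW host using the 3x PAT rule of thumb
PAT_GB=24            # pga_aggregate_target in GB
PEAK_FACTOR=25       # expected peak PGA as a multiple of PAT, in tenths (2.5x)
OS_GB=2              # OS and other anonymous memory

PEAK_PGA_GB=$(( PAT_GB * PEAK_FACTOR / 10 ))
SWAP_GB=$(( PAT_GB * 3 ))

echo "Estimated peak PGA       : ${PEAK_PGA_GB} GB"
echo "Swap reservations needed : $(( PEAK_PGA_GB + OS_GB )) GB"
echo "Suggested physical swap  : ${SWAP_GB} GB (~3x PAT)"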