Saturday, February 28, 2009

UltraSparc T1 versus UltraSparc IV+

There was an interesting thread on OraFAQ about poor Oracle performance on a T2000/T5240 versus an older V440. I tested the T2000 when it was first released (2005?) for one of our data warehouse Oracle databases and found it to be orders of magnitude slower than the UltraSparc IV+, but I never really ventured to find out why it was slow.

The thread on OraFAQ really piqued my interest and I decided to do some simple tests. I did not use Oracle or any compute-intensive apps, since the T2000 is handicapped when it comes to floating-point instructions. While it is well known that the Tx series chips are not designed for long-running single-threaded processes, I wanted to see exactly why it was not performing as well as other CPUs.

The Specs

The differences in CPU are plenty, but the below would probably be of interest:

0. Cores - The UltraSparc IV+ CPU is essentially 2 UltraSparc III CPUs bolted together as a dual-core CPU. The T1 in the T2000 has 8 cores put together in a single CPU. Each of the 8 cores supports 4 HW threads, with only 1 thread per core executing at a time.

The OS sees each of the HW threads as an individual CPU and schedules processes (LWPs) on them. Internally, each T2000 core switches between its HW threads every clock cycle, and also whenever a HW thread stalls due to memory latency. Compared with a conventional CPU, it would be fair to say that 8 cores = 8 CPUs (and not 32 CPUs), since only 1 HW thread can be active on a core at a given point in time.

1. Core Speeds - UltraSparc IV+ is at 1.5GHz whereas the T1 is at 1.2GHz.

2. Pipeline - An UltraSparc IV+ core has a 14-stage pipeline and is 4-way superscalar, whereas a T1 core has a 6-stage pipeline and a scalar (single-issue) design. The T1 core supports 4 threads, but executes only 1 thread at a time. It switches threads every cycle (as long as there is more than 1 thread to run) or when a thread is stalled.

3. Cache -
  • L1 Cache/core- 64K/64K (I/D) on UltraSparc IV+ whereas 16K/8K (I/D) on the T1.
  • L2 Cache/CPU - 2M shared between 2 cores on UltraSparc IV+ whereas 3M shared between 8 cores (32 threads) on T1.
  • L3 Cache/CPU - 32M on UltraSparc IV+ versus none on T1.

The test

I did a simple word count (wc -l) of a 2GB file on both the T2000 (1 CPU @ 1.2GHz with 8 cores, 32 threads) and a V490 (4 CPUs @ 1.5GHz with 2 cores each). I also tested on a V240 (2 UltraSparc III CPUs at 1GHz). The UltraSparc IV+ is really 2 UltraSparc III CPUs bolted together, so the UltraSparc IV+ and III have the same pipelines but differ in their cache levels.

In order to level the playing field, I did the following.

0. Created processor sets on all 3 systems.
  • T2000 - Set 1 with 1HW thread from 1 core. Set 2 with the remaining HW threads from the same core.
  • V490 - Set 1 with 1 core from 1 CPU. Set 2 with the 2nd core on the same CPU.
  • V240 - Set 1 with 1 CPU.
1. Bound the test process to processor Set 1 on all the systems. I did this to eliminate or reduce interprocessor cross-calls and thread migrations. The intention was to ensure the L1/L2 cache on the core served only this thread and to reduce cache misses.

2. Created processor Set 2 because Solaris will not run any other process on a processor set unless specifically instructed to do so. This ensures that the core on the T2000, or the CPU on the V490, is completely dedicated to servicing the test process.

3. Changed the process to the FX scheduling class and increased the time quantum to 1000ms. This is to eliminate involuntary context switching and make sure the process gets sufficient time to complete its activities.

4. Disabled interrupts on processor Set 1 and 2 on all systems. This is to prevent the bound core/CPU from processing interrupts while the test process is running.
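
The steps above can be sketched as a small dry-run helper that only builds the Solaris command lines and does not execute anything. This is a rough illustration, not the exact commands I typed: the CPU IDs, the processor set IDs (assumed here to come back as 1 and 2 from psrset -c) and the PID are all hypothetical placeholders, and the psrset/priocntl options should be checked against the man pages on your release.

```python
# Dry-run sketch: builds Solaris command lines for the setup steps above.
# Nothing is executed. CPU IDs, set IDs and the PID are hypothetical.

def setup_commands(test_cpu, sibling_cpus, pid,
                   set1_id=1, set2_id=2, quantum_ms=1000):
    """Return the setup commands as a list of argument lists."""
    cmds = []
    # Step 0: Set 1 holds the single test CPU, Set 2 holds its siblings.
    # psrset -c prints the ID it assigns; 1 and 2 are assumed here.
    cmds.append(["psrset", "-c", str(test_cpu)])
    cmds.append(["psrset", "-c"] + [str(c) for c in sibling_cpus])
    # Step 1: bind the test process to Set 1.
    cmds.append(["psrset", "-b", str(set1_id), str(pid)])
    # Step 2 needs no command: Set 2 stays idle because nothing is bound to it.
    # Step 3: move the process to the FX class with a long time quantum (ms).
    cmds.append(["priocntl", "-s", "-c", "FX",
                 "-t", str(quantum_ms), "-i", "pid", str(pid)])
    # Step 4: disable interrupt handling on both sets.
    cmds.append(["psrset", "-f", str(set1_id)])
    cmds.append(["psrset", "-f", str(set2_id)])
    return cmds


if __name__ == "__main__":
    # Hypothetical T2000 layout: CPU 0 plus siblings 1-3 on the same core.
    for cmd in setup_commands(test_cpu=0, sibling_cpus=[1, 2, 3], pid=12345):
        print(" ".join(cmd))
```

On the V490 the same helper would be called with one core ID as test_cpu and the second core of that CPU as the only sibling.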

Monitoring

I captured stats using the below:

1. mpstat - To capture CPU stats

2. cputrack - To capture instruction counts, CPU ticks and L2 data-cache misses.
  • For T2000 - cputrack -evf -t -T1 -c pic0=L2_dmiss_ld,sys,pic1=Instr_cnt,sys -p
  • For V490 - cputrack -evf -t -T1 -c pic0=L2_rd_miss,sys,pic1=Instr_cnt,sys -p
  • For V240 - cputrack -evf -t -T1 -c pic0=EC_rd_miss,sys,pic1=Instr_cnt,sys -p
3. ptime - To capture the time taken to do the word count (wc -l). The command was
  • ptime wc -l test_file

The results

As expected, the UltraSparc IV+ CPU was 3.5 times faster than the T1. The UltraSparc III CPU was 2 times faster than the T1.

Notice the instructions processed per cycle: all other things being equal, the number of instructions processed per CPU cycle determines how fast your application will run.
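
To make that relationship concrete, here is a back-of-envelope sketch. It assumes that run time = instructions / (instructions per cycle x clock rate) and that the same wc binary executes roughly the same number of instructions on every system. The function names are my own, and the ratios it produces are implied by the speedups and clock rates quoted above, not read from the counters.

```python
# Back-of-envelope model: run time = instructions / (IPC * clock rate).
# Assumes the instruction count is the same on both systems.

def runtime_seconds(instructions, ipc, clock_hz):
    """Estimated run time for a given instruction count."""
    return instructions / (ipc * clock_hz)


def implied_ipc_ratio(speedup, fast_clock_hz, slow_clock_hz):
    """IPC advantage needed to explain a speedup after removing the clock gap."""
    return speedup * slow_clock_hz / fast_clock_hz


if __name__ == "__main__":
    # UltraSparc IV+ (1.5GHz) vs T1 (1.2GHz), 3.5 times faster overall.
    print(round(implied_ipc_ratio(3.5, 1.5e9, 1.2e9), 2))
    # UltraSparc III (1.0GHz) vs T1 (1.2GHz), 2 times faster overall.
    print(round(implied_ipc_ratio(2.0, 1.0e9, 1.2e9), 2))
```

In other words, the clock speed difference alone explains very little of the gap; most of it has to come from the instructions retired per cycle.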

In the tests, the T2000 lags considerably behind the other systems. This could be due to the fact that it has only a 6 stage scalar pipeline, whereas the others are 14 stage and superscalar.

Ideally, a CPU would process instructions at the rate of 1 instruction per cycle (assuming there are no stalls due to memory latency). In reality, this does not happen. CPU designers work around this with various tricks such as deep pipelines, superscalar architectures, big caches etc.

Surprisingly, even though the T1 chip has a smaller L2 cache than the UltraSparc III CPU, it has fewer cache misses per cycle. This could be because it is processing fewer instructions per second; when you look at the L2 D misses per instruction, it becomes evident that this is indeed the reason.
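
The arithmetic behind this is simple: misses per cycle = misses per instruction x instructions per cycle, so a core with a low instruction rate can show fewer misses per cycle even when each instruction misses more often. The sketch below derives the three ratios from raw counter values of the kind cputrack reports. The counter values are made-up numbers chosen only to illustrate the effect; they are not my measurements.

```python
# Derive IPC and L2 miss ratios from raw counters (cycles, instructions, misses).
# The sample counter values below are hypothetical, for illustration only.

def derived_metrics(cycles, instructions, l2_misses):
    """Return (ipc, misses_per_instruction, misses_per_cycle)."""
    ipc = instructions / cycles
    misses_per_instruction = l2_misses / instructions
    misses_per_cycle = l2_misses / cycles
    return ipc, misses_per_instruction, misses_per_cycle


if __name__ == "__main__":
    # Hypothetical slow core: low IPC, slightly worse miss rate per instruction.
    slow = derived_metrics(cycles=1_000_000, instructions=250_000, l2_misses=2_500)
    # Hypothetical fast core: high IPC, better miss rate per instruction.
    fast = derived_metrics(cycles=1_000_000, instructions=700_000, l2_misses=5_600)
    for name, (ipc, mpi, mpc) in (("slow", slow), ("fast", fast)):
        print(f"{name}: IPC={ipc:.2f} miss/instr={mpi:.4f} miss/cycle={mpc:.4f}")
```

With these made-up inputs the slow core looks better on misses per cycle and worse on misses per instruction, which is the same pattern described above.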

For single-threaded, long-running processes, a CPU with deep pipelines and a superscalar architecture will be the best fit. For heavily multithreaded workloads, where a single thread does a small amount of work and exits, the T2000 will scale better than the UltraSparc IV+.

In either case, for light to medium workloads with any application where latency is important, the UltraSparc IV+ or Intel/AMD CPUs will be the best fit. As you start increasing the workload and expect scalability to be sustained, the T2000 will provide more linear scalability than the UltraSparc IV+ platform.

A T2000 is like a truck: it can carry light, heavy or super-heavy loads at 60MPH. Whatever the load, it will run only at 60MPH.

The UltraSparc IV+ and similar CPUs, on the other hand, are like a Porsche or a BMW: with no load or light to medium loads they run at speeds >>> 60MPH, but as the load increases you will see the speed drop and ultimately hit 0 MPH (evident with x86).
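
A toy model helps show why more threads help a T1-style core. The sketch below simulates one single-issue core that picks a ready HW thread round-robin each cycle, where every thread issues a few instructions and then stalls on memory. It is a deliberate simplification and not a model of the real T1 pipeline; the run length and stall length are hypothetical values.

```python
# Toy model of a single-issue core that interleaves HW threads round-robin.
# run_len and stall_cycles are hypothetical; this is not the real T1 pipeline.

def simulate_core(n_threads, cycles, run_len=4, stall_cycles=12):
    """Count instructions issued by one core over the given number of cycles."""
    ready_at = [0] * n_threads     # first cycle at which each thread may issue
    since_stall = [0] * n_threads  # instructions issued since the last stall
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # At most one instruction per cycle, from the next ready thread.
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready_at[t] <= cycle:
                issued += 1
                since_stall[t] += 1
                if since_stall[t] == run_len:
                    # Thread misses in cache and waits for memory.
                    since_stall[t] = 0
                    ready_at[t] = cycle + 1 + stall_cycles
                next_thread = (t + 1) % n_threads
                break
    return issued


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", simulate_core(n, cycles=1600), "instructions")
```

In this model a lone thread leaves the core idle during every stall, while four threads fill most of those idle cycles. Each individual thread gets no faster, though; only the total work done by the core goes up.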

I have used the T2000 for Tibco Business Works with great success. It fits the bill perfectly for this application. However, for data warehousing, it would be a very poor fit.