Re: [perf-discuss] Performance of 32-bit vs 64-bit benchmark

Darryl Gove Fri, 30 Jan 2009 15:54:21 -0800


On 01/30/09 02:39 PM, Elad Lahav wrote:
>> I don't think that follows.
>>
>> You've put together a compute intensive benchmark - so the constraint 
>> on that code is how many instructions per second you can execute.
>>
>> It doesn't follow that it is analogous to a webserver. A webserver is 
>> not generally so compute intensive. If there's lots of memory stall, 
>> then the T1/T2 will use that stall time to get useful work done on 
>> other threads, most other processors will have idle pipelines until 
>> the stall resolves. [A benchmark for this situation would be something 
>> like running multiple copies of the latency measurements in lmbench.]
>>
>> This is a workload characterisation question. Just because platform A 
>> underperforms platform B on workload C it does not follow that it 
>> will also underperform on workload D.
> 
> I never claimed the compute-intensive benchmark is indicative of 
> web-server performance. I was claiming (and I may be wrong) that the 
> difference between the T1 and the Xeon on this benchmark is supposed to 
> give a lower bound on the difference in any other workload, since it is 
> heavily biased in favour of the T1's strengths.


I'm not sure that it is heavily biased. That's the argument I'm making. 
The code is compute intensive, so basically your test is a test of pure 
instruction throughput.

However, the T1 is really designed for memory intensive codes. The ideal 
situation is where you have 32 threads stalled waiting on memory vs 4 on 
a different CPU. Hence the best case for the T1 would be made using 
multiple copies of a memory latency test (not that this is a very useful 
test ;).

So I don't think this test represents a lower bound. It's just another 
data point.
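
To make the 32-versus-4 point concrete, here is a minimal sketch of a latency-hiding model in C. It is a simplification, not a measurement: the name pipeline_utilisation and its parameters are made up for illustration, every thread is assumed to alternate a fixed number of issue cycles with a fixed memory stall, and switching between threads is assumed to be free.

```c
/*
 * Fraction of cycles in which one pipeline issues an instruction,
 * under a deliberately simple model (an assumption, not a measurement):
 * each hardware thread alternates `work` cycles of issue with `stall`
 * cycles of waiting on memory, and the pipeline can pick any ready
 * thread at no cost. `work` must be greater than zero.
 */
double pipeline_utilisation(int threads, double work, double stall)
{
    double u = threads * work / (work + stall);

    /* A pipeline cannot be more than fully busy. */
    return u > 1.0 ? 1.0 : u;
}
```

With 10 cycles of work per 90 cycles of stall, one thread keeps the pipeline 10% busy and four threads sharing it (as on a T1 core) keep it 40% busy. With a stall of zero, which is roughly the compute-bound case, a single thread already saturates the pipeline and the extra threads add nothing.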

> If anything, I would 
> expect the difference between the two processors to grow once we exit 
> the realm of perfectly-parallelised, integer-compute-intensive 
> applications.

I would disagree. This domain is likely to be the best domain for the 
Xeon - the data probably incurs few cache misses, or sees significant 
reuse. So it's just how fast instructions can be executed. The Xeon is 
clocked higher and is superscalar, plus it has multiple cores. I'm sure 
you can calculate the peak instruction issue rate and compare that to 
the T1.
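
The peak-issue comparison is a one-line calculation. Here is a small C sketch of it; the function name peak_issue_rate is hypothetical, and it assumes every core can issue its full width on every cycle, which real code rarely manages.

```c
/*
 * Peak instruction issue rate in billions of instructions per second:
 * cores x instructions issued per core per cycle x clock in GHz.
 * This is an upper bound only; it ignores stalls of every kind.
 */
double peak_issue_rate(int cores, int issue_width, double clock_ghz)
{
    return cores * issue_width * clock_ghz;
}
```

For instance, peak_issue_rate(8, 1, 1.0) gives 8 for an eight-core, single-issue part at 1.0 GHz, and peak_issue_rate(4, 4, 2.5) gives 40 for a quad-core, four-wide part at 2.5 GHz. Those core counts, widths and clocks are placeholder inputs, not the specifications of the machines in this thread; substitute the real values.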


> 
> > It doesn't follow that it is analogous to a webserver. A webserver is
> > not generally so compute intensive. If there's lots of memory stall,
> > then the T1/T2 will use that stall time to get useful work done on
> > other threads, most other processors will have idle pipelines until
> > the stall resolves. [A benchmark for this situation would be something
> > like running multiple copies of the latency measurements in lmbench.]
> 
> The quad-core Xeon would compensate for the stalling with superscalar 
> mechanisms.

No, superscalar doesn't compensate for stalling. It allows the processor 
to issue more instructions per cycle (i.e. it's faster when there's work 
ready). Being out-of-order does enable the processor to compensate, but 
only if there's independent work to be done.

For example, take two dependent loads, where the second uses the address 
returned by the first:

    ld [%g1],%g1
    ld [%g1],%g1

So the second instruction cannot execute until the first completes. In 
this instance an OoO processor would perhaps get a few instructions 
ahead before its reorder buffer fills up and it stalls. The T1/T2 would 
stall that thread, and other threads would get the use of the pipeline. 
The length of the stall depends on memory latency, which is likely to be 
similar on the two platforms.
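
The same dependent-load pattern can be written in C as a pointer chase, which is the general shape of a memory latency test. This is a minimal sketch, not the lmbench code; the names chase and build_ring are made up for illustration.

```c
#include <stddef.h>

/*
 * Follow a chain of `steps` dependent loads starting at `start`.
 * Each slot holds the address of the next slot, so the address for
 * load k+1 is not known until load k has returned.
 */
void **chase(void **start, size_t steps)
{
    void **p = start;

    while (steps-- > 0)
        p = (void **)*p;    /* same shape as: ld [%g1],%g1 */
    return p;
}

/* Link `n` slots into a chain: slot i points to slot (i + stride) % n. */
void build_ring(void **slots, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        slots[i] = (void *)&slots[(i + stride) % n];
}
```

To turn this into a rough latency measurement you would time a long chase over a buffer much larger than the caches. A fixed stride is used here only to keep the sketch short; hardware prefetchers may recognise it, so a shuffled chain is likely a better choice for real timing.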


> 
> I am not trying to argue against the T1 in any way. This specific Xeon 
> processor is 2 years newer, runs faster and has a much larger L2 cache 
> than the T1. I suspect that it also consumes much more energy and 
> produces more heat.


Sure, I know that, nor am I trying to dis the Xeon ;)

> 
> The only point of this exercise was to assess, based on the results of 
> both the macro and micro benchmarks, whether it would be worthwhile to 
> invest more time in optimising and tuning the T1000 to get comparable 
> SPECweb results to those of the Xeon machine. I think that the answer is 
> that it will do no use to tune it further, since there is an inherent 
> advantage to the Xeon in all performance measurements. Again, I may be 
> wrong, and would be glad to hear different opinions.

I was unable to locate the SPECweb numbers for this particular variant 
of Xeon. I suspect that the T2000 numbers would be for a higher clock 
than the T1000 numbers, but the pair of them should give some idea of 
what the T1 can achieve. The Xeon numbers that I did locate seemed to 
indicate that it would be faster than the T1, but an exact comparison 
wasn't possible.

So I would agree with your conclusion, but not necessarily the method 
behind it.

However, I fear we may have bored some of the other folks on this alias, 
so feel free to harangue me directly :)

Regards,

Darryl.


> 
> Thanks,
> --Elad

-- 
Darryl Gove
Compiler Performance Engineering
Blog: http://blogs.sun.com/d/
Book: http://www.sun.com/books/catalog/solaris_app_programming.xml
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
