On 01/30/09 02:39 PM, Elad Lahav wrote:
>> I don't think that follows.
>>
>> You've put together a compute intensive benchmark - so the constraint
>> on that code is how many instructions per second you can execute.
>>
>> It doesn't follow that it is analogous to a webserver. A webserver is
>> not generally so compute intensive. If there's lots of memory stall,
>> then the T1/T2 will use that stall time to get useful work done on
>> other threads, most other processors will have idle pipelines until
>> the stall resolves. [A benchmark for this situation would be something
>> like running multiple copies of the latency measurements in lmbench.]
>>
>> This is a workload characterisation question. Just because platform A
>> underperforms platform B on workload C it does not follow that it
>> will also underperform on workload D.
>
> I never claimed the compute-intensive benchmark is indicative of
> web-server performance. I was claiming (and I may be wrong) that the
> difference between the T1 and the Xeon on this benchmark is supposed to
> give a lower bound on the difference in any other workload, since it is
> heavily biased in favour of the T1's strengths.

I'm not sure that it is heavily biased; that's the argument I'm making.
The code is compute intensive, so your test is basically a test of pure
instruction throughput. However, the T1 is really designed for
memory-intensive codes. The ideal situation is where you have 32 threads
stalled waiting on memory vs 4 on a different CPU. Hence the best case
for the T1 would be made using multiple copies of a memory latency test
(not that this is a very useful test ;). So I don't think this test
represents a lower bound. It's just another data point.

> If anything, I would
> expect the difference between the two processors to grow once we exit
> the realm of perfectly-parallelised, integer-compute-intensive
> applications.

I would disagree. This domain is likely to be the best domain for the
Xeon - the data probably has few cache misses, or significant reuse. So
it's just a question of how fast instructions can be executed. The Xeon
is clocked higher and is superscalar, plus it has multiple cores. I'm
sure you can calculate the peak instruction issue rate and compare that
to the T1.

>
>> It doesn't follow that it is analogous to a webserver. A webserver is
>> not generally so compute intensive. If there's lots of memory stall,
>> then the T1/T2 will use that stall time to get useful work done on
>> other threads, most other processors will have idle pipelines until
>> the stall resolves. [A benchmark for this situation would be something
>> like running multiple copies of the latency measurements in lmbench.]
>
> The quad-core Xeon would compensate for the stalling with superscalar
> mechanisms.

No, superscalar doesn't compensate for stalling. It allows the processor
to issue more instructions per cycle (i.e. it's faster when there's work
ready). Being out-of-order does enable the processor to compensate, but
you can only compensate if there's work to be done. For example:

    ld [%g1],%g1
    ld [%g1],%g1

The second instruction cannot execute until the first completes, because
the address it loads from is the value that the first load returns.
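
The dependent-load pair above is the core of a pointer-chasing latency
test. A minimal C sketch follows, assuming a randomly linked chain of
array indices; the names build_chain and chase are hypothetical and this
is not how lmbench itself is implemented. A real latency test would also
space the nodes at least a cache line apart, which this sketch does not
do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link next[0..n-1] into one cycle that visits
 * every slot exactly once (Sattolo's algorithm), so that following the
 * links touches memory in a hard-to-predict order. */
static void build_chain(size_t *next, size_t n, unsigned seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i, never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Hypothetical helper: follow the chain for the given number of steps.
 * Each load needs the result of the previous one for its address, like
 * the back-to-back "ld [%g1],%g1" pair, so the loads cannot overlap. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Timing a long call to chase over an array much larger than the caches,
and dividing by the step count, gives a rough per-load latency. Running
one copy per hardware thread is the kind of test where stalled threads
leave the pipeline free for others.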

In this instance an OoO processor would perhaps get a few instructions
ahead before its reorder buffer fills up and it stalls. On the T1/T2 the
thread would stall, and other threads would get the use of the pipeline.
The length of the stall depends on memory latency, which is likely to be
similar on the two platforms.

>
> I am not trying to argue against the T1 in any way. This specific Xeon
> processor is 2 years newer, runs faster and has a much larger L2 cache
> than the T1. I suspect that it also consumes much more energy and
> produces more heat.

Sure, I know that, nor am I trying to dis the Xeon ;)

>
> The only point of this exercise was to assess, based on the results of
> both the macro and micro benchmarks, whether it would be worthwhile to
> invest more time in optimising and tuning the T1000 to get comparable
> SPECweb results to those of the Xeon machine. I think that the answer is
> that it will be of no use to tune it further, since there is an inherent
> advantage to the Xeon in all performance measurements. Again, I may be
> wrong, and would be glad to hear different opinions.

I was unable to locate the SPECweb numbers for this particular variant
of Xeon. I suspect that the T2000 numbers would be for a higher clock
than the T1000 numbers, but the pair of them should give some idea of
what the T1 can achieve. The Xeon numbers that I did locate seemed to
indicate that it would be faster than the T1, but an exact comparison
wasn't possible. So I would agree with your conclusion, but not
necessarily the method behind it.

However, I fear we may have bored some of the other folks on this alias,
so feel free to harangue me directly :)

Regards,

Darryl.

>
> Thanks,
> --Elad

-- 
Darryl Gove
Compiler Performance Engineering
Blog: http://blogs.sun.com/d/
Book: http://www.sun.com/books/catalog/solaris_app_programming.xml
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org