On Tuesday 12 June 2001 12:29, Alan Cox wrote:

> If your algorithm can work well with say 2Gb windows on the data and only
> change window evey so often (in computing terms) then it should be ok, if
> its access is random you need to look at a 64bit box like an Alpha, Sparc64
> or eventually IA64

No, eventually Sledgehammer.

You know that IA64 will never ship, because the performance will always suck. 
 The design is fundamentally flawed.  The real bottleneck is the memory bus.  
These days they're clock multiplying the inside of the processor by a factor 
of what, twelve?  Fun as long as you're entirely in L1 cache and aren't task 
switching.  But that's basicaly an embedded system.

CISC instructions are effectively compressed.  When you exhaust your cache 
your next 64 bit gulp can get more than 2 instructions, you can get more like 
5 (which still means you're getting about 1/4 the performance of in-cache 
execution, although L2 and L3 caches help here, of course..)

But optimizing for something that isn't actually your bottleneck is just 
dumb, and that's exactly what Intel did with the move to VLIW/EPIC/IA64.  3 
times 64 bit instructions is 192, whereas Pentium can fit more than that in a 
single 64 bit chunk.  Brilliant.  You need what, a 6x larger cache just to 
break even with the amount of time you're running in-cache?  And of course 
the compiler is GOING to put NOPs in that because it won't always be able to 
figure out something for the second and third cores to do this clock, 
regardless of how good a compiler it is.  That's just beautiful.

This is why they were so desperate for RAMBUS.  They KNOW the memory bus is 
killing them, they were grasping at straws to fix it.  (Currently they're 
saying that a future vaporware version of iTanium will fix everything, but 
it's a fundamental design flaw: the new instruction set they've chosen is WAY 
less efficient going across the memory bus, and that's the real limiting 
factor in performance.)

Transmeta at least sucks CISC in from main memory to keep the northbridge 
bottleneck optimized.  And they have a big cache, and they're still using 32 
bit instructions so they only need 96 bytes per chunk (or atom or whatever 
they're calling it these days).

Sledgehammer is the interesting x86 64 bit instruction set.  You still have 
the cisc/risc hybrid that made pentium work.  (And, of course, backwards 
compatability that's inherent in the design rather than bolted onto the side 
with duct tape and crazy glue.)  Yeah the circuitry to make it work is 
probably fairly insane, but at least the Athlon design team made all the racy 
logic go away so they can migrate the design to new manufacturing processes 
without four months of redesign work.  (And no, making insanely long 
pipelines in Pentium 4 is not a major improvement there either.  Cache 
exhaustion still kills you, so does branch misprediction.  Stalling a longer 
pipeline is more expensive.)

I saw an 800 mhz iTanium which benchmarked about the same (running 64 bit 
code) as a Penium III 300 mhz running 32 bit code.  That's just sad.  Go 
ahead and blame the compiler.  That's still just sad.  And a design flaw in 
the new instruction set is not something that can be easily optimized away.

Rob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to