On May 6, 1:21 am, Peter Jeremy <peterjer...@optushome.com.au> wrote:
> On 2009-May-04 19:26:13 -0700, William Stein <wst...@gmail.com> wrote:
>
> >Unfortunately, it seems that this does *NOT* mean that if you write a
> >little C program, spawn 128 threads, and watch them run, then you can
> >do 128 times what you would do with 1 thread.  You still can only do 8
> >times as much as with 1 thread.   For raw computation, I don't think
> >that processor is any better than 8 single cores.
>
> Each T2 core provides a single integer ALU and a single FPU (plus a
> crypto unit which can probably be ignored for this purpose), thus an
> 8-core T2 only has 8 ALUs.  Each core is hardware sliced between 8
> threads - giving the 64 threads-per-chip.  When a thread stalls (eg
> waiting for a memory access), the hardware will switch that core to a
> different thread.  For small programs, this doesn't gain you much
> because everything is cached.  You should see decent speedups on code
> that is cache-busting: If you try running 128 copies of a program to
> (eg) transpose a 2000x2000 matrix of longs or doubles then you should
> see better speedup.

Well, code like ATLAS is tuned to minimize cache misses, so I
seriously doubt that the design of the T2 will give us any benefit
beyond 8 simultaneously running threads.  More or less the same applies
to arbitrary precision arithmetic.
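
To make the argument concrete, here is a toy model (not from the thread, and not a measurement) of one core that switches to another hardware thread whenever the running one stalls. It assumes ideal interleaving and that each thread spends a fixed fraction of its time computing; the function name and the parameter `compute_fraction` are made up for illustration.

```python
def core_speedup(threads_per_core, compute_fraction):
    """Speedup of one core relative to running a single thread on it.

    Toy model, assumptions only:
      - each thread computes for `compute_fraction` of its time and is
        stalled (e.g. on memory) for the rest;
      - the hardware hides stalls perfectly by switching threads;
      - the core has a single ALU, so utilisation is capped at 1.0.
    """
    if not 0.0 < compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in (0, 1]")
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    utilisation = min(1.0, threads_per_core * compute_fraction)
    return utilisation / compute_fraction


if __name__ == "__main__":
    # 1.0 ~ cache-resident code, 0.125 ~ heavily memory-bound code
    for fraction in (1.0, 0.5, 0.125):
        row = [core_speedup(k, fraction) for k in (1, 2, 4, 8)]
        print(fraction, row)
```

In this model, code that never stalls (`compute_fraction` of 1.0) gains nothing from extra threads on a core, while code that stalls most of the time can approach the full 8x per core. Cache-tuned code like ATLAS sits near the first case.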

> Note that for real-world apps, kernel and serialisation overheads
> can seriously hurt you.

I don't think that is much of an issue with HPC code; it plays a
much bigger role with DB/Java/webserver workloads, where the T2
cuts a pretty good figure. But Intel has been
catching up fast in that area, i.e. in a year or two a 2-socket Xeon
might be comparable in throughput if Sun keeps bungling their
hardware roadmap. And given the Oracle buy as well as the history of
Rock I guess you can see why I am not exactly optimistic here.

> BTW, anyone looking at using a T1 should be aware that it only has a
> single FPU shared by all cores.  This means FP performance is "poor".

Well, the T1 had a GMP score comparable to the 68040. The T2 is about
10 times faster for a single thread than the T1 (with the above
potential scaling issues), but it still lags the current Xeon
generation by about a factor of 30 per core. So even if the T2 scaled
perfectly to 128 threads on that benchmark it would still be beaten by a
single 6-core Xeon of the Dunnington generation. And there are already
Xeons based on the i7 generation out. Some of this is certainly due to
the worse assembly support for Sparc64 in GMP (and MPIR) compared to
x86-64, but even if this gets fixed and performance improves by the
factor of 10 that some people have thrown around, this isn't enough IMHO.
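
The back-of-the-envelope arithmetic behind that comparison can be written out as a short sketch. It uses only the rough figures quoted above (a factor of about 30 per Xeon core, 128 threads, 6 Xeon cores) and reads the factor of 30 as "one Xeon core versus one T2 thread", which is an interpretation. The function and constant names are hypothetical.

```python
# Rough figures taken from the post; none of these are measurements.
XEON_CORE_VS_T2_THREAD = 30   # "about a factor of 30 per core"
XEON_CORES = 6                # one Dunnington-generation Xeon


def relative_throughput(usable_t2_threads, asm_gain=1.0):
    """T2 throughput divided by the throughput of one 6-core Xeon.

    usable_t2_threads: how many threads the benchmark really scales to
    asm_gain: assumed speedup from better Sparc64 assembly (1.0 = today)
    A result below 1.0 means the single Xeon wins.
    """
    t2 = usable_t2_threads * asm_gain          # in units of one T2 thread
    xeon = XEON_CORES * XEON_CORE_VS_T2_THREAD
    return t2 / xeon


if __name__ == "__main__":
    print(relative_throughput(128))        # perfect scaling, current assembly
    print(relative_throughput(16, 10.0))   # one thread per core, 10x assembly
    print(relative_throughput(128, 10.0))  # perfect scaling and 10x assembly
```

With perfect scaling to 128 threads and today's assembly the ratio is 128/180, roughly 0.71. If only one thread per core is useful (16 cores for 128 hardware threads at 8 threads per core), even a 10x assembly gain gives 160/180, roughly 0.89, so the outcome depends heavily on how far the scaling really goes.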

Cheers,

Michael

> --
> Peter Jeremy