On Thu, 2007-10-18 at 13:04 +0200, Jakub Jelinek wrote:
> On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote:

> On LU_mp.c according to oprofile more than 95% of time is spent in the inner
> loop, rather than any kind of waiting.  On quad core with OMP_NUM_THREADS=4
> all 4 threads eat 99.9% of CPU and the inner loop is identical between
> OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4.  I believe this benchmark is highly
> memory bound rather than CPU intensive, so the relative difference between
> OMP_NUM_THREADS={1,2,4} is very likely not in what GCC or other OpenMP
> implementation does, but in what kind of cache patterns it generates.

This does seem quite plausible. However, LU decomposition is a
typical numerical routine, so if it's memory bound, that limits
the utility of shared-memory parallelism quite significantly:
it seems like you're better off copying the matrix to another
computer, just to get a second memory bus, since that's the
bottleneck.
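
For concreteness, the kind of elimination loop I assume is at issue
looks roughly like this (a sketch only, not the actual LU_mp.c source):

    void lu_eliminate_step(double *a, int n, int k)
    {
        /* Eliminate column k below the pivot; rows are independent,
           so the outer loop parallelises trivially with OpenMP. */
    #pragma omp parallel for
        for (int i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (int j = k + 1; j < n; j++)
                /* one multiply-add, but a load and a store of
                   a[i*n+j] each time: for a large matrix the memory
                   bus, not the FPU, is the limit */
                a[i * n + j] -= m * a[k * n + j];
        }
    }

Every thread streams its own block of rows through the same bus, so
adding cores adds arithmetic capacity but no bandwidth.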

Has anyone got any comparative data on the libstdc++ parallel
mode STL code?
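
My understanding from the libstdc++ docs is that you just rebuild with
-D_GLIBCXX_PARALLEL -fopenmp and the standard algorithms dispatch to
the OpenMP implementations, so a comparison would be something like
(sketch, numbers picked arbitrarily):

    // build:  g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL sort_bench.cc
    // (the same source without -D_GLIBCXX_PARALLEL gives the
    //  sequential baseline)
    #include <algorithm>
    #include <vector>
    #include <cstdlib>

    int main()
    {
        std::vector<double> v(50000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = std::rand();

        // uses the OpenMP parallel sort when _GLIBCXX_PARALLEL is defined
        std::sort(v.begin(), v.end());
        return 0;
    }

Timing that with OMP_NUM_THREADS=1,2,4 would at least show whether the
STL algorithms scale any better than LU does on the same box.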


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
