On Thu, 2007-10-18 at 13:04 +0200, Jakub Jelinek wrote:
> On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote:
> On LU_mp.c according to oprofile more than 95% of time is spent in the inner
> loop, rather than any kind of waiting. On quad core with OMP_NUM_THREADS=4
> all 4 threads eat 99.9% of CPU and the inner loop is identical between
> OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4. I believe this benchmark is highly
> memory bound rather than CPU intensive, so the relative difference between
> OMP_NUM_THREADS={1,2,4} is very likely not in what GCC or other OpenMP
> implementation does, but in what kind of cache patterns it generates.

This does seem quite plausible. However, LU decomposition is a typical
numerical routine, so this seems to limit the utility of shared memory
parallelism quite significantly: if the memory bus is the bottleneck, it
seems you are better off copying the matrix to another computer, just to
get two memory buses.

Has anyone got any comparative data on the libstdc++ parallel mode STL code?

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
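P.S. For anyone who wants to poke at the memory-bound behaviour directly,
here is a minimal sketch of the kind of OpenMP-parallelised LU inner loop I
assume LU_mp.c contains; the actual benchmark source may well differ (pivoting,
blocking, etc.), but it shows why the rank-1 update streams the whole trailing
submatrix through the bus while doing only two flops per element:

/* Hypothetical sketch, not the actual LU_mp.c from the benchmark.
   Build: gcc -std=c99 -O2 -fopenmp lu_sketch.c -o lu_sketch */
#include <stdio.h>
#include <stdlib.h>

/* In-place LU factorisation (no pivoting) of an n x n row-major matrix.
   The rank-1 update below is the "inner loop": every trailing element is
   loaded and stored once per outer step but only receives one multiply and
   one subtract, so for matrices larger than cache the loop is bounded by
   memory bandwidth rather than by the number of cores running it. */
static void lu_factor(double *a, int n)
{
    for (int k = 0; k < n - 1; k++) {
        for (int i = k + 1; i < n; i++)
            a[i*n + k] /= a[k*n + k];          /* column of multipliers */

        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; i++) {      /* rank-1 trailing update */
            double lik = a[i*n + k];
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= lik * a[k*n + j];
        }
    }
}

int main(void)
{
    int n = 1500;
    double *a = malloc((size_t)n * n * sizeof *a);
    for (int i = 0; i < n * n; i++)            /* diagonally dominant matrix */
        a[i] = (double)rand() / RAND_MAX + (i % (n + 1) == 0 ? (double)n : 0.0);
    lu_factor(a, n);
    printf("a[0] = %g\n", a[0]);
    free(a);
    return 0;
}

With a matrix much larger than the shared cache, running that update on 2 or
4 cores mostly just divides the same bus bandwidth between them, which would
be consistent with the oprofile figures Jakub describes above.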