On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote:
>
> On Thu, 2007-10-18 at 12:02 +0800, Biplab Kumar Modak wrote:
> > skaller wrote:
> > > On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote:
> > >> skaller wrote:
> > >
> > >> It would be interesting to try with another compiler. Do you have access
> > >> to another OpenMP-enabled compiler?
> > >
> > > Unfortunately no, unless MSVC++ in VS2005 has OpenMP.
> > > I have an Intel licence but they're too tied up with commercial
> > > vendors and it doesn't work on Ubuntu (it's built for Fedora and Suse).
> > >
> > If possible, you can post the source code. I've a MSVC 2005 license (I
> > bought it to get OpenMP working with it).
> >
> > I can then give it a try. I have a dual core PC. :)
>
> OK, attached.
On LU_mp.c, according to oprofile, more than 95% of the time is spent in the
inner loop, rather than in any kind of waiting. On a quad core with
OMP_NUM_THREADS=4, all 4 threads eat 99.9% of CPU, and the generated inner
loop is identical between OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4. I believe
this benchmark is highly memory bound rather than CPU intensive, so the
relative difference between OMP_NUM_THREADS={1,2,4} very likely lies not in
what GCC or any other OpenMP implementation does, but in the cache access
patterns it generates.

OMP_NUM_THREADS=1 /tmp/LU_mp; OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY=0,1 /tmp/LU_mp; \
OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY=0,2 /tmp/LU_mp; OMP_NUM_THREADS=4 /tmp/LU_mp
Completed decomposition in 4.830 seconds
Completed decomposition in 5.970 seconds
Completed decomposition in 9.140 seconds
Completed decomposition in 11.480 seconds

shows this quite clearly. This Intel quad core CPU shares one 4MB L2 cache
between cores 0 and 1 and another between cores 2 and 3. So, if you run two
threads on cores sharing the same L2 cache, the run is only slightly slower
than with one thread, while running them on cores with different L2 caches
shows a huge slowdown.

So, I very much doubt you'd get much better results with other OpenMP
implementations. I believe how the 3 arrays are laid out on the stack is what
really matters most in this case; the synchronization overhead is in the
noise.

	Jakub
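For readers without the attachment: the hot loop in an LU decomposition benchmark of this kind typically looks like the sketch below. This is not the actual LU_mp.c from the thread; the function name, matrix size N, and array layout are illustrative assumptions. It shows why the trailing-submatrix update dominates the profile and why it is memory bound: every step k streams over most of the remaining matrix.

```c
#include <stdio.h>

#define N 4  /* illustrative size; the real benchmark uses a much larger matrix */

/* In-place Doolittle LU decomposition without pivoting: afterwards a[][]
   holds U on and above the diagonal and L (unit diagonal implied) below it.
   The rows i of the trailing-submatrix update are independent, so that
   loop is the natural place for the OpenMP work-sharing directive. */
static void lu_decompose(double a[N][N])
{
    for (int k = 0; k < N; k++) {
        /* This update is where >95% of the time goes in the profile:
           it touches O((N-k)^2) elements per step, mostly limited by
           memory/cache bandwidth rather than arithmetic. */
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
}
```

Compile with `gcc -fopenmp`; without the flag the pragma is simply ignored and the same code runs serially, which makes the single-threaded/multi-threaded comparison easy.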