On 11/09/2011 07:21 PM, Pascal wrote:
On Tue, 8 Nov 2011 16:25:22 -0800,
Nat Echols<nathaniel.ech...@gmail.com> wrote:
On Tue, Nov 8, 2011 at 4:22 PM, Francois Berenger<beren...@riken.jp>
wrote:
In the past I have been unpleasantly surprised by
the lack of speedup I got when using OpenMP
with some of my programs... :(
You need big parallel jobs, and you need to avoid synchronisations, barriers
and that kind of thing. Using data reduction is much more efficient. It works
very well for structure factor calculations, for example.
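To illustrate the point about reductions versus synchronisation, here is a minimal sketch in C. The function names (sum_reduction, sum_critical) and the plain array sum are my own assumptions for illustration, not code from any of the programs discussed in this thread. With GCC the pragmas take effect when compiling with -fopenmp; without that flag they are ignored and both functions simply run serially.

```c
#include <assert.h>

/* Hypothetical example: sum an array with an OpenMP reduction.
   Each thread accumulates into its own private copy of `total`;
   the private copies are combined once, at the end of the loop. */
double sum_reduction(const double *x, long n)
{
    assert(n == 0 || x != 0);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += x[i];
    return total;
}

/* Same result, but every addition goes through a critical section,
   so the threads queue up on each iteration instead of running freely. */
double sum_critical(const double *x, long n)
{
    assert(n == 0 || x != 0);
    double total = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        #pragma omp critical
        total += x[i];
    }
    return total;
}
```

Both versions return the same sum up to floating-point rounding (the order of additions can differ between threads), but the second one synchronises on every iteration, which is the kind of pattern that tends to cancel out any gain from extra threads.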
Amdahl's law is cruel:
http://en.wikipedia.org/wiki/Amdahl's_law
It is possible to have much less than 5% serial code.
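For reference, Amdahl's law can be written down in a few lines of C. This is only a sketch of the textbook formula, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction of the run time and n the number of workers; the function name amdahl_speedup is made up for this example.

```c
#include <assert.h>

/* Amdahl's law: ideal speedup on n_workers when a fraction
   serial_fraction of the single-worker run time cannot be parallelised. */
double amdahl_speedup(double serial_fraction, int n_workers)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(n_workers >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers);
}
```

With 5% serial code, amdahl_speedup(0.05, 4) is about 3.48 on a quad-core, and the speedup can never reach 20 (that is, 1/0.05) however many cores are added.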
I have more problems with L2 cache miss events and memory bandwidth. A
quad-core means 4 times the bandwidth needed by a single process...
If your code is already a bit greedy, it does not scale up well.
I never went down to this level of optimization.
Are you using valgrind to detect cache miss events?
After gprof, usually I am done with optimization.
I would prefer to change my algorithm and would be afraid
of introducing optimizations that are architecture-dependent
into my software.
Regards,
F.