Hi,
I'm benchmarking our (well-tested) parallel code on an AMD-based system,
featuring 2x AMD Opteron(TM) Processor 6276, with 16 cores each for a total of
32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.
When I run a single-core job, the performance is as expected. However, when I
run with 32 processes, the performance drops to about 60%.
Be aware that on AMD CPUs based on the Bulldozer/Interlagos architecture, the
two cores of each module share a single FPU. There is also a problem with
cross-cache invalidations [1] in earlier kernel versions - be sure to
use an up-to-date kernel (2.6.32-220.7.1).
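
A rough way to check both points is sketched below in Python. It is only an
illustration: the function names are made up for this example, and it assumes
that core IDs are numbered consecutively within a module (0 and 1 in the first
module, 2 and 3 in the second, and so on), which you should verify on your
machine, e.g. with hwloc's lstopo. The kernel check compares only the leading
numeric fields of the release string.

```python
import platform
import re


def one_core_per_module(total_cores, cores_per_module=2):
    """Return the first core ID of every module.

    ASSUMPTION: sibling cores of a module have consecutive IDs.
    This is not guaranteed; check the real topology first.
    """
    return list(range(0, total_cores, cores_per_module))


def kernel_tuple(release):
    """Turn '2.6.32-220.7.1.el6.x86_64' into (2, 6, 32, 220, 7, 1)."""
    numbers = []
    for field in re.split(r"[.-]", release):
        # Stop at the first non-numeric field such as 'el6'.
        if not field.isdigit():
            break
        numbers.append(int(field))
    return tuple(numbers)


def kernel_at_least(release, minimum="2.6.32-220.7.1"):
    """True if the release is at least the kernel named in this thread."""
    return kernel_tuple(release) >= kernel_tuple(minimum)


if __name__ == "__main__":
    # 32 cores in total, so 16 modules with one FPU each.
    print("cores to try:", one_core_per_module(32))
    print("kernel recent enough:", kernel_at_least(platform.release()))
```

Running the benchmark with 16 processes bound to the listed cores (using
whatever binding mechanism your MPI launcher offers) and comparing against the
32-process run should show how much of the drop comes from FPU sharing.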
Cheers,
Nico
[1] http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf