Hi guys,

I'm benchmarking our (well tested) parallel code on an AMD-based system featuring two AMD Opteron(TM) 6276 processors with 16 cores each, for a total of 32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.
When I run a single-core job the performance is as expected. However, when I run with 32 processes the performance drops to about 60% of what other systems achieve on the exact same problem, so this is not a code scaling issue. I suspect this has to do with core binding / NUMA placement, but I haven't been able to get any improvement out of the bind-* mpirun options (see the P.P.S. below for the kind of invocations I've tried).

Any suggestions?

Thanks in advance,
Ricardo

---
Ricardo Fonseca

Associate Professor
GoLP - Grupo de Lasers e Plasmas
Instituto de Plasmas e Fusão Nuclear
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

tel: +351 21 8419202
fax: +351 21 8464455
web: http://golp.ist.utl.pt/

P.S: Here's the output of lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               2300.045
BogoMIPS:              4599.38
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     16,18,20,22,24,26,28,30
NUMA node2 CPU(s):     1,3,5,7,9,11,13,15
NUMA node3 CPU(s):     17,19,21,23,25,27,29,31
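P.P.S: For reference, these are the kinds of invocations I've been trying (a sketch only; ./parallel_code is a placeholder for our binary, and the flags are the Open MPI 1.4-series processor affinity options):

    # Lay ranks out core by core and bind each rank to one core;
    # --report-bindings prints the binding chosen for each rank.
    mpirun -np 32 --bycore --bind-to-core --report-bindings ./parallel_code

    # Alternative: map ranks round-robin across the sockets and bind
    # each rank to its whole socket instead of a single core.
    mpirun -np 32 --bysocket --bind-to-socket --report-bindings ./parallel_code

    # Sanity check from another shell: show the CPU affinity mask of a
    # running rank (<pid> left as a placeholder).
    taskset -cp <pid>

Neither variant made any difference to the 32-process numbers for me.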