Hi guys

I'm benchmarking our (well-tested) parallel code on an AMD-based system, 
featuring 2x AMD Opteron(TM) Processor 6276 CPUs, with 16 cores each for a 
total of 32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.

When I run a single-core job the performance is as expected. However, when I 
run with 32 processes the performance drops to about 60% of what other systems 
achieve on the exact same problem, so this is not a code-scaling issue. I 
suspect this has to do with core binding / NUMA, but I haven't been able to 
get any improvement out of the bind-* mpirun options.
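
For reference, these are the kinds of invocations I've been experimenting 
with (myapp is just a placeholder for our binary; --report-bindings makes 
mpirun print the resulting map so you can see where each rank actually lands):

  # bind each rank to a single core, distributing ranks round-robin
  # across the two sockets
  $ mpirun -np 32 --bind-to-core --bysocket --report-bindings ./myapp

  # coarser alternative: bind each rank to a whole socket instead
  $ mpirun -np 32 --bind-to-socket --bysocket --report-bindings ./myapp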

Any suggestions?

Thanks in advance,
Ricardo

P.S.: Here's the output of lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               2300.045
BogoMIPS:              4599.38
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     16,18,20,22,24,26,28,30
NUMA node2 CPU(s):     1,3,5,7,9,11,13,15
NUMA node3 CPU(s):     17,19,21,23,25,27,29,31
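
(Note the interleaved numbering: CPUs 0-15 alternate between NUMA nodes 0 
and 2, so ranks placed on consecutive core IDs land on different memory 
nodes. One thing I still want to try is forcing local memory allocation on 
top of the core binding, along the lines of

  $ mpirun -np 32 --bind-to-core numactl --localalloc ./myapp

but I'm not sure that's the right approach either.)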

---
Ricardo Fonseca

Associate Professor
GoLP - Grupo de Lasers e Plasmas
Instituto de Plasmas e Fusão Nuclear
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

tel: +351 21 8419202
fax: +351 21 8464455
web: http://golp.ist.utl.pt/
