Hello,
I was running the pi demo on multicore machines (a Pentium Dual and a Core i7) to see
how it scales, but the time measurements sometimes come out quite disparate between
runs. The FAQ suggests processor affinity as one possible reason for that. For
instance, the demo (sketched further below) takes about 3 s on one core of the Pentium Dual:
$ mpirun -h
mpirun (Open MPI) 1.4.2
...
$ mpirun -c 1 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
1 2.963499 6.332712E-13
and, some of the time, approximately half that when using both cores:
$ mpirun -c 2 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.535394 2.291500E-13
$ mpirun -c 2 pi
...
2 2.497539 2.291500E-13
Say 6 out of 10 runs take 2.5 s and 4 out of 10 take 1.5 s.
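In case the structure of the code matters: I'm treating the demo as essentially the
textbook midpoint-rule integration of 4/(1+x^2), timed with MPI_Wtime and with a
single send-recv collection of the partial sums at the end (as in the real code).
A simplified sketch of what I mean (my own variable names and output format, not
necessarily the demo's):

/* Simplified sketch, not the actual demo source: each rank integrates
 * 4/(1+x^2) over its share of N midpoints, then rank 0 collects the
 * partial sums with one recv per peer (one send-recv at the end). */
#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    const long N = 100000000;               /* 1e+08 iterations */
    const double PI = 3.141592653589793;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    double h = 1.0 / (double)N, sum = 0.0;
    for (long i = rank; i < N; i += size) {  /* cyclic split of the midpoints */
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double part = h * sum;

    if (rank != 0) {
        MPI_Send(&part, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double pi = part, p;
        for (int r = 1; r < size; r++) {
            MPI_Recv(&p, 1, MPI_DOUBLE, r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            pi += p;
        }
        printf("# Estimation of pi is %f after %g iterations\n", pi, (double)N);
        printf("# PCs Time (s) Error\n");
        printf("%d %f %E\n", size, MPI_Wtime() - t0, fabs(pi - PI));
    }

    MPI_Finalize();
    return 0;
}

So apart from MPI_Init/Finalize there is essentially no communication until the very end.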
So I followed the FAQ's remarks about processor affinity (paffinity):
$ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
...
2 1.496346 2.291500E-13
$ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
...
2 2.527654 2.291500E-13
Now say 2 out of 10 runs take 2.5 s. I'm not sure that's how it's expected to work,
getting the not-bound-to-core time once in every five tries.
I guess I'm doing something wrong, so I tried to check whether the ranks are
actually being bound. It seems they are:
$ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
[nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],0] to
cpus 0001
[nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],1] to
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.536590 2.291500E-13
$ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
[nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],0] to
cpus 0001
[nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],1] to
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 2.556353 2.291500E-13
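Just to be extra sure that -report-bindings and reality agree, I suppose I could
also dump the affinity mask from inside each rank right after MPI_Init; something
along these lines (Linux-specific sketch, not part of the demo):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Print the Linux affinity mask of the calling process; the rank is
 * passed only to label the output. */
static void dump_affinity(int rank)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        printf("rank %d allowed on cpus:", rank);
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &set))
                printf(" %d", c);
        printf("\n");
    }
}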
Then I thought it might be related to ompi believing the node was oversubscribed
and thus entering Degraded Mode instead of Aggressive Mode (although this pi demo
has only one send-recv at the end of the code, so it's hard to believe a 1 s
difference is due to that). The -nooversubscribe switch is handy to quickly make
sure about it:
$ mpirun -nooversubscribe -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
If I understand correctly, slots can only be specified in a hostfile, not via
mpirun switches. Is that true?
$ cat hf
n0 slots=2 max-slots=2
$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 -report-bindings -c 3 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
...
cpd@nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
[nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],0] to
cpus 0001
[nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],1] to
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.502448 2.291500E-13
cpd@nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
[nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],0] to
cpus 0001
[nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],1] to
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 2.540400 2.291500E-13
Now it takes some 10-15 tries to get a 2.5 s run. That's much more reliable,
but I'm not sure I'm getting it right. Is this how it's expected to work?
Is it normal to get the not-bound-to-core time once every 10-15 tries? Am I
missing something else?
I see there are also rankfiles, so I tried this
$ cat rf
rank 0=n0 slot=0:0
rank 1=n0 slot=0:1
cpd@nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
[nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],0] to
slot_list 0:0
[nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],1] to
slot_list 0:1
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 2.503292 2.291500E-13
cpd@nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
[nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],0] to
slot_list 0:0
[nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],1] to
slot_list 0:1
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.546936 2.291500E-13
I got 2.5 s in 4 out of 10 runs, and about the same (well, 5 out of 10) if I also
include -hostfile hf, so it seems I am still missing something. Probably that
uncontrolled "something" is what the 2/10 vs 4/10 statistics hinge on.
And editing the files to point to n1, one of the Core i7 nodes...
cpd@nodo0:~/ompi$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 2
cpd@nodo0:~/ompi$ ssh n1 grep cores /proc/cpuinfo
cpu cores : 4
cpu cores : 4
cpu cores : 4
cpu cores : 4
$ cat hf
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ cat rf
rank 0=n1 slot=0:0
rank 1=n1 slot=0:1
rank 2=n1 slot=0:2
rank 3=n1 slot=0:3
The 3 s sequential run becomes 1.66 s on this node:
$ mpirun -H n1 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
1 1.662411 6.332712E-13
cpd@nodo0:~/ompi$ mpirun -H n1 -mca mpi_paffinity_alone 1 -report-bindings -c 1
pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
1 1.663015 6.332712E-13
Hmf, it didn't report any bindings because none were made, as I forgot to
mention the slots...
cpd@nodo0:~/ompi$ mpirun -H n1 -display-devel-map -mca mpi_paffinity_alone 1
-report-bindings -c 1 pi
Map generated by mapping policy: 0402
Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
...
Data for node: Name: n1 Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 1 Num cores/socket: 2
...
Num slots: 1 Slots in use: 1
Num slots allocated: 1 Max slots: 0
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
1 1.663060 6.332712E-13
...I forgot to use the hostfile to declare the slots, so ompi thinks I'm
oversubscribing a 1-slot node. It also thinks this is a dual-core node, like the
one I'm typing mpirun on. I can fix the core count from the command line, but not
the oversubscription (and, consequently, Degraded Mode). The 1.66 s becomes 1.84 s
with 2 ranks. Hmm, now that I think of it, that's about 1 s off the expected
~0.83 s (half of 1.66 s), the same roughly 1 s gap as between the 2.5 s and 1.5 s
runs on the Pentium Dual...
$ mpirun -H n1 -num-cores 4 -display-devel-map -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
Map generated by mapping policy: 0402
Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
...
Data for node: Name: n1 Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
...
Num slots: 1 Slots in use: 2
Num slots allocated: 1 Max slots: 0
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.843984 2.291500E-13
It is indeed being treated as oversubscribed:
$ mpirun -H n1 -nooversubscribe -num-cores 4 -display-devel-map -mca
mpi_paffinity_alone 1 -report-bindings -c 2 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
So here goes the hostfile
cpd@nodo0:~/ompi$ cat hf
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ mpirun -hostfile hf -nooversubscribe -num-cores 4
-display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
Map generated by mapping policy: 0402
Npernode: 0 Oversubscribe allowed: FALSE CPU Lists: FALSE
...
Data for node: Name: n1 Launch id: -1 Arch: 0 State: 0
Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
...
Num slots: 4 Slots in use: 2
Num slots allocated: 4 Max slots: 4
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.845798 2.291500E-13
Ouch! Where are my bindings? This did work for node n0!!!
cpd@nodo0:~/ompi$ cat hf1
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ cat hf0
n0 slots=2 max-slots=2
cpd@nodo0:~/ompi$ mpirun -hostfile hf0 -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
[nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],0] to
cpus 0001
[nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],1] to
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.529327 2.291500E-13
cpd@nodo0:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.842951 2.291500E-13
Ok, I'm moving to node n1
cpd@nodo0:~/ompi$ ssh n1
cpd@nodo1:~$ cd ompi
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1
-report-bindings -c 2 pi
[nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],0] to cpus
0001
[nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],1] to cpus
0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.841935 2.291500E-13
cpd@nodo1:~/ompi$ cat hf1
n1 slots=4 max-slots=4
cpd@nodo1:~/ompi$ grep cores /proc/cpuinfo
cpu cores : 4
cpu cores : 4
cpu cores : 4
cpu cores : 4
cpd@nodo1:~/ompi$ mpirun pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
1 1.661314 6.332712E-13
Hmf, with 2 cores I get 1.84 s 10 out of 10 times. The results below are 10/10,
perfectly reproducible without a glitch. -bind-to-core is required to get
-report-bindings to report anything. Is it different from
-mca mpi_paffinity_alone 1? I find it easier to use (shorter to type). From the
mpirun help:
-bind-to-core|--bind-to-core
Whether to bind processes to specific cores (the
default)
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 2 pi
[nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],0] to cpus
0001
[nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],1] to cpus
0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
2 1.841275 2.291500E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 3 pi
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],0] to cpus
0001
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],1] to cpus
0002
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],2] to cpus
0004
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
3 1.571870 3.606004E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],0] to cpus
0001
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],1] to cpus
0002
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],2] to cpus
0004
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],3] to cpus
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
4 1.436920 4.236611E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 5 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 5 slots
that were requested by the application:
Hmm, wait, I got this one on a second try (so it's really 1 in 20):
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],0] to cpus
0001
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],1] to cpus
0002
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],2] to cpus
0004
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],3] to cpus
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
4 2.437651 4.236611E-13
Hmm, yup, there it is again: a 1 s difference from the expected 1.43 s. No clue.
The full devel-map for the -c 4 run (double-checking with -nooversubscribe) is:
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -nooversubscribe
-report-bindings -display-devel-map -c 4 pi
Map generated by mapping policy: 0402
Npernode: 0 Oversubscribe allowed: FALSE CPU Lists: FALSE
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1
Data for node: Name: nodo1 Launch id: -1 Arch: ffc91200 State: 2
Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
Daemon: [[9168,0],0] Daemon launched: True
Num slots: 4 Slots in use: 4
Num slots allocated: 4 Max slots: 4
Username on node: NULL
Num procs: 4 Next node_rank: 4
Data for proc: [[9168,1],0]
Pid: 0 Local rank: 0 Node rank: 0
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[9168,1],1]
Pid: 0 Local rank: 1 Node rank: 1
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[9168,1],2]
Pid: 0 Local rank: 2 Node rank: 2
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[9168,1],3]
Pid: 0 Local rank: 3 Node rank: 3
State: 0 App_context: 0 Slot list: NULL
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],0] to cpus
0001
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],1] to cpus
0002
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],2] to cpus
0004
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],3] to cpus
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
4 1.452407 4.236611E-13
And this is using the rankfile
cpd@nodo1:~/ompi$ cat rf
rank 0=n1 slot=0:0
rank 1=n1 slot=0:1
rank 2=n1 slot=0:2
rank 3=n1 slot=0:3
cpd@nodo1:~/ompi$ mpirun -rf rf -report-bindings -c 4 pi
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],0] to
slot_list 0:0
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],1] to
slot_list 0:1
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],2] to
slot_list 0:2
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],3] to
slot_list 0:3
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs Time (s) Error
4 1.435335 4.236611E-13
So I thought I'd better ask for help and, following mailing-list experience, have
yet another coffee in the meantime :-)
Are these results normal? Why can't I get results on the Core i7 similar to the
Pentium Dual's? I would have expected to see about 0.83 s with -c 2 every now and
then, and very frequently when using -bind-to-core. I realize I don't have a clear
idea of when or whether hostfile/slots are required, or whether -num-cores together
with -bind-to-core should have worked fine. Is there any mpirun switch or MCA
parameter I'm missing that would give 10/10 consistent results on the Pentium Dual,
and/or let the pi application scale well on the Core i7?
Thanks in advance for any suggestions.