Dear Open MPI pros
I am having trouble getting the mpiexec rankfile option right.
I would appreciate any help to solve the problem.
Also, is there a way to tell Open MPI to print out its own numbering
of the "slots", and perhaps how they're mapped to socket:core pairs?
I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
on Linux CentOS 5.2 x86_64.
This cluster has nodes with dual AMD Opteron quad-core processors,
a total of 8 cores per node.
I enclose a snippet of /proc/cpuinfo below.
I build the rankfile on the fly from the $PBS_NODEFILE.
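The script that builds the rankfile is not shown in this message. As a minimal sketch of the idea only, assuming the node file lists each host once per allocated core (as $PBS_NODEFILE normally does), something like the following would produce the first rankfile layout shown further down. The function name make_rankfile is hypothetical. The keyword is left as a parameter because the rankfile examples in the mpirun man page appear to spell it `slot=`, which may be worth checking against the installed version.

```shell
#!/bin/sh
# Sketch only: make_rankfile is a hypothetical helper, not the original script.
# Usage: make_rankfile NODEFILE [KEYWORD]
# Assumes NODEFILE has one host name per line, repeated once per allocated core.
make_rankfile() {
    nodefile=$1
    keyword=${2:-slots}   # "slots" mirrors this post; the man page examples use "slot"
    awk -v kw="$keyword" '
        NF {
            host = $1
            # rank follows line order; the slot index is the number of
            # times this host has already been seen
            printf "rank %d=%s %s=%d\n", rank++, host, kw, seen[host]++
        }' "$nodefile"
}
```

It could be called as `make_rankfile "$PBS_NODEFILE" > "$my_rf"` before the mpiexec line.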
The mpiexec command line is:
mpiexec \
-v \
-np ${NP} \
-mca btl openib,sm,self \
-tag-output \
-report-bindings \
-rf $my_rf \
-mca paffinity_base_verbose 1 \
connectivity_c -v
I tried two different ways to specify the slots on the rankfile:
*First way (sequential "slots" on each node):
rank 0=node34 slots=0
rank 1=node34 slots=1
rank 2=node34 slots=2
rank 3=node34 slots=3
rank 4=node34 slots=4
rank 5=node34 slots=5
rank 6=node34 slots=6
rank 7=node34 slots=7
rank 8=node33 slots=0
rank 9=node33 slots=1
rank 10=node33 slots=2
rank 11=node33 slots=3
rank 12=node33 slots=4
rank 13=node33 slots=5
rank 14=node33 slots=6
rank 15=node33 slots=7
*Second way (slots in socket:core style):
rank 0=node34 slots=0:0
rank 1=node34 slots=0:1
rank 2=node34 slots=0:2
rank 3=node34 slots=0:3
rank 4=node34 slots=1:0
rank 5=node34 slots=1:1
rank 6=node34 slots=1:2
rank 7=node34 slots=1:3
rank 8=node33 slots=0:0
rank 9=node33 slots=0:1
rank 10=node33 slots=0:2
rank 11=node33 slots=0:3
rank 12=node33 slots=1:0
rank 13=node33 slots=1:1
rank 14=node33 slots=1:2
rank 15=node33 slots=1:3
***
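The two layouts above are meant to describe the same placement. The correspondence can be sketched as below, assuming cores are numbered consecutively within each socket, four per socket, as in the /proc/cpuinfo snippet at the end of this message. Whether Open MPI numbers its slots the same way is exactly the open question, so this is an assumption, and slot_to_socket_core is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: translate a sequential slot number into socket:core notation.
# Assumes consecutive core numbering within each socket (an assumption about
# this hardware, not something Open MPI is known to guarantee).
# Usage: slot_to_socket_core SLOT [CORES_PER_SOCKET]
slot_to_socket_core() {
    cpu=$1
    cores_per_socket=${2:-4}   # quad-core sockets on these nodes
    echo "$((cpu / cores_per_socket)):$((cpu % cores_per_socket))"
}
```

With these assumptions, slot 5 corresponds to 1:1, which is what rank 5 uses in the second rankfile.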
I get the error messages below.
I am scratching my head to full baldness to try to understand them.
They seem to suggest that my rankfile syntax is wrong
(I copied it from the FAQ and man mpiexec), or that Open MPI is not
parsing it the way I expected.
Or is it perhaps that it doesn't like the numbers I am using for the
various slots in the rankfile?
The error messages also complain about
node allocation or oversubscribed slots,
but the nodes were allocated by Torque, and the rankfiles were
written with no intent to oversubscribe.
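One way to double-check the "not allocated or oversubscribed" part independently of Open MPI is to compare the rankfile against the node file. The sketch below is only an assumption about how such a check could be written: check_rankfile is a hypothetical helper, and it expects rankfile lines in the `rank N=host ...` form used in this message.

```shell
#!/bin/sh
# Sketch only: check_rankfile is a hypothetical helper.
# Usage: check_rankfile NODEFILE RANKFILE
# Exits non-zero if any host in RANKFILE is given more ranks than it has
# entries in NODEFILE (a host absent from NODEFILE counts as 0 allocated).
check_rankfile() {
    awk '
        NR == FNR { if (NF) alloc[$1]++; next }   # first file: allocated cores per host
        /^rank/ {
            split($2, a, "=")                     # $2 looks like "N=host"
            used[a[2]]++
        }
        END {
            bad = 0
            for (h in used)
                if (used[h] > alloc[h]) {
                    printf "%s: %d ranks requested, %d slots allocated\n", h, used[h], alloc[h]
                    bad = 1
                }
            exit bad
        }' "$1" "$2"
}
```

It would be run as `check_rankfile "$PBS_NODEFILE" "$my_rf"` inside the job script, before mpiexec.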
*First rankfile error:
--------------------------------------------------------------------------
Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
Please review your rank-slot assignments and your host allocation to ensure
a proper match.
--------------------------------------------------------------------------
... etc, etc ...
*Second rankfile error:
--------------------------------------------------------------------------
Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
Please review your rank-slot assignments and your host allocation to ensure
a proper match.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
... etc, etc ...
**********
I am stuck.
Any help is much appreciated.
Thank you.
Gus Correa
*****************************
Snippet of /proc/cpuinfo
*****************************
processor : 0
physical id : 0
core id : 0
siblings : 4
cpu cores : 4
processor : 1
physical id : 0
core id : 1
siblings : 4
cpu cores : 4
processor : 2
physical id : 0
core id : 2
siblings : 4
cpu cores : 4
processor : 3
physical id : 0
core id : 3
siblings : 4
cpu cores : 4
processor : 4
physical id : 1
core id : 0
siblings : 4
cpu cores : 4
processor : 5
physical id : 1
core id : 1
siblings : 4
cpu cores : 4
processor : 6
physical id : 1
core id : 2
siblings : 4
cpu cores : 4
processor : 7
physical id : 1
core id : 3
siblings : 4
cpu cores : 4
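
The snippet above was trimmed by hand. As a sketch, the same processor to socket:core mapping could be extracted directly with awk, assuming the fields appear in the order processor, physical id, core id within each block, as they do above. This shows the kernel's numbering only, which may or may not match the numbering Open MPI uses internally, and cpu_map is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: cpu_map is a hypothetical helper.
# Usage: cpu_map [FILE]    (FILE defaults to /proc/cpuinfo)
# Prints one line per logical processor with its socket:core pair.
cpu_map() {
    awk -F'[ \t]*:[ \t]*' '
        $1 == "processor"   { cpu = $2 }
        $1 == "physical id" { socket = $2 }
        # "core id" comes after "physical id" in each block, so print here
        $1 == "core id"     { printf "cpu %d -> socket:core %d:%d\n", cpu, socket, $2 }
    ' "${1:-/proc/cpuinfo}"
}
```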