I normally hide my eyes when rankfiles appear, but since you provide so much 
help on this list yourself... :-)

I believe the problem is that you have the keyword "slots" wrong - it is 
supposed to be "slot":

    rank 1=host1 slot=1:0,1
    rank 0=host2 slot=0:*
    rank 2=host4 slot=1-2
    rank 3=host3 slot=0:1,1:0-2

Hence the flex parser gets confused...
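
Since the rankfile is being built on the fly from $PBS_NODEFILE, here is a minimal Python sketch of a generator that emits the "slot" keyword. It is only an illustration under stated assumptions: the function name build_rankfile is hypothetical, the node list is assumed to contain one host name per allocated core, and cores_per_socket=4 is taken from the dual quad-core nodes described in the question. Whether these socket:core indices match the numbering Open MPI uses on a given node is not something the sketch checks.

```python
from collections import defaultdict


def build_rankfile(node_lines, cores_per_socket=4):
    """Return rankfile text, one 'rank N=host slot=socket:core' line per entry.

    node_lines: iterable of host names, one per allocated core
                (the layout assumed here for $PBS_NODEFILE).
    cores_per_socket: assumed 4 (dual quad-core nodes).
    """
    placed = defaultdict(int)  # ranks already placed on each host
    lines = []
    hosts = (h.strip() for h in node_lines if h.strip())
    for rank, host in enumerate(hosts):
        index = placed[host]
        placed[host] += 1
        # Fill socket 0 first, then socket 1, in allocation order.
        socket, core = divmod(index, cores_per_socket)
        lines.append(f"rank {rank}={host} slot={socket}:{core}")
    return "\n".join(lines) + "\n"
```

It could be fed with the lines of the file named by the PBS_NODEFILE environment variable, and the result written to the file passed to -rf.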

I didn't write this code, but it seems to me that a little more leeway (e.g., 
allowing "slots" as well as "slot") would be more appropriate. If you try the 
revision and it works, I'll submit a change to accept both syntax options.
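
To illustrate what accepting both keywords could look like, here is a small Python sketch. It is not the Open MPI flex grammar, which is not shown in this thread; the pattern and the function name parse_rank_line are hypothetical, and the slot specification is returned as an uninterpreted string.

```python
import re

# "slots?" matches both "slot" and "slots", which is the leeway suggested above.
_RANK_LINE = re.compile(
    r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slots?\s*=\s*(\S+)\s*$"
)


def parse_rank_line(line):
    """Return (rank, host, slot_spec) for one rankfile line.

    Raises ValueError if the line does not look like a rankfile entry.
    """
    match = _RANK_LINE.match(line)
    if match is None:
        raise ValueError(f"not a rankfile line: {line!r}")
    rank, host, spec = match.groups()
    return int(rank), host, spec
```

With this pattern, "rank 1=host1 slot=1:0,1" and "rank   0=node34 slots=0" both parse, while a line with a misspelled keyword raises ValueError.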

On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:

> Dear Open MPI pros
> 
> I am having trouble getting the mpiexec rankfile option right.
> I would appreciate any help to solve the problem.
> 
> Also, is there a way to tell Open MPI to print out its own numbering
> of the "slots", and perhaps how they're mapped to the socket:core pair?
> 
> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
> on Linux CentOS 5.2 x86_64.
> This cluster has nodes with dual AMD Opteron quad-core processors,
> a total of 8 cores per node.
> I enclose a snippet of /proc/cpuinfo below.
> 
> I build the rankfile on the fly from the $PBS_NODEFILE.
> The mpiexec command line is:
> 
> mpiexec \
>        -v \
>        -np ${NP} \
>        -mca btl openib,sm,self \
>        -tag-output \
>        -report-bindings \
>        -rf $my_rf \
>        -mca paffinity_base_verbose 1 \
>        connectivity_c -v
> 
> 
> I tried two different ways to specify the slots on the rankfile:
> 
> *First way (sequential "slots" on each node):
> 
> rank   0=node34 slots=0
> rank   1=node34 slots=1
> rank   2=node34 slots=2
> rank   3=node34 slots=3
> rank   4=node34 slots=4
> rank   5=node34 slots=5
> rank   6=node34 slots=6
> rank   7=node34 slots=7
> rank   8=node33 slots=0
> rank   9=node33 slots=1
> rank  10=node33 slots=2
> rank  11=node33 slots=3
> rank  12=node33 slots=4
> rank  13=node33 slots=5
> rank  14=node33 slots=6
> rank  15=node33 slots=7
> 
> 
> *Second way (slots in socket:core style):
> 
> rank   0=node34 slots=0:0
> rank   1=node34 slots=0:1
> rank   2=node34 slots=0:2
> rank   3=node34 slots=0:3
> rank   4=node34 slots=1:0
> rank   5=node34 slots=1:1
> rank   6=node34 slots=1:2
> rank   7=node34 slots=1:3
> rank   8=node33 slots=0:0
> rank   9=node33 slots=0:1
> rank  10=node33 slots=0:2
> rank  11=node33 slots=0:3
> rank  12=node33 slots=1:0
> rank  13=node33 slots=1:1
> rank  14=node33 slots=1:2
> rank  15=node33 slots=1:3
> 
> ***
> 
> I get the error messages below.
> I am scratching my head to full baldness to try to understand them.
> 
> They seem to suggest that my rankfile syntax is wrong
> (which I copied from the FAQ and man mpiexec), or that it is not being
> parsed as I expected.
> Or is it perhaps that it doesn't like the numbers I am using for the
> various slots in the rankfile?
> The error messages also complain about
> node allocation or oversubscribed slots,
> but the nodes were allocated by Torque, and the rankfiles were
> written with no intent to oversubscribe.
> 
> *First rankfile error:
> 
> --------------------------------------------------------------------------
> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
> 
> --------------------------------------------------------------------------
> 
> ... etc, etc ...
> 
> *Second rankfile error:
> 
> --------------------------------------------------------------------------
> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
> 
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> ... etc, etc ...
> 
> **********
> 
> I am stuck.
> Any help is much appreciated.
> Thank you.
> 
> Gus Correa
> 
> 
> 
> *****************************
> Snippet of /proc/cpuinfo
> *****************************
> 
> processor     : 0
> physical id   : 0
> core id       : 0
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 1
> physical id   : 0
> core id       : 1
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 2
> physical id   : 0
> core id       : 2
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 3
> physical id   : 0
> core id       : 3
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 4
> physical id   : 1
> core id       : 0
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 5
> physical id   : 1
> core id       : 1
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 6
> physical id   : 1
> core id       : 2
> siblings      : 4
> cpu cores     : 4
> 
> processor     : 7
> physical id   : 1
> core id       : 3
> siblings      : 4
> cpu cores     : 4
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

