I normally hide my eyes when rankfiles appear, but since you provide so much help on this list yourself... :-)
I believe the problem is that you have the keyword "slots" wrong - it is supposed to be "slot":

   rank 1=host1 slot=1:0,1
   rank 0=host2 slot=0:*
   rank 2=host4 slot=1-2
   rank 3=host3 slot=0:1,1:0-2

Hence the flex parser gets confused...

I didn't write this code, but it seems to me that a little more leeway (e.g., allowing "slots" as well as "slot") would be more appropriate. If you try the revised syntax and it works, I'll submit a change to accept both options.
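In other words, your first rankfile should parse once the keyword is corrected. Untested on my end, so treat this as a sketch, but it would read:

   rank 0=node34 slot=0
   rank 1=node34 slot=1
   rank 2=node34 slot=2
   rank 3=node34 slot=3
   rank 4=node34 slot=4
   rank 5=node34 slot=5
   rank 6=node34 slot=6
   rank 7=node34 slot=7
   rank 8=node33 slot=0
   rank 9=node33 slot=1
   rank 10=node33 slot=2
   rank 11=node33 slot=3
   rank 12=node33 slot=4
   rank 13=node33 slot=5
   rank 14=node33 slot=6
   rank 15=node33 slot=7

The same one-character fix applies to your socket:core variant, e.g. "rank 0=node34 slot=0:0".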
On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:

> Dear Open MPI pros
>
> I am having trouble getting the mpiexec rankfile option right.
> I would appreciate any help in solving the problem.
>
> Also, is there a way to tell Open MPI to print out its own numbering
> of the "slots", and perhaps how they're mapped to the socket:core pairs?
>
> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
> on Linux CentOS 5.2 x86_64.
> This cluster has nodes with dual AMD Opteron quad-core processors,
> a total of 8 cores per node.
> I enclose a snippet of /proc/cpuinfo below.
>
> I build the rankfile on the fly from $PBS_NODEFILE.
> The mpiexec command line is:
>
> mpiexec \
>    -v \
>    -np ${NP} \
>    -mca btl openib,sm,self \
>    -tag-output \
>    -report-bindings \
>    -rf $my_rf \
>    -mca paffinity_base_verbose 1 \
>    connectivity_c -v
>
> I tried two different ways to specify the slots in the rankfile:
>
> * First way (sequential "slots" on each node):
>
> rank 0=node34 slots=0
> rank 1=node34 slots=1
> rank 2=node34 slots=2
> rank 3=node34 slots=3
> rank 4=node34 slots=4
> rank 5=node34 slots=5
> rank 6=node34 slots=6
> rank 7=node34 slots=7
> rank 8=node33 slots=0
> rank 9=node33 slots=1
> rank 10=node33 slots=2
> rank 11=node33 slots=3
> rank 12=node33 slots=4
> rank 13=node33 slots=5
> rank 14=node33 slots=6
> rank 15=node33 slots=7
>
> * Second way (slots in socket:core style):
>
> rank 0=node34 slots=0:0
> rank 1=node34 slots=0:1
> rank 2=node34 slots=0:2
> rank 3=node34 slots=0:3
> rank 4=node34 slots=1:0
> rank 5=node34 slots=1:1
> rank 6=node34 slots=1:2
> rank 7=node34 slots=1:3
> rank 8=node33 slots=0:0
> rank 9=node33 slots=0:1
> rank 10=node33 slots=0:2
> rank 11=node33 slots=0:3
> rank 12=node33 slots=1:0
> rank 13=node33 slots=1:1
> rank 14=node33 slots=1:2
> rank 15=node33 slots=1:3
>
> ***
>
> I get the error messages below.
> I am scratching my head to full baldness trying to understand them.
>
> They seem to suggest that my rankfile syntax is wrong
> (which I copied from the FAQ and man mpiexec), or that Open MPI is not
> parsing it as I expected.
> Or is it perhaps that it doesn't like the numbers I am using for the
> various slots in the rankfile?
> The error messages also complain about
> node allocation or oversubscribed slots,
> but the nodes were allocated by Torque, and the rankfiles were
> written with no intent to oversubscribe.
>
> * First rankfile error:
>
> --------------------------------------------------------------------------
> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
> --------------------------------------------------------------------------
>
> ... etc, etc ...
>
> * Second rankfile error:
>
> --------------------------------------------------------------------------
> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> ... etc, etc ...
>
> **********
>
> I am stuck.
> Any help is much appreciated.
> Thank you.
>
> Gus Correa
>
> *****************************
> Snippet of /proc/cpuinfo
> *****************************
>
> processor : 0
> physical id : 0
> core id : 0
> siblings : 4
> cpu cores : 4
>
> processor : 1
> physical id : 0
> core id : 1
> siblings : 4
> cpu cores : 4
>
> processor : 2
> physical id : 0
> core id : 2
> siblings : 4
> cpu cores : 4
>
> processor : 3
> physical id : 0
> core id : 3
> siblings : 4
> cpu cores : 4
>
> processor : 4
> physical id : 1
> core id : 0
> siblings : 4
> cpu cores : 4
>
> processor : 5
> physical id : 1
> core id : 1
> siblings : 4
> cpu cores : 4
>
> processor : 6
> physical id : 1
> core id : 2
> siblings : 4
> cpu cores : 4
>
> processor : 7
> physical id : 1
> core id : 3
> siblings : 4
> cpu cores : 4
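P.S. Since you build the rankfile on the fly from $PBS_NODEFILE, here is a minimal sketch of a generator with the corrected keyword. It is untested and makes assumptions: Torque's usual nodefile layout (one line per core, with a node's duplicate lines adjacent), one rank per core in nodefile order, and an illustrative output filename - adjust to your script.

   #!/bin/sh
   # Sketch: emit one "rank N=node slot=S" line per nodefile entry.
   # Assumes duplicates for a node are adjacent in $PBS_NODEFILE.
   # Note the keyword is "slot", not "slots".
   my_rf=rankfile.$PBS_JOBID
   rank=0
   slot=0
   prev=""
   : > $my_rf                      # truncate any previous rankfile
   while read node; do
       if [ "$node" != "$prev" ]; then
           slot=0                  # new node: restart slot numbering
           prev=$node
       fi
       echo "rank $rank=$node slot=$slot" >> $my_rf
       rank=$((rank+1))
       slot=$((slot+1))
   done < $PBS_NODEFILE

With two 8-core nodes allocated in the order node34, node33, this produces exactly your first rankfile (ranks 0-7 on node34, 8-15 on node33), which you can then pass via -rf as you already do.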