On Jul 26, 2011, at 3:56 PM, Gus Correa wrote:

> Thank you very much, Ralph.
>
> Heck, it had to be something stupid like this.
> Sorry for taking your time.
> Yes, switching from "slots" to "slot" fixes the rankfile problem,
> and both cases work.
>
> I must have been carried along by the hostfile syntax,
> where the "slots" reign, but when it comes to binding,
> obviously for each process rank one wants a single "slot"
> (unless the process is multi-threaded, which is what I need to set up).
>
> I will write 100 times on the blackboard:
> "Slots in the hostfile, slot in the rankfile,
> slot is singular, to err is plural."
LOL

> ... at least until Ralph's new plural-forgiving parsing rule
> makes it to the code.

Committed to the trunk, in the queue for both 1.4.4 and 1.5.4.

> Regards,
> Gus Correa
>
>
> Ralph Castain wrote:
>> I normally hide my eyes when rankfiles appear,
>> but since you provide so much help on this list yourself... :-)
>>
>> I believe the problem is that you have the keyword "slots" wrong -
>> it is supposed to be "slot":
>>
>> rank 1=host1 slot=1:0,1
>> rank 0=host2 slot=0:*
>> rank 2=host4 slot=1-2
>> rank 3=host3 slot=0:1,1:0-2
>>
>> Hence the flex parser gets confused...
>>
>> I didn't write this code, but it seems to me that a little more leeway
>> (e.g., allowing "slots" as well as "slot") would be more appropriate. If you
>> try the revision and it works, I'll submit a change to accept both syntax
>> options.
>>
>> On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:
>>
>>> Dear Open MPI pros,
>>>
>>> I am having trouble getting the mpiexec rankfile option right.
>>> I would appreciate any help to solve the problem.
>>>
>>> Also, is there a way to tell Open MPI to print out its own numbering
>>> of the "slots", and perhaps how they're mapped to the socket:core pairs?
>>>
>>> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
>>> on Linux CentOS 5.2 x86_64.
>>> This cluster has nodes with dual AMD Opteron quad-core processors,
>>> a total of 8 cores per node.
>>> I enclose a snippet of /proc/cpuinfo below.
>>>
>>> I build the rankfile on the fly from the $PBS_NODEFILE.
>>> The mpiexec command line is:
>>>
>>> mpiexec \
>>>   -v \
>>>   -np ${NP} \
>>>   -mca btl openib,sm,self \
>>>   -tag-output \
>>>   -report-bindings \
>>>   -rf $my_rf \
>>>   -mca paffinity_base_verbose 1 \
>>>   connectivity_c -v
>>>
>>> I tried two different ways to specify the slots in the rankfile.
>>>
>>> * First way (sequential "slots" on each node):
>>>
>>> rank 0=node34 slots=0
>>> rank 1=node34 slots=1
>>> rank 2=node34 slots=2
>>> rank 3=node34 slots=3
>>> rank 4=node34 slots=4
>>> rank 5=node34 slots=5
>>> rank 6=node34 slots=6
>>> rank 7=node34 slots=7
>>> rank 8=node33 slots=0
>>> rank 9=node33 slots=1
>>> rank 10=node33 slots=2
>>> rank 11=node33 slots=3
>>> rank 12=node33 slots=4
>>> rank 13=node33 slots=5
>>> rank 14=node33 slots=6
>>> rank 15=node33 slots=7
>>>
>>> * Second way (slots in socket:core style):
>>>
>>> rank 0=node34 slots=0:0
>>> rank 1=node34 slots=0:1
>>> rank 2=node34 slots=0:2
>>> rank 3=node34 slots=0:3
>>> rank 4=node34 slots=1:0
>>> rank 5=node34 slots=1:1
>>> rank 6=node34 slots=1:2
>>> rank 7=node34 slots=1:3
>>> rank 8=node33 slots=0:0
>>> rank 9=node33 slots=0:1
>>> rank 10=node33 slots=0:2
>>> rank 11=node33 slots=0:3
>>> rank 12=node33 slots=1:0
>>> rank 13=node33 slots=1:1
>>> rank 14=node33 slots=1:2
>>> rank 15=node33 slots=1:3
>>>
>>> ***
>>>
>>> I get the error messages below.
>>> I am scratching my head to full baldness trying to understand them.
>>>
>>> They seem to suggest that my rankfile syntax is wrong
>>> (which I copied from the FAQ and man mpiexec), or that Open MPI is
>>> not parsing it as I expected.
>>> Or is it perhaps that it doesn't like the numbers I am using for the
>>> various slots in the rankfile?
>>> The error messages also complain about
>>> node allocation or oversubscribed slots,
>>> but the nodes were allocated by Torque, and the rankfiles were
>>> written with no intent to oversubscribe.
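To make the fix concrete: as confirmed at the top of this thread, the first
file above works once the keyword is made singular. A minimal sketch, using
the same node names:

rank 0=node34 slot=0
rank 1=node34 slot=1
rank 2=node34 slot=2
rank 3=node34 slot=3
rank 4=node34 slot=4
rank 5=node34 slot=5
rank 6=node34 slot=6
rank 7=node34 slot=7
rank 8=node33 slot=0
rank 9=node33 slot=1
rank 10=node33 slot=2
rank 11=node33 slot=3
rank 12=node33 slot=4
rank 13=node33 slot=5
rank 14=node33 slot=6
rank 15=node33 slot=7

The socket:core form changes the same way, e.g. "rank 0=node34 slot=0:0"
through "rank 15=node33 slot=1:3".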
>>>
>>> * First rankfile error:
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>> --------------------------------------------------------------------------
>>>
>>> ... etc, etc ...
>>>
>>> * Second rankfile error:
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's
>>> slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> launch so we are aborting.
>>>
>>> ... etc, etc ...
>>>
>>> **********
>>>
>>> I am stuck.
>>> Any help is much appreciated.
>>> Thank you.
>>>
>>> Gus Correa
>>>
>>> *****************************
>>> Snippet of /proc/cpuinfo
>>> *****************************
>>>
>>> processor   : 0
>>> physical id : 0
>>> core id     : 0
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 1
>>> physical id : 0
>>> core id     : 1
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 2
>>> physical id : 0
>>> core id     : 2
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 3
>>> physical id : 0
>>> core id     : 3
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 4
>>> physical id : 1
>>> core id     : 0
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 5
>>> physical id : 1
>>> core id     : 1
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 6
>>> physical id : 1
>>> core id     : 2
>>> siblings    : 4
>>> cpu cores   : 4
>>>
>>> processor   : 7
>>> physical id : 1
>>> core id     : 3
>>> siblings    : 4
>>> cpu cores   : 4
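For anyone who, like Gus, wants to build the rankfile on the fly from
$PBS_NODEFILE: below is a minimal sketch (not from the original posts) of one
way to script it. It assumes the Torque nodefile lists each node name once per
core allocated to the job, and the output path "my_rankfile" is just a
placeholder for whatever file you pass to mpiexec via "-rf".

#!/usr/bin/env python
# Hypothetical sketch: build an Open MPI rankfile from the Torque nodefile.
# Assumes $PBS_NODEFILE lists each node once per core allocated to the job.
import os

ranks = []   # (rank, node, local slot) triples, in nodefile order
count = {}   # cores seen so far on each node
with open(os.environ["PBS_NODEFILE"]) as f:
    for line in f:
        node = line.strip()
        if not node:
            continue
        local = count.get(node, 0)          # next free slot index on this node
        count[node] = local + 1
        ranks.append((len(ranks), node, local))

# One line per MPI rank, with the singular "slot" keyword
# (the plural form is rejected by the 1.4.3 rankfile parser).
with open("my_rankfile", "w") as out:       # placeholder name, passed via -rf
    for rank, node, local in ranks:
        out.write("rank %d=%s slot=%d\n" % (rank, node, local))

Running mpiexec with the same "-rf $my_rf -report-bindings" options shown
above is an easy way to check where each rank actually lands.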