On Jul 26, 2011, at 3:56 PM, Gus Correa wrote:

> Thank you very much, Ralph.
> 
> Heck, it had to be something stupid like this.
> Sorry for taking your time.
> Yes, switching from "slots" to "slot" fixes the rankfile problem,
> and both cases work.
> 
> I must have been carried along by the hostfile syntax,
> where "slots" reigns, but when it comes to binding,
> each process rank obviously wants a single "slot"
> (unless the process is multi-threaded, which is what I need to set up).
> 
> I will write 100 times on the blackboard:
> "Slots in the hostfile, slot in the rankfile,
> slot is singular, to err is plural."

LOL

> ... at least until Ralph's new plural-forgiving parsing rule
> makes it to the code.

Committed to the trunk, in the queue for both 1.4.4 and 1.5.4.
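
Until those releases are out, existing rankfiles can be patched with a
one-liner along these lines (just a sketch; "my_rankfile" stands in for
the actual file name):

    # replace the rejected "slots" keyword with "slot", in place (GNU sed)
    sed -i 's/ slots=/ slot=/' my_rankfile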

> 
> Regards,
> Gus Correa
> 
> Ralph Castain wrote:
>> I normally hide my eyes when rankfiles appear,
>> but since you provide so much help on this list yourself... :-)
>> 
>> I believe the problem is that you have the keyword "slots" wrong -
>> it is supposed to be "slot":
>>    rank 1=host1 slot=1:0,1
>>    rank 0=host2 slot=0:*
>>    rank 2=host4 slot=1-2
>>    rank 3=host3 slot=0:1,1:0-2
>> Hence the flex parser gets confused...
>> 
>> I didn't write this code, but it seems to me that a little more leeway
>> (e.g., allowing "slots" as well as "slot") would be more appropriate. If you
>> try the revised rankfile and it works, I'll submit a change to accept both
>> syntax options.
>> On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:
>>> Dear Open MPI pros
>>> 
>>> I am having trouble getting the mpiexec rankfile option right.
>>> I would appreciate any help to solve the problem.
>>> 
>>> Also, is there a way to tell Open MPI to print out its own numbering
>>> of the "slots", and perhaps how they are mapped to socket:core pairs?
>>> 
>>> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
>>> on Linux CentOS 5.2 x86_64.
>>> This cluster has nodes with dual AMD Opteron quad-core processors,
>>> a total of 8 cores per node.
>>> I enclose a snippet of /proc/cpuinfo below.
>>> 
>>> I build the rankfile on the fly from the $PBS_NODEFILE
>>> (see the sketch after the command line).
>>> The mpiexec command line is:
>>> 
>>> mpiexec \
>>>     -v \
>>>     -np ${NP} \
>>>     -mca btl openib,sm,self \
>>>     -tag-output \
>>>     -report-bindings \
>>>     -rf $my_rf \
>>>     -mca paffinity_base_verbose 1 \
>>>     connectivity_c -v
>>> 
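>>> For reference, the rankfile is generated roughly like this (a
>>> simplified sketch of the real script; variable names are placeholders):
>>> 
>>>    my_rf=rankfile.$PBS_JOBID
>>>    rank=0
>>>    # write one "rank N=node slots=S" line per core, 8 cores per node
>>>    for node in `sort -u $PBS_NODEFILE`; do
>>>        for slot in 0 1 2 3 4 5 6 7; do
>>>            echo "rank   ${rank}=${node} slots=${slot}" >> $my_rf
>>>            rank=$((rank+1))
>>>        done
>>>    done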
>>> 
>>> I tried two different ways to specify the slots on the rankfile:
>>> 
>>> *First way (sequential "slots" on each node):
>>> 
>>> rank   0=node34 slots=0
>>> rank   1=node34 slots=1
>>> rank   2=node34 slots=2
>>> rank   3=node34 slots=3
>>> rank   4=node34 slots=4
>>> rank   5=node34 slots=5
>>> rank   6=node34 slots=6
>>> rank   7=node34 slots=7
>>> rank   8=node33 slots=0
>>> rank   9=node33 slots=1
>>> rank  10=node33 slots=2
>>> rank  11=node33 slots=3
>>> rank  12=node33 slots=4
>>> rank  13=node33 slots=5
>>> rank  14=node33 slots=6
>>> rank  15=node33 slots=7
>>> 
>>> 
>>> *Second way (slots in socket:core style):
>>> 
>>> rank   0=node34 slots=0:0
>>> rank   1=node34 slots=0:1
>>> rank   2=node34 slots=0:2
>>> rank   3=node34 slots=0:3
>>> rank   4=node34 slots=1:0
>>> rank   5=node34 slots=1:1
>>> rank   6=node34 slots=1:2
>>> rank   7=node34 slots=1:3
>>> rank   8=node33 slots=0:0
>>> rank   9=node33 slots=0:1
>>> rank  10=node33 slots=0:2
>>> rank  11=node33 slots=0:3
>>> rank  12=node33 slots=1:0
>>> rank  13=node33 slots=1:1
>>> rank  14=node33 slots=1:2
>>> rank  15=node33 slots=1:3
>>> 
>>> ***
>>> 
>>> I get the error messages below.
>>> I am scratching my head to full baldness trying to understand them.
>>> 
>>> They seem to suggest that my rankfile syntax is wrong
>>> (though I copied it from the FAQ and the mpiexec man page),
>>> or that Open MPI is not parsing it the way I expected.
>>> Or is it perhaps that it doesn't like the numbers I am using for the
>>> various slots in the rankfile?
>>> The error messages also complain about
>>> node allocation and oversubscribed slots,
>>> but the nodes were allocated by Torque, and the rankfiles were
>>> written with no intent to oversubscribe.
>>> 
>>> *First rankfile error:
>>> 
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>> 
>>> --------------------------------------------------------------------------
>>> 
>>> ... etc, etc ...
>>> 
>>> *Second rankfile error:
>>> 
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>> 
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>> 
>>> ... etc, etc ...
>>> 
>>> **********
>>> 
>>> I am stuck.
>>> Any help is much appreciated.
>>> Thank you.
>>> 
>>> Gus Correa
>>> 
>>> 
>>> 
>>> *****************************
>>> Snippet of /proc/cpuinfo
>>> *****************************
>>> 
>>> processor   : 0
>>> physical id : 0
>>> core id     : 0
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 1
>>> physical id : 0
>>> core id     : 1
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 2
>>> physical id : 0
>>> core id     : 2
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 3
>>> physical id : 0
>>> core id     : 3
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 4
>>> physical id : 1
>>> core id     : 0
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 5
>>> physical id : 1
>>> core id     : 1
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 6
>>> physical id : 1
>>> core id     : 2
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
>>> processor   : 7
>>> physical id : 1
>>> core id     : 3
>>> siblings    : 4
>>> cpu cores   : 4
>>> 
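>>> (So logical processors 0-3 sit on socket 0 as cores 0-3, and logical
>>> processors 4-7 sit on socket 1 as cores 0-3.)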