Ralph Castain <r...@open-mpi.org> writes:

>> On Nov 13, 2014, at 3:36 PM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>> 
>> Ralph Castain <r...@open-mpi.org> writes:
>> 
>>>>>> cn6050 16 par6.q@cn6050 <NULL>
>>>>>> cn6045 16 par6.q@cn6045 <NULL>
>>>> 
>>>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>>> 
>>> Hey Reuti
>>> 
>>> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
>>> module, and it looks like it is expecting a different format. I suspect 
>>> that is the problem
>> 
>> I should have said that the parsing code is OK, and it specifically
>> works with the above.  (It should probably be made more robust by
>> ensuring it reads to end-of-line, and preferably should interpret a
>> binding string as the fourth field.)
>
> Afraid I am confused - if we look at the user’s output from mpirun
> --display-allocation, you can see that we only got the first line in
> the above. We didn’t see the second node at all. So the parsing code
> is clearly not reading that file correctly, or they have some envar
> set that is telling us to ignore the second node somehow.
>
> What am I missing?

Well, you don't know what mpirun's environment looked like, other than
NSLOTS apparently being intact.  The output from mpirun was consistent
with clobbering the other variables Reuti listed (breaking the SGE
"tight integration"):

  $ cat STDIN.o$(qsub -pe mpi 32 -l p=16,h_rt=9 -terse -sync y | head -1)
  unset PE_HOSTFILE
  mpirun --np $NSLOTS --display-allocation true

  ======================   ALLOCATED NODES   ======================
        comp162: slots=16 max_slots=0 slots_inuse=0 state=UP
  =================================================================
  $ 
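
For anyone wanting to check what their own job sees, something like the
following could go in the job script just before mpirun. It is only a rough
sketch: the function name pe_hostfile_summary is made up for illustration,
and it assumes the hostfile lines look like the ones quoted above (host,
slot count, queue, binding).

```shell
# Hypothetical helper, not part of SGE or Open MPI.
# Assumes PE_HOSTFILE lines of the form: host slots queue binding
pe_hostfile_summary () {
    # Fail loudly if the tight-integration variable has been clobbered.
    if [ -z "${PE_HOSTFILE:-}" ] || [ ! -r "$PE_HOSTFILE" ]; then
        echo "PE_HOSTFILE unset or unreadable" >&2
        return 1
    fi
    # Field 1 is the host name, field 2 the slot count on that host.
    awk 'NF >= 2 { printf "%s: slots=%d\n", $1, $2; total += $2 }
         END { printf "total slots=%d\n", total }' "$PE_HOSTFILE"
}
```

If that fails, or lists fewer hosts than were requested, the problem is
likely in the job's environment rather than in the parsing code.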
