Hi,

I just stumbled upon the following behavior of Open MPI 1.4.2. The jobscript used:

***
#!/bin/sh
export PATH=~/local/openmpi-1.4.2/bin:$PATH
cat $PE_HOSTFILE
mpiexec ./dummy.sh
***

with dummy.sh:

***
#!/bin/sh
env | grep TMPDIR
sleep 30
***
===
Situation 1: getting 4 slots in total from 2 queues on 2 nodes. Output:

pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q

The slot of the master is in the first line of the PE_HOSTFILE. The job runs on 
pc15381, with one local fork of dummy.sh and two `qrsh -inherit` calls from 
pc15381 to pc15370 (checked with `ps -e f`). So only 3 instances are running 
instead of four.

===
Situation 2: getting 4 slots in total from 2 queues on one and the same node.

pc15370 2 all.q@pc15370 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
TMPDIR=/tmp/1889.1.all.q
TMPDIR=/tmp/1889.1.all.q

It looks like, for the master node of the parallel job, only one entry of the 
PE_HOSTFILE is ever honored. So 2 processes are missing here.

==

So I see two issues:

(1) The number of started tasks is wrong. I'm not sure whether the correct 
behavior should be:

a) add up all slots for each machine, including the master node of the job, and 
start this number of tasks, or

b) fork only the slots mentioned for the master queue of the job, and do a 
local `qrsh -inherit` for the slots running in a different queue on the same 
host. In this case the third column of the PE_HOSTFILE would have to be honored too.
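For illustration of interpretation a) (a sketch of the accounting only, not of how 
Open MPI actually parses the file): summing the slot counts per host over all 
PE_HOSTFILE lines gives the per-node task counts I would expect. The hostfile 
content below is copied from Situation 1 above.

```shell
#!/bin/sh
# Write a sample PE_HOSTFILE matching Situation 1 (2 queues on 2 nodes).
cat > pe_hostfile.txt <<'EOF'
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
EOF

# Column 1 is the host, column 2 the slot count for that host/queue pair.
# Summing column 2 per host yields the total tasks per node under a).
awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' \
    pe_hostfile.txt | sort
```

With this input the sum is 2 tasks on each of pc15381 and pc15370, i.e. 4 in 
total, whereas only 3 were actually started.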


(2) In situation 1: according to the hostfile, one slot on pc15370 should run in 
all.q and get the matching $TMPDIR, yet both report extra.q. This is of course a 
bug in SGE, which I will follow up on the SGE list.


-- Reuti
