Hi, I just stumbled over the following behavior of Open MPI 1.4.2. The jobscript used:
***
#!/bin/sh
export PATH=~/local/openmpi-1.4.2/bin:$PATH
cat $PE_HOSTFILE
mpiexec ./dummy.sh
***

with dummy.sh:

***
#!/bin/sh
env | grep TMPDIR
sleep 30
***

=== Situation 1: getting 4 slots in total from 2 queues on 2 nodes.

Output:

pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q

The slot of the master is in the first line of the PE_HOSTFILE. The job runs on pc15381, with one local fork of dummy.sh and two `qrsh -inherit` calls from pc15381 to pc15370 (checked with `ps -e f`). So only 3 instances are running instead of four.

=== Situation 2: getting 4 slots in total from 2 queues on one and the same node.

pc15370 2 all.q@pc15370 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
TMPDIR=/tmp/1889.1.all.q
TMPDIR=/tmp/1889.1.all.q

It looks like only one entry of the PE_HOSTFILE is ever honored for the master node of the parallel job. So 2 processes are missing here.

==

So I see two issues:

(1) The number of started tasks is wrong. I'm not sure which of these the correct behavior should be:

a) add up all slots for each machine, also for the master node of the job, and fork this number of tasks locally (a sketch of this is in the P.S. below); or

b) fork locally only the slots granted in the master queue of the job, and do a local `qrsh -inherit` for the slots granted in a different queue on the same host. In this case the third column of the PE_HOSTFILE (the queue instance) would have to be honored too.

(2) In situation 1: according to the PE_HOSTFILE, one slot on pc15370 should run in all.q and get the appropriate $TMPDIR, but all three tasks report the extra.q one. This is of course a bug in SGE, which I will investigate on the SGE list.

-- Reuti
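
P.S.: To illustrate option a): a minimal sketch of summing the slot counts per host from the PE_HOSTFILE, assuming the queue instance in column 3 is deliberately ignored. This is only meant to show the intended aggregation, not Open MPI's actual parsing code:

***
#!/bin/sh
# Sum the slot counts (column 2) per host (column 1) of the PE_HOSTFILE,
# ignoring the queue instance in column 3.
awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' "$PE_HOSTFILE"
***

For situation 1 above this yields "pc15381 2" and "pc15370 2", i.e. 4 tasks in total; for situation 2 it yields "pc15370 4".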