On 02.02.2009 at 11:31, Sangamesh B wrote:
On Mon, Feb 2, 2009 at 12:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 02.02.2009 at 05:44, Sangamesh B wrote:
On Sun, Feb 1, 2009 at 10:37 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 01.02.2009 at 16:00, Sangamesh B wrote:
On Sat, Jan 31, 2009 at 6:27 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 31.01.2009 at 08:49, Sangamesh B wrote:
On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 30.01.2009 at 15:02, Sangamesh B wrote:
Dear Open MPI,
Do you have a solution for the following problem with Open MPI (1.3) when it is run through Grid Engine?
I changed the global execd_params to H_MEMORYLOCKED=infinity (sketched below) and restarted sgeexecd on all nodes, but the problem still persists:
$ cat err.77.CPMD-OMPI
ssh_exchange_identification: Connection closed by remote host
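The execd_params change mentioned above can be made like this; a minimal sketch (the init-script path is an assumption and differs per installation):

# Open the global configuration in an editor and add/extend:
qconf -mconf
#   execd_params H_MEMORYLOCKED=infinity
# Verify the setting afterwards:
qconf -sconf | grep execd_params
# Then restart the execd on every node, e.g. via its init script:
/etc/init.d/sgeexecd restart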
I think this might already be the reason why it's not working. Does an mpihello program run fine through SGE?
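A quick smoke test that needs no MPI source is to run a plain command under mpirun from a job script; a minimal sketch (the PE name "orte" is only an example, any PE suitable for Open MPI works):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe orte 16
# Each rank prints the node it was started on, so the slot
# distribution across the nodes becomes visible in the job output:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS hostname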
No.
Any Open MPI parallel job through SGE runs only if it is running on a single node (i.e. 8 processes on 8 cores of a single node). If the number of processes is more than 8, SGE will schedule it onto 2 nodes, and the job fails with the above error.
Now I did a loose integration of Open MPI 1.3 with SGE. The job runs, but all 16 processes run on a single node.
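Because it only breaks once the job spans a second node, it is worth checking node-to-node ssh outside of SGE first; a minimal check (the node name is a placeholder):

# As the job owner, from one compute node to another:
ssh node02 hostname
# If this prompts for a password or is refused, the
# ssh_exchange_identification error above is likely a plain ssh
# problem on the hosts (e.g. sshd's MaxStartups limit or
# hosts.allow/hosts.deny), not an SGE one.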
What are the entries in `qconf -sconf` for:
rsh_command
rsh_daemon
$ qconf -sconf
global:
execd_spool_dir /opt/gridengine/default/spool
...
.....
qrsh_command /usr/bin/ssh
rsh_command /usr/bin/ssh
rlogin_command /usr/bin/ssh
rsh_daemon /usr/sbin/sshd
qrsh_daemon /usr/sbin/sshd
reprioritize 0
Must you use ssh? Often in a private cluster the rsh-based setup is fine, or with SGE 6.2 the built-in mechanism of SGE. Otherwise please follow this:
http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
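Switching to the built-in mechanism is only a change of the same qconf entries; a minimal sketch, assuming SGE 6.2 where the keyword "builtin" is accepted for these entries (the rsh-based variant is covered by the howto above):

# qconf -mconf, then replace the ssh entries with:
qrsh_command   builtin
rsh_command    builtin
rlogin_command builtin
rsh_daemon     builtin
qrsh_daemon    builtin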
I think it's better to check once with Open MPI 1.2.8.
What is your mpirun command in the job script? Are you getting the mpirun from Open MPI there? According to the output below, it's not a loose integration: you already prepare a machinefile, which is superfluous for Open MPI.
No. I've not prepared the machinefile for Open MPI.
For the tight integration job:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID
For the loose integration job:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI_$JOB_ID.$NSLOTS
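With a working tight integration the whole job script can stay this small, because mpirun reads the allocation from SGE itself; a minimal sketch (the PE name "orte" and the slot count are assumptions):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe orte 16
# No -hostfile: with SGE support compiled in, mpirun takes the slot
# allocation from SGE and starts its daemons on the remote nodes via qrsh.
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID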
a) Did you compile Open MPI with "--with-sge"?
Yes. But ompi_info shows only one SGE component:
$ /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
b) When the $SGE_ROOT variable is set, Open MPI will use a Tight Integration automatically.
In the SGE job submission script, I set SGE_ROOT= <nothing>
This will set the variable to an empty string. You need to use:
unset SGE_ROOT
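The difference is easy to see in the job script itself; a minimal sketch:

# SGE has already exported SGE_ROOT into the job environment.
SGE_ROOT=                 # variable still exists, now as an empty string
env | grep '^SGE_ROOT='   # still prints "SGE_ROOT="
unset SGE_ROOT            # removes the variable entirely
env | grep '^SGE_ROOT='   # prints nothing, so Open MPI falls back
                          # to a plain ssh/rsh startup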
Right.
I used 'unset SGE_ROOT' in the job submission script. It's working now.
Hello world jobs are working now (single & multiple nodes).
Thank you for the help.
What can be the problem with tight integration?
There are obviously two issues for now with the Tight Integration for SGE:
- Some processes might throw an "err=2" for an unknown reason, and only from time to time, but they run fine.
- Processes vanish into a daemon although SGE's qrsh is used automatically (successive `ps -e f` show that it's called with "... orted --daemonize ..." for a short while). This I overlooked in my last post when I stated it's working, as my process allocation was fine. Only they weren't bound to any sge_shepherd.
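The daemonizing can be watched while a job starts up; a minimal sketch:

# Repeat a few times right after the job starts; with a correct tight
# integration the orted should stay a child of sge_shepherd:
ps -e f | grep -E 'sge_shepherd|orted'
# If "orted --daemonize" appears and the orted then reparents to init,
# the MPI processes are no longer under SGE's job control.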
It seems the SGE integration is broken, and it would indeed be better to stay with 1.2.8 for now :-/
-- Reuti