On 02.02.2009 at 11:31, Sangamesh B wrote:
On Mon, Feb 2, 2009 at 12:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 02.02.2009 at 05:44, Sangamesh B wrote:
On Sun, Feb 1, 2009 at 10:37 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 01.02.2009 at 16:00, Sangamesh B wrote:
On Sat, Jan 31, 2009 at 6:27 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 31.01.2009 at 08:49, Sangamesh B wrote:
On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 30.01.2009 at 15:02, Sangamesh B wrote:
Dear Open MPI,
Do you have a solution for the following problem with Open MPI (1.3) when it is run through Grid Engine?
I changed the global execd_params to H_MEMORYLOCKED=infinity (sketched below) and restarted sgeexecd on all nodes, but the problem still persists:
$ cat err.77.CPMD-OMPI
ssh_exchange_identification: Connection closed by remote host
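The execd_params change mentioned above can be made like this; a minimal sketch (the init-script path is an assumption and differs per installation):

# Open the global configuration in an editor and add/extend:
qconf -mconf
#   execd_params H_MEMORYLOCKED=infinity
# Verify the setting afterwards:
qconf -sconf | grep execd_params
# Then restart the execd on every node, e.g. via its init script:
/etc/init.d/sgeexecd restart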
I think this might already be the reason why it's not working. Does an mpihello program run fine through SGE?
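A quick smoke test that needs no MPI source is to run a plain command under mpirun from a job script; a minimal sketch (the PE name "orte" is only an example, any PE suitable for Open MPI works):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe orte 16
# Each rank prints the node it was started on, so the slot
# distribution across the nodes becomes visible in the job output:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS hostname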
No.
Any Open MPI parallel job through SGE runs only if it is running on a single node (i.e. 8 processes on 8 cores of a single node). If the number of processes is more than 8, SGE will schedule it onto 2 nodes, and the job fails with the above error.
Now I did a loose integration of Open MPI 1.3 with SGE. The job runs, but all 16 processes run on a single node.
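Because it only breaks once the job spans a second node, it is worth checking node-to-node ssh outside of SGE first; a minimal check (the node name is a placeholder):

# As the job owner, from one compute node to another:
ssh node02 hostname
# If this prompts for a password or is refused, the
# ssh_exchange_identification error above is likely a plain ssh
# problem on the hosts (e.g. sshd's MaxStartups limit or
# hosts.allow/hosts.deny), not an SGE one.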
What are the entries in `qconf -sconf` for:
rsh_command
rsh_daemon
$ qconf -sconf
global:
execd_spool_dir /opt/gridengine/default/spool
...
.....
qrsh_command /usr/bin/ssh
rsh_command /usr/bin/ssh
rlogin_command /usr/bin/ssh
rsh_daemon /usr/sbin/sshd
qrsh_daemon /usr/sbin/sshd
reprioritize 0
Must you use ssh? Often in a private cluster the rsh-based setup is fine, or with SGE 6.2 the built-in mechanism of SGE. Otherwise please follow this:
http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
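Switching to the built-in mechanism is only a change of the same qconf entries; a minimal sketch, assuming SGE 6.2 where the keyword "builtin" is accepted for these entries (the rsh-based variant is covered by the howto above):

# qconf -mconf, then replace the ssh entries with:
qrsh_command   builtin
rsh_command    builtin
rlogin_command builtin
rsh_daemon     builtin
qrsh_daemon    builtin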
I think it's better to check once with Open MPI 1.2.8.
What is your mpirun command in the job script? Are you getting the mpirun from Open MPI there? According to the output below, it's not a loose integration: you already prepare a machinefile, which is superfluous for Open MPI.
No. I've not prepared the machinefile for Open MPI.
For the tight integration job:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID
For the loose integration job:
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI_$JOB_ID.$NSLOTS
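With a working tight integration the whole job script can stay this small, because mpirun reads the allocation from SGE itself; a minimal sketch (the PE name "orte" and the slot count are assumptions):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe orte 16
# No -hostfile: with SGE support compiled in, mpirun takes the slot
# allocation from SGE and starts its daemons on the remote nodes via qrsh.
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID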
a) Did you compile Open MPI with "--with-sge"?
Yes. But ompi_info shows only one SGE component:
$ /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
b) When the $SGE_ROOT variable is set, Open MPI will use a Tight Integration automatically.
In the SGE job submission script, I set SGE_ROOT= <nothing>
This will set the variable to an empty string. You need to use:
unset SGE_ROOT
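The difference is easy to see in the job script itself; a minimal sketch:

# SGE has already exported SGE_ROOT into the job environment.
SGE_ROOT=                 # variable still exists, now as an empty string
env | grep '^SGE_ROOT='   # still prints "SGE_ROOT="
unset SGE_ROOT            # removes the variable entirely
env | grep '^SGE_ROOT='   # prints nothing, so Open MPI falls back
                          # to a plain ssh/rsh startup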
Right.
I used 'unset SGE_ROOT' in the job submission script. It's working now.
Hello world jobs are working now (single & multiple nodes).
Thank you for the help.
What can be the problem with tight integration?
There are obviously two issues for now with the Tight Integration for SGE:
- Some processes might throw an "err=2" for an unknown reason, and only from time to time, but they run fine.
- Processes vanish into a daemon although SGE's qrsh is used automatically (successive `ps -e f` show that it's called with "... orted --daemonize ..." for a short while). This I overlooked in my last post when I stated it's working, as my process allocation was fine. Only they weren't bound to any sge_shepherd.
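The daemonizing can be watched while a job starts up; a minimal sketch:

# Repeat a few times right after the job starts; with a correct tight
# integration the orted should stay a child of sge_shepherd:
ps -e f | grep -E 'sge_shepherd|orted'
# If "orted --daemonize" appears and the orted then reparents to init,
# the MPI processes are no longer under SGE's job control.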
It seems the SGE integration is broken, and it would indeed be better to stay with 1.2.8 for now :-/
-- Reuti