I assume it is working with np=8 because the 8 processes are getting launched on the same node as mpirun and therefore there is no call to qrsh to start up any remote processes. When you go beyond 8, mpirun calls qrsh to start up processes on some of the remote nodes.
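
For what it's worth, those qrsh launches are only accepted when the parallel environment is set up for tight integration, so the PE configuration is usually the first thing to check. A typical tight-integration PE looks something like the following (the name "orte" and the values are just illustrative; yours may differ):

pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

In particular, control_slaves needs to be TRUE, otherwise the qrsh -inherit calls that mpirun makes to reach the other nodes will be rejected.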

I would suggest first that you replace your MPI program with just hostname to simplify the debugging (something along the lines of the sketch below). Then maybe you can forward along your qsub script as well as what your PE environment looks like (qconf -sp PE_NAME, where PE_NAME is the name of your parallel environment).
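
For example, a stripped-down submit script along these lines (the PE name "orte" and the slot count are placeholders; keep whatever your site actually uses) takes the application's MPI code out of the picture while still exercising the same qrsh launch path:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe orte 16
# Run plain hostname instead of mpi-ring; $NSLOTS is filled in by SGE.
# If this also dies once you go past 8 slots, the problem is in the
# SGE/qrsh daemon launch, not in the application itself.
/opt/openmpi/bin/mpirun --debug-daemons -np $NSLOTS hostname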

Rolf

Eli Morris wrote:
Hi guys,

I'm trying to run an example program, mpi-ring, on a Rocks cluster. When launched via SGE with 8 processors (we have 8 procs per node), the program works fine, but with any more processors it fails. I'm using Open MPI 1.3.2; the output of ompi_info -all is included below, at the end of this post.

Any help with this vexing problem is appreciated.

thanks,

Eli

[emorris@nimbus ~/test]$ echo $LD_LIBRARY_PATH
/opt/openmpi/lib:/lib:/usr/lib:/share/apps/sunstudio/rtlibs
[emorris@nimbus ~/test]$ echo $PATH
/opt/openmpi/bin:/share/apps/sunstudio/bin:/opt/ncl/bin:/home/tobrien/scripts:/usr/java/latest/bin:/opt/local/grads/bin:/share/apps/openmpilib/bin:/opt/local/ncl/ncl/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/opt/gridengine/bin/lx26-amd64:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin
[emorris@nimbus ~/test]$

Here is the mpirun command from the script:

/opt/openmpi/bin/mpirun --debug-daemons --mca plm_base_verbose 40 -mca plm_rsh_agent ssh -np $NSLOTS $HOME/test/mpi-ring

Here is the verbose output from a successful program start and from a failure:



Success:

[root@nimbus test]# more mpi-ring.qsub.o246
[compute-0-11.local:32126] mca: base: components_open: Looking for plm components
[compute-0-11.local:32126] mca: base: components_open: opening plm components
[compute-0-11.local:32126] mca: base: components_open: found loaded component rsh
[compute-0-11.local:32126] mca: base: components_open: component rsh has no register function
[compute-0-11.local:32126] mca: base: components_open: component rsh open function successful
[compute-0-11.local:32126] mca: base: components_open: found loaded component slurm
[compute-0-11.local:32126] mca: base: components_open: component slurm has no register function
[compute-0-11.local:32126] mca: base: components_open: component slurm open function successful
[compute-0-11.local:32126] mca:base:select: Auto-selecting plm components
[compute-0-11.local:32126] mca:base:select:( plm) Querying component [rsh]
[compute-0-11.local:32126] [[INVALID],INVALID] plm:rsh: using /opt/gridengine/bin/lx26-amd64/qrsh for launching
[compute-0-11.local:32126] mca:base:select:( plm) Query of component [rsh] set priority to 10
[compute-0-11.local:32126] mca:base:select:( plm) Querying component [slurm]
[compute-0-11.local:32126] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[compute-0-11.local:32126] mca:base:select:( plm) Selected component [rsh]
[compute-0-11.local:32126] mca: base: close: component slurm closed
[compute-0-11.local:32126] mca: base: close: unloading component slurm
[compute-0-11.local:32126] [[22715,0],0] node[0].name compute-0-11 daemon 0 arch ffc91200
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received add_local_procs
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received sync+nidmap from local proc [[22715,1],1]
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received sync+nidmap from local proc [[22715,1],0]
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received collective data cmd
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received collective data cmd
.
.
.

Failure:

[root@nimbus test]# more mpi-ring.qsub.o244
[compute-0-14.local:31175] mca:base:select:( plm) Querying component [rsh]
[compute-0-14.local:31175] [[INVALID],INVALID] plm:rsh: using /opt/gridengine/bin/lx26-amd64/qrsh for launching
[compute-0-14.local:31175] mca:base:select:( plm) Query of component [rsh] set priority to 10
[compute-0-14.local:31175] mca:base:select:( plm) Querying component [slurm]
[compute-0-14.local:31175] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[compute-0-14.local:31175] mca:base:select:( plm) Selected component [rsh]
Starting server daemon at host "compute-0-6.local"
Server daemon successfully started with task id "1.compute-0-6"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 31176) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


[...snip...]


--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
