I assume it is working with np=8 because the 8 processes are getting launched on the same node as mpirun and therefore there is no call to qrsh to start up any remote processes. When you go beyond 8, mpirun calls qrsh to start up processes on some of the remote nodes.
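
For what it's worth, those qrsh launches are only accepted when the parallel environment is set up for tight integration, so the PE configuration is usually the first thing to check. A typical tight-integration PE looks something like the following (the name "orte" and the values are just illustrative; yours may differ):

pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

In particular, control_slaves needs to be TRUE, otherwise the qrsh -inherit calls that mpirun makes to reach the other nodes will be rejected.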

I would suggest first that you replace your MPI program with just hostname to simplify the debugging (something along the lines of the sketch below). Then maybe you can forward along your qsub script as well as what your PE environment looks like (qconf -sp PE_NAME, where PE_NAME is the name of your parallel environment).
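
For example, a stripped-down submit script along these lines (the PE name "orte" and the slot count are placeholders; keep whatever your site actually uses) takes the application's MPI code out of the picture while still exercising the same qrsh launch path:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe orte 16
# Run plain hostname instead of mpi-ring; $NSLOTS is filled in by SGE.
# If this also dies once you go past 8 slots, the problem is in the
# SGE/qrsh daemon launch, not in the application itself.
/opt/openmpi/bin/mpirun --debug-daemons -np $NSLOTS hostname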

Rolf

Eli Morris wrote:
Hi guys,

I'm trying to run an example program, mpi-ring, on a Rocks cluster. When launched via SGE with 8 processors (we have 8 procs per node), the program works fine, but with any more processors it fails. I'm using Open MPI 1.3.2; the output of ompi_info -all is included below, at the end of this post.

Any help with this vexing problem is appreciated.

thanks,

Eli

[emorris@nimbus ~/test]$ echo $LD_LIBRARY_PATH
/opt/openmpi/lib:/lib:/usr/lib:/share/apps/sunstudio/rtlibs
[emorris@nimbus ~/test]$ echo $PATH
/opt/openmpi/bin:/share/apps/sunstudio/bin:/opt/ncl/bin:/home/tobrien/scripts:/usr/java/latest/bin:/opt/local/grads/bin:/share/apps/openmpilib/bin:/opt/local/ncl/ncl/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/opt/gridengine/bin/lx26-amd64:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin
[emorris@nimbus ~/test]$

Here is the mpirun command from the script:

/opt/openmpi/bin/mpirun --debug-daemons --mca plm_base_verbose 40 -mca plm_rsh_agent ssh -np $NSLOTS $HOME/test/mpi-ring

Here is the verbose output from a successful program start and from a failure:



Success:

[root@nimbus test]# more mpi-ring.qsub.o246
[compute-0-11.local:32126] mca: base: components_open: Looking for plm components
[compute-0-11.local:32126] mca: base: components_open: opening plm components
[compute-0-11.local:32126] mca: base: components_open: found loaded component rsh
[compute-0-11.local:32126] mca: base: components_open: component rsh has no register function
[compute-0-11.local:32126] mca: base: components_open: component rsh open function successful
[compute-0-11.local:32126] mca: base: components_open: found loaded component slurm
[compute-0-11.local:32126] mca: base: components_open: component slurm has no register function
[compute-0-11.local:32126] mca: base: components_open: component slurm open function successful
[compute-0-11.local:32126] mca:base:select: Auto-selecting plm components
[compute-0-11.local:32126] mca:base:select:( plm) Querying component [rsh]
[compute-0-11.local:32126] [[INVALID],INVALID] plm:rsh: using /opt/gridengine/bin/lx26-amd64/qrsh for launching
[compute-0-11.local:32126] mca:base:select:( plm) Query of component [rsh] set priority to 10
[compute-0-11.local:32126] mca:base:select:( plm) Querying component [slurm]
[compute-0-11.local:32126] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[compute-0-11.local:32126] mca:base:select:( plm) Selected component [rsh]
[compute-0-11.local:32126] mca: base: close: component slurm closed
[compute-0-11.local:32126] mca: base: close: unloading component slurm
[compute-0-11.local:32126] [[22715,0],0] node[0].name compute-0-11 daemon 0 arch ffc91200
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received add_local_procs
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received sync+nidmap from local proc [[22715,1],1]
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received sync+nidmap from local proc [[22715,1],0]
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received collective data cmd
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received collective data cmd
.
.
.

Failure:

[root@nimbus test]# more mpi-ring.qsub.o244
[compute-0-14.local:31175] mca:base:select:( plm) Querying component [rsh]
[compute-0-14.local:31175] [[INVALID],INVALID] plm:rsh: using /opt/gridengine/bin/lx26-amd64/qrsh for launching
[compute-0-14.local:31175] mca:base:select:( plm) Query of component [rsh] set priority to 10
[compute-0-14.local:31175] mca:base:select:( plm) Querying component [slurm]
[compute-0-14.local:31175] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[compute-0-14.local:31175] mca:base:select:( plm) Selected component [rsh]
Starting server daemon at host "compute-0-6.local"
Server daemon successfully started with task id "1.compute-0-6"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 31176) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


[...snip...]


--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
