On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
> As I recall, the problem was that qrsh isn't available on the backend 
> compute nodes, and so we can't use a tree for launch. If that isn't true, 
> then we can certainly adjust it.
>
qrsh should be available on all nodes of a SoGE cluster but, depending on how
things are set up, may not be findable (i.e. not in the PATH) when you
qrsh -inherit into a node. A workaround would be to start the backend
processes with qrsh -inherit -v PATH, which copies the PATH from the master
node to the slave node process, or otherwise to pass the location of qrsh
from one node to another. That of course assumes that qrsh is in the same
location on all nodes.
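A minimal sketch of that workaround (hostnames are hypothetical; this assumes a
running SoGE job whose parallel environment has control_slaves enabled, so that
qrsh -inherit is permitted):

```shell
# From the job's master node, start a process on an allocated slave node.
# -inherit reuses the job's existing allocation instead of submitting a new job;
# -v PATH exports the master's PATH so qrsh is findable on the slave side,
# which lets the slave qrsh -inherit onward to a further node in turn.
qrsh -inherit -v PATH node042 qrsh -inherit -v PATH node043 hostname
```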

I've tested that it is possible to qrsh from the head node of a job to a slave 
node and then on to
another slave node by this method.
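For reference, the fix mentioned in the report quoted below can be expressed as
an mpirun MCA setting (a sketch only; the host count, process count and binary
name are hypothetical):

```shell
# Raise the cap on simultaneous qrsh launches above its default of 128,
# so that orted daemons on more than 128 remote hosts can all be started
# in one flat (non-tree) launch wave.
mpirun --mca plm_rsh_num_concurrent 256 -np 200 ./xhpl
```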

William


> > On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
> > 
> > Hi,
> > 
> > While commissioning a new cluster, I wanted to run HPL across the whole 
> > thing using openmpi 2.0.1.
> > 
> > I couldn't get it to start on more than 129 hosts under Son of Gridengine 
> > (128 remote plus the localhost running the mpirun command). openmpi would 
> > sit there, waiting for all the orted's to check in; however, there were 
> > "only" a maximum of 128 qrsh processes, therefore a maximum of 128 orted's, 
> > therefore waiting a loooong time.
> > 
> > Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to 
> > launch.
> > 
> > Is this intentional, please?
> > 
> > Doesn't openmpi use a tree-like startup sometimes - any particular reason 
> > it's not using it here?


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
