As I recall, the problem was that qrsh isn’t available on the backend compute nodes, and so we can’t use a tree for launch. If that isn’t true, then we can certainly adjust it.
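In the meantime, the workaround described in the quoted message (raising plm_rsh_num_concurrent above its default of 128) might look roughly like the sketch below. The host count of 200 and the ./xhpl binary name are placeholders, not values from the original report; the only parameter taken from the thread is plm_rsh_num_concurrent.

```shell
# Placeholder: number of hosts in the job (assumed value, substitute your own).
num_hosts=200

# Option 1: set the MCA parameter through the environment before calling mpirun.
# Open MPI reads MCA parameters from variables prefixed with OMPI_MCA_.
export OMPI_MCA_plm_rsh_num_concurrent="$num_hosts"

# Option 2: pass the same parameter on the mpirun command line.
# The command is only printed here; ./xhpl is an assumed name for the HPL binary.
echo "mpirun --mca plm_rsh_num_concurrent $num_hosts ./xhpl"
```

Either form should be enough on its own; setting the limit at or above the number of remote hosts is a guess at a safe value rather than a documented requirement.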
> On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
>
> Hi,
>
> While commissioning a new cluster, I wanted to run HPL across the whole thing
> using openmpi 2.0.1.
>
> I couldn't get it to start on more than 129 hosts under Son of Gridengine
> (128 remote plus the localhost running the mpirun command). openmpi would sit
> there, waiting for all the orted's to check in; however, there were "only" a
> maximum of 128 qrsh processes, therefore a maximum of 128 orted's, therefore
> waiting a loooong time.
>
> Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to
> launch.
>
> Is this intentional, please?
>
> Doesn't openmpi use a tree-like startup sometimes - any particular reason
> it's not using it here?
>
> Cheers,
>
> Mark
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users