Hi,
While commissioning a new cluster, I wanted to run HPL across the whole
thing using Open MPI 2.0.1.
I couldn't get it to start on more than 129 hosts (128 remote plus the
localhost running the mpirun command) under Son of Grid Engine. mpirun
would sit there waiting for all the orteds to check in; however, there
were never more than 128 qrsh processes, hence at most 128 orteds, hence
a very long wait.
Increasing the plm_rsh_num_concurrent MCA parameter beyond its default of
128 gets the job to launch.
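For reference, this is roughly the workaround I used; the value 256 and the hostfile name are just illustrative:

```shell
# Raise the cap on concurrent rsh/qrsh launches above the 128 default
# (value chosen to cover all remote hosts in this example run).
mpirun --mca plm_rsh_num_concurrent 256 \
       -np 4096 --hostfile hosts.txt ./xhpl

# Equivalently, via the environment:
export OMPI_MCA_plm_rsh_num_concurrent=256
```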
Is this intentional, please?
Doesn't openmpi use a tree-like startup sometimes - any particular reason
it's not using it here?
Cheers,
Mark
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users