OK, I just checked and you're right: both processes get run on the first node. So it seems that the "hostfile" option of mpirun is ignored, even though in my case it refers to a file properly listing the two nodes, like:

--------------------
node1
node2
--------------------
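In case it is useful for reproducing this, a minimal form of that check would be something like the following (a sketch only; it assumes the two-node $PBS_NODEFILE from the job and that the OMPI_COMM_WORLD_* variables are what you want to inspect):

--------------------
# Print, for each launched process, the host it landed on and its local rank/size:
mpirun -n 2 -hostfile $PBS_NODEFILE bash -c \
  'echo "$(hostname): local rank $OMPI_COMM_WORLD_LOCAL_RANK of $OMPI_COMM_WORLD_LOCAL_SIZE"'
# In the failing case described here, both lines show the same hostname,
# with local ranks 0 and 1 and local size 2.
--------------------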
I also tried logging in to node1 and launching with mpirun directly, without PBS, and the same thing happens. However, if I specify the "host" option instead, then the ranks get started on different nodes and it all works properly. I then tried the same from within the PBS script, and it worked there as well. Thus, to summarize, instead of:

mpirun -n 2 -hostfile $PBS_NODEFILE ./foo

one should use:

mpirun -n 2 --host node1,node2 ./foo

(A sketch of a full job script using this form is appended at the bottom of this message.) Rather strange, but the important thing is that it works somehow. Thanks for your help!

On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
<users@lists.open-mpi.org> wrote:
>
> Are you launching the job with "mpirun"? I'm not familiar with that cmd line
> and don't know what it does.
>
> Most likely explanation is that the mpirun from the prebuilt versions doesn't
> have TM support, and therefore doesn't understand the 1ppn directive in your
> cmd line. My guess is that you are using the ssh launcher - what is odd is
> that you should wind up with two procs on the first node, in which case those
> envars are correct. If you are seeing one proc on each node, then something
> is wrong.
>
>
> > On Jan 18, 2022, at 1:33 PM, Crni Gorac via users
> > <users@lists.open-mpi.org> wrote:
> >
> > I have one process per node; here is the corresponding line from my job
> > submission script (with compute nodes named "node1" and "node2"):
> >
> > #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >
> > On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> > <users@lists.open-mpi.org> wrote:
> >>
> >> Afraid I can't understand your scenario - when you say you "submit a job"
> >> to run on two nodes, how many processes are you running on each node?
> >>
> >>
> >>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users
> >>> <users@lists.open-mpi.org> wrote:
> >>>
> >>> I am using OpenMPI 4.1.2 from the MLNX_OFED_LINUX-5.5-1.0.3.2 distribution
> >>> and have PBS 18.1.4 installed on my cluster (the cluster nodes are running
> >>> CentOS 7.9). When I submit a job to run on two nodes of the cluster, both
> >>> ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2 instead of 1, and their
> >>> OMPI_COMM_WORLD_LOCAL_RANK values are set to 0 and 1 instead of both being
> >>> 0. At the same time, the hostfile generated by PBS ($PBS_NODEFILE)
> >>> properly lists the two nodes.
> >>>
> >>> I've tried OpenMPI 3 from HPC-X as well, and the same thing happens.
> >>> However, when I build OpenMPI myself (the notable difference from the
> >>> above-mentioned pre-built MPI versions is that I use the "--with-tm"
> >>> option to point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE
> >>> and OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >>>
> >>> I'm not sure how to debug the problem, or whether it is possible to fix
> >>> it at all with a pre-built OpenMPI version, so any suggestion is welcome.
> >>>
> >>> Thanks.
> >>
> >>
> >
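For completeness, a minimal sketch of what the full submission script could look like with the --host workaround. The resource-selection line is the one quoted earlier in this thread; the node names and the ./foo binary are placeholders, and the sketch assumes exactly one process per node:

--------------------
#!/bin/bash
#PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2

cd $PBS_O_WORKDIR

# Pass the hosts explicitly instead of the PBS-generated hostfile, since the
# prebuilt (TM-less) Open MPI appeared to ignore -hostfile in this setup.
mpirun -n 2 --host node1,node2 ./foo
--------------------

If hardcoding the node names is undesirable, the host list could presumably be derived from the hostfile itself, e.g. --host $(sort -u $PBS_NODEFILE | paste -sd, -), though that variant has not been verified here.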