Hello Ralph,

Here is the output for a failing machine:

[130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1

======================   ALLOCATED NODES   ======================
        farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
        hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 7
slots that were requested by the application:
  10

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------

Here is the output for a passing machine:

[1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1

======================   ALLOCATED NODES   ======================
        hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
        farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================

Yes, the hostfile is available on all nodes through an NFS mount for all of our home directories.
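In case it is useful, here is roughly how we can double-check on our side that every node really sees the same hostfile over NFS, and what allocation mpirun computes from it. This is only a sketch; it assumes passwordless ssh to each host listed in the hostfile and that md5sum is available on each node:

    HOSTFILE=/home/soesterreich/ce-mpi-hosts

    # confirm every node in the hostfile sees an identical copy over NFS
    for h in $(awk '{print $1}' "$HOSTFILE"); do
        echo "== $h =="
        ssh "$h" md5sum "$HOSTFILE"
    done

    # print the allocation mpirun computes, using a trivial app instead of IMB-MPI1
    mpirun -hostfile "$HOSTFILE" --display-allocation hostname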
On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:

> ---------- Forwarded message ---------
> From: Ralph H Castain <r...@open-mpi.org>
> Date: Thu, Nov 1, 2018 at 2:34 PM
> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
> To: Open MPI Users <users@lists.open-mpi.org>
>
> I’m a little under the weather and so will only be able to help a bit at a
> time. However, a couple of things to check:
>
> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought
> the allocation was
>
> * is the hostfile available on every node?
>
> Ralph
>
> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
> Hello Ralph,
>
> Attached below is the verbose output for a failing machine and a passing
> machine.
>
> Thanks,
> Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
>> ---------- Forwarded message ---------
>> From: Ralph H Castain <r...@open-mpi.org>
>> Date: Thu, Nov 1, 2018 at 1:07 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users <users@lists.open-mpi.org>
>>
>> Set rmaps_base_verbose=10 for debugging output
>>
>> Sent from my iPhone
>>
>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>
>> The version, by the way, for Open-MPI is 3.1.2.
>>
>> -Adam LeBlanc
>>
>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>
>>> Hello, I am an employee of the UNH InterOperability Lab, and we are in
>>> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We
>>> have purchased some new hardware that has one processor, and we noticed
>>> an issue when running MPI jobs on nodes that do not have similar
>>> processor counts. If we launch the MPI job from a node that has 2
>>> processors, it fails, stating that there are not enough resources, and
>>> will not start the run, like so:
>>>
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 14
>>> slots that were requested by the application:
>>>   IMB-MPI1
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --------------------------------------------------------------------------
>>>
>>> If we launch the MPI job from the node with one processor, without
>>> changing the mpirun command at all, it runs as expected.
>>>
>>> Here is the command being run:
>>>
>>> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca
>>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>>> btl_openib_receive_queues P,65536,120,64,32 -hostfile
>>> /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>>
>>> Here is the hostfile being used:
>>>
>>> farbauti-ce.ofa.iol.unh.edu slots=1
>>> hyperion-ce.ofa.iol.unh.edu slots=1
>>> io-ce.ofa.iol.unh.edu slots=1
>>> jarnsaxa-ce.ofa.iol.unh.edu slots=1
>>> rhea-ce.ofa.iol.unh.edu slots=1
>>> tarqeq-ce.ofa.iol.unh.edu slots=1
>>> tarvos-ce.ofa.iol.unh.edu slots=1
>>>
>>> This seems like a bug and we would like some help to explain and fix
>>> what is happening. The IBTA plugfest saw similar behaviours, so this
>>> should be reproducible.
>>>
>>> Thanks,
>>> Adam LeBlanc
>
> <passing_verbose_output.txt><failing_verbose_output.txt>
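Also, for the next round of debugging, the failing case can be rerun with the rmaps setting Ralph suggested in the quoted thread added alongside the ras one, and with the rank count pinned to the 7 slots the hostfile advertises. This is only a sketch; the btl/pml arguments from the full command above are omitted here for brevity:

    mpirun --mca ras_base_verbose 5 --mca rmaps_base_verbose 10 \
        -np 7 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1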