Hello Ralph,

Here is the output for a failing machine:

[130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

======================   ALLOCATED NODES   ======================
farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 7 slots
that were requested by the application:
  10

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------


Here is an output of a passing machine:

[1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

======================   ALLOCATED NODES   ======================
hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================


Yes, the hostfile is available on all nodes through an NFS mount of all of
our home directories.
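
For reference, the hostfile lists seven nodes with one slot each, so mpirun
without an explicit -np should want 7 ranks in total. Here is a minimal
sketch (plain Python, not Open MPI code, and only my understanding of the
default behavior) of how the slot count is derived from that hostfile:

```python
# Hostfile contents as posted earlier in this thread.
HOSTFILE_TEXT = """\
farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1
"""

def count_slots(text):
    """Sum the slots=N values across all non-empty hostfile lines.

    Assumption: a bare hostname with no slots= clause counts as one slot,
    which mirrors (as far as I know) mpirun's default when no -np is given.
    """
    total = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        total += slots
    return total

print(count_slots(HOSTFILE_TEXT))  # prints 7
```

So both machines should see 7 available slots; the interesting question is
why the launch node's own processor count changes what mpirun thinks was
requested.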

On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:

>
>
> ---------- Forwarded message ---------
> From: Ralph H Castain <r...@open-mpi.org>
> Date: Thu, Nov 1, 2018 at 2:34 PM
> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
> To: Open MPI Users <users@lists.open-mpi.org>
>
>
> I’m a little under the weather and so will only be able to help a bit at a
> time. However, a couple of things to check:
>
> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought
> the allocation was
>
> * is the hostfile available on every node?
>
> Ralph
>
> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
> Hello Ralph,
>
> Attached below is the verbose output for a failing machine and a passing
> machine.
>
> Thanks,
> Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
>>
>>
>> ---------- Forwarded message ---------
>> From: Ralph H Castain <r...@open-mpi.org>
>> Date: Thu, Nov 1, 2018 at 1:07 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users <users@lists.open-mpi.org>
>>
>>
>> Set rmaps_base_verbose=10 for debugging output
>>
>> Sent from my iPhone
>>
>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>
>> The version by the way for Open-MPI is 3.1.2.
>>
>> -Adam LeBlanc
>>
>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc <alebl...@iol.unh.edu>
>> wrote:
>>
>>> Hello, I am an employee of the UNH InterOperability Lab, and we are in
>>> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
>>> purchased some new hardware that has one processor, and noticed an issue
>>> when running MPI jobs on nodes that do not have matching processor counts.
>>> If we launch the MPI job from a node that has 2 processors, it fails,
>>> stating that there are not enough resources, and does not start the run,
>>> like so:
>>>
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 14 slots
>>> that were requested by the application:
>>>   IMB-MPI1
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --------------------------------------------------------------------------
>>>
>>> If we launch the MPI job from the node with one processor, without
>>> changing the mpirun command at all, it runs as expected. Here is the
>>> command being run:
>>>
>>> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca
>>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>>> btl_openib_receive_queues P,65536,120,64,32 -hostfile
>>> /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>>
>>> Here is the hostfile being used:
>>>
>>> farbauti-ce.ofa.iol.unh.edu slots=1
>>> hyperion-ce.ofa.iol.unh.edu slots=1
>>> io-ce.ofa.iol.unh.edu slots=1
>>> jarnsaxa-ce.ofa.iol.unh.edu slots=1
>>> rhea-ce.ofa.iol.unh.edu slots=1
>>> tarqeq-ce.ofa.iol.unh.edu slots=1
>>> tarvos-ce.ofa.iol.unh.edu slots=1
>>>
>>> This seems like a bug and we would like some help to explain and fix what
>>> is happening. The IBTA plugfest saw similar behavior, so this should be
>>> reproducible.
>>>
>>> Thanks,
>>> Adam LeBlanc
>>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
> <passing_verbose_output.txt><failing_verbose_output.txt>
>
