Hmmm - try adding a value for nprocs instead of leaving it blank. Say "-np 7"
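
For example, something like this (a sketch reusing the hostfile path from your command; keep your other MCA options as they are):

mpirun -np 7 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1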

Sent from my iPhone

> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
> 
> Hello Ralph,
> 
> Here is the output for a failing machine:
> 
> [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca 
> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 
> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues 
> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca 
> ras_base_verbose 5 IMB-MPI1
> 
> ======================   ALLOCATED NODES   ======================
>       farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>       hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 7 slots
> that were requested by the application:
>   10
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> 
> 
> Here is an output of a passing machine:
> 
> [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca 
> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 
> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues 
> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca 
> ras_base_verbose 5 IMB-MPI1
> 
> ======================   ALLOCATED NODES   ======================
>       hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>       farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> 
> 
> Yes, the hostfile is available on all nodes through an NFS mount of our 
> home directories.
> 
>> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>> 
>> 
>> ---------- Forwarded message ---------
>> From: Ralph H Castain <r...@open-mpi.org>
>> Date: Thu, Nov 1, 2018 at 2:34 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users <users@lists.open-mpi.org>
>> 
>> 
>> I'm a little under the weather and so will only be able to help a bit at a 
>> time. However, a couple of things to check:
>> 
>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought the 
>> allocation was
>> 
>> * is the hostfile available on every node?
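>> 
>> For the first point, the flag just goes on the existing command line, e.g. 
>> (illustrative only; keep the rest of your options as they are):
>> 
>> mpirun --mca ras_base_verbose 5 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1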
>> 
>> Ralph
>> 
>>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>> 
>>> Hello Ralph,
>>> 
>>> Attached below is the verbose output for a failing machine and a passing 
>>> machine.
>>> 
>>> Thanks,
>>> Adam LeBlanc
>>> 
>>>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>>> 
>>>> 
>>>> ---------- Forwarded message ---------
>>>> From: Ralph H Castain <r...@open-mpi.org>
>>>> Date: Thu, Nov 1, 2018 at 1:07 PM
>>>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>> 
>>>> 
>>>> Set rmaps_base_verbose=10 for debugging output 
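>>>> 
>>>> One way to do that (a sketch; either form sets the same MCA parameter):
>>>> 
>>>> mpirun --mca rmaps_base_verbose 10 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>>> # or, via the environment:
>>>> export OMPI_MCA_rmaps_base_verbose=10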
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>>>> 
>>>>> By the way, the Open MPI version is 3.1.2.
>>>>> 
>>>>> -Adam LeBlanc
>>>>> 
>>>>>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc <alebl...@iol.unh.edu> 
>>>>>> wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I am an employee of the UNH InterOperability Lab, and we are in the 
>>>>>> process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have 
>>>>>> purchased some new hardware that has one processor, and we noticed an 
>>>>>> issue when running MPI jobs across nodes that do not have the same 
>>>>>> processor count. If we launch the MPI job from a node that has 2 
>>>>>> processors, it fails, stating that there are not enough resources, and 
>>>>>> does not start the run, like so:
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> There are not enough slots available in the system to satisfy the 14 
>>>>>> slots
>>>>>> that were requested by the application:
>>>>>>   IMB-MPI1
>>>>>> 
>>>>>> Either request fewer slots for your application, or make more slots 
>>>>>> available
>>>>>> for use.
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> If we launch the MPI job from the node with one processor, without 
>>>>>> changing the mpirun command at all, it runs as expected.
>>>>>> 
>>>>>> Here is the command being run:
>>>>>> 
>>>>>> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca 
>>>>>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 
>>>>>> --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile 
>>>>>> /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>>>>> 
>>>>>> Here is the hostfile being used:
>>>>>> 
>>>>>> farbauti-ce.ofa.iol.unh.edu slots=1
>>>>>> hyperion-ce.ofa.iol.unh.edu slots=1
>>>>>> io-ce.ofa.iol.unh.edu slots=1
>>>>>> jarnsaxa-ce.ofa.iol.unh.edu slots=1
>>>>>> rhea-ce.ofa.iol.unh.edu slots=1
>>>>>> tarqeq-ce.ofa.iol.unh.edu slots=1
>>>>>> tarvos-ce.ofa.iol.unh.edu slots=1
>>>>>> 
>>>>>> This seems like a bug, and we would like some help explaining and fixing 
>>>>>> what is happening. The IBTA plugfest saw similar behaviour, so this 
>>>>>> should be reproducible.
>>>>>> 
>>>>>> Thanks,
>>>>>> Adam LeBlanc
>>> <passing_verbose_output.txt><failing_verbose_output.txt>
>> 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
