Thanks Ralph,

Indeed, if I add :8 I get back the expected behavior. I can cope with this
(I don't usually restrict my runs to a subset of the nodes).
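
For the archives, what works for me now is something along these lines (with a
placeholder application name):

  mpirun -np 4 --host dancer00:8,dancer01:8 ./my_app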

  George.


On Tue, Apr 25, 2017 at 4:53 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> I suspect it read the file just fine - what you are seeing in the output
> is a reflection of the community’s design decision that only one slot is
> allocated each time a node is listed in -host. This is why the :N
> modifier was added, so you can specify the number of slots to use in lieu
> of writing the host name N times.
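>
> For illustration (hypothetical host names node01/node02 and a placeholder
> application ./a.out):
>
>   mpirun -np 16 --host node01:8,node02:8 ./a.out
>
> is treated the same as listing node01 and node02 eight times each on the
> -host line.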
>
> If this isn’t what you feel it should do, then please look at the files in
> orte/util/dash_host and feel free to propose a modification to the
> behavior. I personally am not bound to any particular answer, but I really
> don’t have time to address it again.
>
>
>
> On Apr 25, 2017, at 1:35 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Just to be clear, the hostfile contains the correct info:
>
> dancer00 slots=8
> dancer01 slots=8
>
> The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.
>
>   George.
>
>
>
> On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> I confirm a similar issue in a more managed environment. I have a
>> hostfile that has worked for the last few years and that spans a small
>> cluster (30 nodes of 8 cores each).
>>
>> Trying to spawn processes across P nodes fails whenever the number of
>> processes is larger than P (despite the fact that there are more than
>> enough resources, and that this information is provided via the
>> hostfile).
>>
>> George.
>>
>>
>> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host
>> dancer00,dancer01 --map-by
>>
>> [dancer.icl.utk.edu:13457] mca: base: components_register: registering
>> framework ras components
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded
>> component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component
>> simulator register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded
>> component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component
>> slurm register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded
>> component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component
>> loadleveler register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded
>> component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm
>> register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras
>> components
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded
>> component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded
>> component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm
>> open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded
>> component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component
>> loadleveler open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded
>> component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open
>> function successful
>> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component
>> [simulator]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component
>> [slurm]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component
>> [loadleveler]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component
>> [tm]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) No component selected!
>>
>> ======================   ALLOCATED NODES   ======================
>> dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>>
>> ======================   ALLOCATED NODES   ======================
>> dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   startup
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>> --------------------------------------------------------------------------
>>
>>
>>
>>
>> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org>
>> wrote:
>>
>>> Okay - so effectively you have no hostfile, and no allocation. So this
>>> is running just on the one node where mpirun exists?
>>>
>>> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
>>> let’s see what it found
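>>>
>>> i.e., something along these lines (reusing your echo test):
>>>
>>>   mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo "hello"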
>>>
>>> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <
>>> eric.chamberl...@giref.ulaval.ca> wrote:
>>> >
>>> > Hi,
>>> >
>>> > the host file has been constructed automatically by the
>>> configuration+installation process and seems to contain only comments and a
>>> blank line:
>>> >
>>> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>>> > #
>>> > # Copyright (c) 2004-2005 The Trustees of Indiana University and
>>> Indiana
>>> > #                         University Research and Technology
>>> > #                         Corporation.  All rights reserved.
>>> > # Copyright (c) 2004-2005 The University of Tennessee and The
>>> University
>>> > #                         of Tennessee Research Foundation.  All rights
>>> > #                         reserved.
>>> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
>>> > #                         University of Stuttgart.  All rights
>>> reserved.
>>> > # Copyright (c) 2004-2005 The Regents of the University of California.
>>> > #                         All rights reserved.
>>> > # $COPYRIGHT$
>>> > #
>>> > # Additional copyrights may follow
>>> > #
>>> > # $HEADER$
>>> > #
>>> > # This is the default hostfile for Open MPI.  Notice that it does not
>>> > # contain any hosts (not even localhost).  This file should only
>>> > # contain hosts if a system administrator wants users to always have
>>> > # the same set of default hosts, and is not using a batch scheduler
>>> > # (such as SLURM, PBS, etc.).
>>> > #
>>> > # Note that this file is *not* used when running in "managed"
>>> > # environments (e.g., running in a job under a job scheduler, such as
>>> > # SLURM or PBS / Torque).
>>> > #
>>> > # If you are primarily interested in running Open MPI on one node, you
>>> > # should *not* simply list "localhost" in here (contrary to prior MPI
>>> > # implementations, such as LAM/MPI).  A localhost-only node list is
>>> > # created by the RAS component named "localhost" if no other RAS
>>> > # components were able to find any hosts to run on (this behavior can
>>> > # be disabled by excluding the localhost RAS component by specifying
>>> > # the value "^localhost" [without the quotes] to the "ras" MCA
>>> > # parameter).
>>> >
>>> > (15:53:52) [zorg]:~>
>>> >
>>> > Thanks!
>>> >
>>> > Eric
>>> >
>>> >
>>> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
>>> >> What is in your hostfile?
>>> >>
>>> >>
>>> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <
>>> eric.chamberl...@giref.ulaval.ca> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> just testing the 3.x branch... I launch:
>>> >>>
>>> >>> mpirun -n 8 echo "hello"
>>> >>>
>>> >>> and I get:
>>> >>>
>>> >>> --------------------------------------------------------------------------
>>> >>> There are not enough slots available in the system to satisfy the 8 slots
>>> >>> that were requested by the application:
>>> >>> echo
>>> >>>
>>> >>> Either request fewer slots for your application, or make more slots available
>>> >>> for use.
>>> >>> --------------------------------------------------------------------------
>>> >>>
>>> >>> I have to oversubscribe, so what do I have to do to bypass this
>>> "limitation"?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Eric
>>> >>>
>>> >>> configure log:
>>> >>>
>>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
>>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
>>> >>>
>>> >>>
>>> >>> here is the complete message:
>>> >>>
>>> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
>>> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
>>> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
>>> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
>>> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
>>> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
>>> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
>>> >>> --------------------------------------------------------------------------
>>> >>> There are not enough slots available in the system to satisfy the 8 slots
>>> >>> that were requested by the application:
>>> >>> echo
>>> >>>
>>> >>> Either request fewer slots for your application, or make more slots available
>>> >>> for use.
>>> >>> --------------------------------------------------------------------------
>>> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
>>> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
>>> >>>
>>> >>
>>> >>
>>>
>>>
>>
>>
>
>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
