I suspect it read the file just fine - what you are seeing in the output is a 
reflection of the community’s design decision that only one slot is allocated 
each time a node is listed in -host. This is why the :N modifier was added: it 
lets you specify the number of slots to use instead of writing the host name 
N times.
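
For example, with the two hosts from your cmd line, something like this should 
give you 8 slots on each host (./my_app here is just a placeholder for your 
binary):

    mpirun -np 4 --host dancer00:8,dancer01:8 ./my_app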

If this isn’t what you feel it should do, then please look at the files in 
orte/util/dash_host and feel free to propose a modification to the behavior. I 
personally am not bound to any particular answer, but I really don’t have time 
to address it again.



> On Apr 25, 2017, at 1:35 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> Just to be clear, the hostfile contains the correct info:
> 
> dancer00 slots=8
> dancer01 slots=8
> 
> The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.
> 
>   George.
> 
> 
> 
> On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> I confirm a similar issue in a more managed environment. I have a hostfile 
> that has worked for the last few years and that spans a small cluster (30 
> nodes of 8 cores each).
> 
> Trying to spawn any number of processes across P nodes fails if the number of 
> processes is larger than P (despite the fact that there are more than enough 
> resources, and that this information is provided via the hostfile).
> 
> George.
> 
> 
> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
> 
> [dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [simulator]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [slurm]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [loadleveler]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [tm]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) No component selected!
> 
> ======================   ALLOCATED NODES   ======================
>       dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> 
> ======================   ALLOCATED NODES   ======================
>       dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>       dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   startup
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> 
> 
> 
> 
> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Okay - so effectively you have no hostfile, and no allocation. So this is 
> running just on the one node where mpirun exists?
> 
> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and 
> let’s see what it found
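> 
> For example, combined with the echo test from your earlier mail, that would 
> look something like:
> 
>     mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo "hello"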
> 
> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
> >
> > Hi,
> >
> > the host file has been constructed automatically by the 
> > configuration+installation process and seems to contain only comments and a 
> > blank line:
> >
> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
> > #
> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
> > #                         University Research and Technology
> > #                         Corporation.  All rights reserved.
> > # Copyright (c) 2004-2005 The University of Tennessee and The University
> > #                         of Tennessee Research Foundation.  All rights
> > #                         reserved.
> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
> > #                         University of Stuttgart.  All rights reserved.
> > # Copyright (c) 2004-2005 The Regents of the University of California.
> > #                         All rights reserved.
> > # $COPYRIGHT$
> > #
> > # Additional copyrights may follow
> > #
> > # $HEADER$
> > #
> > # This is the default hostfile for Open MPI.  Notice that it does not
> > # contain any hosts (not even localhost).  This file should only
> > # contain hosts if a system administrator wants users to always have
> > # the same set of default hosts, and is not using a batch scheduler
> > # (such as SLURM, PBS, etc.).
> > #
> > # Note that this file is *not* used when running in "managed"
> > # environments (e.g., running in a job under a job scheduler, such as
> > # SLURM or PBS / Torque).
> > #
> > # If you are primarily interested in running Open MPI on one node, you
> > # should *not* simply list "localhost" in here (contrary to prior MPI
> > # implementations, such as LAM/MPI).  A localhost-only node list is
> > # created by the RAS component named "localhost" if no other RAS
> > # components were able to find any hosts to run on (this behavior can
> > # be disabled by excluding the localhost RAS component by specifying
> > # the value "^localhost" [without the quotes] to the "ras" MCA
> > # parameter).
> >
> > (15:53:52) [zorg]:~>
> >
> > Thanks!
> >
> > Eric
> >
> >
> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
> >> What is in your hostfile?
> >>
> >>
> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
> >>>
> >>> Hi,
> >>>
> >>> just testing the 3.x branch... I launch:
> >>>
> >>> mpirun -n 8 echo "hello"
> >>>
> >>> and I get:
> >>>
> >>> --------------------------------------------------------------------------
> >>> There are not enough slots available in the system to satisfy the 8 slots
> >>> that were requested by the application:
> >>> echo
> >>>
> >>> Either request fewer slots for your application, or make more slots 
> >>> available
> >>> for use.
> >>> --------------------------------------------------------------------------
> >>>
> >>> I have to oversubscribe, so what do I have to do to bypass this 
> >>> "limitation"?
> >>>
> >>> Thanks,
> >>>
> >>> Eric
> >>>
> >>> configure log:
> >>>
> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
> >>>
> >>>
> >>> here is the complete message:
> >>>
> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
> >>> NULL
> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 
> >>> 810220270
> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
> >>> [zorg:30036] [[49136,0],0] using default hostfile 
> >>> /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
> >>> --------------------------------------------------------------------------
> >>> There are not enough slots available in the system to satisfy the 8 slots
> >>> that were requested by the application:
> >>> echo
> >>>
> >>> Either request fewer slots for your application, or make more slots 
> >>> available
> >>> for use.
> >>> --------------------------------------------------------------------------
> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
> >>>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
