If it helps, I believe I added the ability to just use ‘:*’ to indicate “take 
them all”, so you don’t have to remember the slot count.
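
For example - a hedged sketch from memory, reusing the host names that appear 
later in this thread and a placeholder executable:

    mpirun -np 16 --host dancer00:*,dancer01:* ./a.out

should grab every available slot on both nodes without you having to spell out 
the counts.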

> On Apr 25, 2017, at 2:13 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> Thanks Ralph,
> 
> Indeed, if I add :8 I get back the expected behavior. I can cope with this (I 
> don't usually restrict my runs to a subset of the nodes).
> 
>   George.
> 
> 
> On Tue, Apr 25, 2017 at 4:53 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> I suspect it read the file just fine - what you are seeing in the output 
> reflects the community’s design decision that only one slot is allocated each 
> time a node is listed in -host. This is why the :N modifier was added, so you 
> can specify the number of slots to use instead of writing the host name N times.
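> 
> For example, the failing command from George’s message below would become 
> (placeholder executable aside):
> 
>     mpirun -np 4 --host dancer00:8,dancer01:8 ./a.out
> 
> which makes 8 slots available on each host instead of the single slot you get 
> from a bare “--host dancer00,dancer01”.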
> 
> If this isn’t what you feel it should do, then please look at the files in 
> orte/util/dash_host and feel free to propose a modification to the behavior. 
> I personally am not bound to any particular answer, but I really don’t have 
> time to address it again.
> 
> 
> 
>> On Apr 25, 2017, at 1:35 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>> Just to be clear, the hostfile contains the correct info:
>> 
>> dancer00 slots=8
>> dancer01 slots=8
>> 
>> The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.
>> 
>>   George.
>> 
>> 
>> 
>> On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> I can confirm a similar issue in a more managed environment. I have a 
>> hostfile that has worked for the last few years and that spans a small 
>> cluster (30 nodes of 8 cores each).
>> 
>> Trying to spawn processes across P nodes fails whenever the number of 
>> processes is larger than P, despite the fact that there are more than enough 
>> resources and that this information is provided via the hostfile.
>> 
>> George.
>> 
>> 
>> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
>> 
>> [dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
>> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [simulator]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [slurm]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [loadleveler]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [tm]
>> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) No component selected!
>> 
>> ======================   ALLOCATED NODES   ======================
>>      dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>> 
>> ======================   ALLOCATED NODES   ======================
>>      dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>      dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   startup
>> 
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --------------------------------------------------------------------------
>> 
>> 
>> 
>> 
>> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> Okay - so effectively you have no hostfile, and no allocation. So this is 
>> running just on the one node where mpirun exists?
>> 
>> Add “-mca ras_base_verbose 10 --display-allocation” to your command line and 
>> let’s see what it finds.
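>> 
>> For example, assuming the failing case is still the plain “mpirun -n 8 echo 
>> hello” from your first message:
>> 
>>     mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo hello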
>> 
>> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>> >
>> > Hi,
>> >
>> > the host file has been constructed automatically by the 
>> > configuration+installation process and seems to contain only comments and 
>> > a blank line:
>> >
>> > (15:53:50) [zorg]:~> cat 
>> > /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> > #
>> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> > #                         University Research and Technology
>> > #                         Corporation.  All rights reserved.
>> > # Copyright (c) 2004-2005 The University of Tennessee and The University
>> > #                         of Tennessee Research Foundation.  All rights
>> > #                         reserved.
>> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
>> > #                         University of Stuttgart.  All rights reserved.
>> > # Copyright (c) 2004-2005 The Regents of the University of California.
>> > #                         All rights reserved.
>> > # $COPYRIGHT$
>> > #
>> > # Additional copyrights may follow
>> > #
>> > # $HEADER$
>> > #
>> > # This is the default hostfile for Open MPI.  Notice that it does not
>> > # contain any hosts (not even localhost).  This file should only
>> > # contain hosts if a system administrator wants users to always have
>> > # the same set of default hosts, and is not using a batch scheduler
>> > # (such as SLURM, PBS, etc.).
>> > #
>> > # Note that this file is *not* used when running in "managed"
>> > # environments (e.g., running in a job under a job scheduler, such as
>> > # SLURM or PBS / Torque).
>> > #
>> > # If you are primarily interested in running Open MPI on one node, you
>> > # should *not* simply list "localhost" in here (contrary to prior MPI
>> > # implementations, such as LAM/MPI).  A localhost-only node list is
>> > # created by the RAS component named "localhost" if no other RAS
>> > # components were able to find any hosts to run on (this behavior can
>> > # be disabled by excluding the localhost RAS component by specifying
>> > # the value "^localhost" [without the quotes] to the "ras" MCA
>> > # parameter).
>> >
>> > (15:53:52) [zorg]:~>
>> >
>> > Thanks!
>> >
>> > Eric
>> >
>> >
>> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
>> >> What is in your hostfile?
>> >>
>> >>
>> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> just testing the 3.x branch... I launch:
>> >>>
>> >>> mpirun -n 8 echo "hello"
>> >>>
>> >>> and I get:
>> >>>
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>> echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>>
>> >>> I have to oversubscribe, so what do I have to do to bypass this 
>> >>> "limitation"?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Eric
>> >>>
>> >>> configure log:
>> >>>
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
>> >>>
>> >>>
>> >>> here is the complete message:
>> >>>
>> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
>> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
>> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
>> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
>> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
>> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
>> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
>> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>> echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
>> >>>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
