Thanks Ralph. Indeed, if I add :8 I get back the expected behavior. I can cope with this (I don't usually restrict my runs to a subset of the nodes).
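For reference, the :N modifier discussed below looks roughly like this on the command line (a minimal sketch using the node names from my hostfile; "./my_app" is just a placeholder):

  # declare 8 slots per host in -host instead of listing each host name 8 times
  mpirun -np 4 --host dancer00:8,dancer01:8 ./my_app

which should behave the same as relying on the hostfile entries "dancer00 slots=8" and "dancer01 slots=8".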
George.

On Tue, Apr 25, 2017 at 4:53 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> I suspect it read the file just fine - what you are seeing in the output
> is a reflection of the community’s design decision that only one slot would
> be allocated for each time a node is listed in -host. This is why they
> added the :N modifier, so you can specify the number of slots to use in lieu
> of writing the host name N times.
>
> If this isn’t what you feel it should do, then please look at the files in
> orte/util/dash_host and feel free to propose a modification to the
> behavior. I personally am not bound to any particular answer, but I really
> don’t have time to address it again.
>
>
> On Apr 25, 2017, at 1:35 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Just to be clear, the hostfile contains the correct info:
>
> dancer00 slots=8
> dancer01 slots=8
>
> The output regarding the two nodes (dancer00 and dancer01) is clearly wrong.
>
> George.
>
>
> On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> I can confirm a similar issue in a more managed environment. I have a
>> hostfile that has worked for the last few years and that spans a small
>> cluster (30 nodes of 8 cores each).
>>
>> Trying to spawn any number of processes across P nodes fails if the
>> number of processes is larger than P (despite the fact that there are
>> more than enough resources, and that this information is provided via
>> the hostfile).
>>
>> George.
>>
>>
>> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
>>
>> [dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
>> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
>> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
>> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
>> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [simulator]
>> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [slurm]
>> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [loadleveler]
>> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [tm]
>> [dancer.icl.utk.edu:13457] mca:base:select:( ras) No component selected!
>>
>> ====================== ALLOCATED NODES ======================
>> dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>>
>> ====================== ALLOCATED NODES ======================
>> dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =================================================================
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   startup
>>
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --------------------------------------------------------------------------
>>
>>
>> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>
>>> Okay - so effectively you have no hostfile, and no allocation. So this
>>> is running just on the one node where mpirun exists?
>>>
>>> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
>>> let’s see what it found.
>>>
>>> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>> >
>>> > Hi,
>>> >
>>> > the host file has been constructed automatically by the
>>> > configuration+installation process and seems to contain only comments
>>> > and a blank line:
>>> >
>>> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>>> > #
>>> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>>> > #                         University Research and Technology
>>> > #                         Corporation. All rights reserved.
>>> > # Copyright (c) 2004-2005 The University of Tennessee and The University
>>> > #                         of Tennessee Research Foundation. All rights
>>> > #                         reserved.
>>> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
>>> > #                         University of Stuttgart. All rights reserved.
>>> > # Copyright (c) 2004-2005 The Regents of the University of California.
>>> > #                         All rights reserved.
>>> > # $COPYRIGHT$
>>> > #
>>> > # Additional copyrights may follow
>>> > #
>>> > # $HEADER$
>>> > #
>>> > # This is the default hostfile for Open MPI. Notice that it does not
>>> > # contain any hosts (not even localhost).
>>> > # This file should only contain hosts if a system administrator
>>> > # wants users to always have the same set of default hosts, and is
>>> > # not using a batch scheduler (such as SLURM, PBS, etc.).
>>> > #
>>> > # Note that this file is *not* used when running in "managed"
>>> > # environments (e.g., running in a job under a job scheduler, such as
>>> > # SLURM or PBS / Torque).
>>> > #
>>> > # If you are primarily interested in running Open MPI on one node, you
>>> > # should *not* simply list "localhost" in here (contrary to prior MPI
>>> > # implementations, such as LAM/MPI). A localhost-only node list is
>>> > # created by the RAS component named "localhost" if no other RAS
>>> > # components were able to find any hosts to run on (this behavior can
>>> > # be disabled by excluding the localhost RAS component by specifying
>>> > # the value "^localhost" [without the quotes] to the "ras" MCA
>>> > # parameter).
>>> >
>>> > (15:53:52) [zorg]:~>
>>> >
>>> > Thanks!
>>> >
>>> > Eric
>>> >
>>> >
>>> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
>>> >> What is in your hostfile?
>>> >>
>>> >>
>>> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> just testing the 3.x branch... I launch:
>>> >>>
>>> >>> mpirun -n 8 echo "hello"
>>> >>>
>>> >>> and I get:
>>> >>>
>>> >>> --------------------------------------------------------------------------
>>> >>> There are not enough slots available in the system to satisfy the 8 slots
>>> >>> that were requested by the application:
>>> >>>   echo
>>> >>>
>>> >>> Either request fewer slots for your application, or make more slots available
>>> >>> for use.
>>> >>> --------------------------------------------------------------------------
>>> >>>
>>> >>> I have to oversubscribe, so what do I have to do to bypass this "limitation"?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Eric
>>> >>>
>>> >>> configure log:
>>> >>>
>>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
>>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
>>> >>>
>>> >>> here is the complete message:
>>> >>>
>>> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
>>> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
>>> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
>>> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
>>> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
>>> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
>>> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
>>> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
>>> >>> --------------------------------------------------------------------------
>>> >>> There are not enough slots available in the system to satisfy the 8 slots
>>> >>> that were requested by the application:
>>> >>>   echo
>>> >>>
>>> >>> Either request fewer slots for your application, or make more slots available
>>> >>> for use.
>>> >>> --------------------------------------------------------------------------
>>> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
>>> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
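A side note on the oversubscription question quoted above: mpirun has standard options for allowing more processes than the detected or declared slots. The lines below are a generic sketch rather than something taken from this thread, so please double-check them against the man page of your installed mpirun:

  # explicitly allow more processes than available slots
  mpirun --oversubscribe -n 8 echo "hello"

  # or declare the desired slot count for the local host via the :N modifier
  mpirun -n 8 --host localhost:8 echo "hello"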