I can confirm a similar issue in a more managed environment. I have a hostfile that has worked for the last few years and that spans a small cluster (30 nodes of 8 cores each), each node listed with an explicit slot count, roughly as in the sketch below.
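For reference, the hostfile is of the usual Open MPI form, one node per line with a slot count — something like this (illustrative only, not the actual file, which I have not pasted here):

  # illustrative hostfile: one line per node, 8 slots each
  dancer00 slots=8
  dancer01 slots=8
  ...
  dancer31 slots=8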
Trying to spawn any number of processes across P nodes fails if the number of processes is larger than P, despite the fact that there are more than enough resources and that this information is provided via the hostfile.

  George.

$ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
[dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
[dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
[dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
[dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
[dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
[dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
[dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
[dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
[dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [simulator]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [slurm]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [loadleveler]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [tm]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) No component selected!
======================   ALLOCATED NODES   ======================
dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
======================   ALLOCATED NODES   ======================
dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  startup

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Okay - so effectively you have no hostfile, and no allocation. So this is
> running just on the one node where mpirun exists?
>
> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
> let’s see what it found
>
>
> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <Eric.Chamberland@giref.ulaval.ca> wrote:
> >
> > Hi,
> >
> > the host file has been constructed automatically by the
> > configuration+installation process and seems to contain only comments and a
> > blank line:
> >
> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
> > #
> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
> > #                         University Research and Technology
> > #                         Corporation. All rights reserved.
> > # Copyright (c) 2004-2005 The University of Tennessee and The University
> > #                         of Tennessee Research Foundation. All rights
> > #                         reserved.
> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
> > #                         University of Stuttgart. All rights reserved.
> > # Copyright (c) 2004-2005 The Regents of the University of California.
> > #                         All rights reserved.
> > # $COPYRIGHT$
> > #
> > # Additional copyrights may follow
> > #
> > # $HEADER$
> > #
> > # This is the default hostfile for Open MPI. Notice that it does not
> > # contain any hosts (not even localhost). This file should only
> > # contain hosts if a system administrator wants users to always have
> > # the same set of default hosts, and is not using a batch scheduler
> > # (such as SLURM, PBS, etc.).
> > #
> > # Note that this file is *not* used when running in "managed"
> > # environments (e.g., running in a job under a job scheduler, such as
> > # SLURM or PBS / Torque).
> > #
> > # If you are primarily interested in running Open MPI on one node, you
> > # should *not* simply list "localhost" in here (contrary to prior MPI
> > # implementations, such as LAM/MPI).
> > # A localhost-only node list is
> > # created by the RAS component named "localhost" if no other RAS
> > # components were able to find any hosts to run on (this behavior can
> > # be disabled by excluding the localhost RAS component by specifying
> > # the value "^localhost" [without the quotes] to the "ras" MCA
> > # parameter).
> >
> > (15:53:52) [zorg]:~>
> >
> > Thanks!
> >
> > Eric
> >
> >
> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
> >> What is in your hostfile?
> >>
> >>
> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
> >>>
> >>> Hi,
> >>>
> >>> just testing the 3.x branch... I launch:
> >>>
> >>> mpirun -n 8 echo "hello"
> >>>
> >>> and I get:
> >>>
> >>> --------------------------------------------------------------------------
> >>> There are not enough slots available in the system to satisfy the 8 slots
> >>> that were requested by the application:
> >>>   echo
> >>>
> >>> Either request fewer slots for your application, or make more slots available
> >>> for use.
> >>> --------------------------------------------------------------------------
> >>>
> >>> I have to oversubscribe, so what do I have to do to bypass this "limitation"?
> >>>
> >>> Thanks,
> >>>
> >>> Eric
> >>>
> >>> configure log:
> >>>
> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
> >>>
> >>>
> >>> here is the complete message:
> >>>
> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
> >>> --------------------------------------------------------------------------
> >>> There are not enough slots available in the system to satisfy the 8 slots
> >>> that were requested by the application:
> >>>   echo
> >>>
> >>> Either request fewer slots for your application, or make more slots available
> >>> for use.
> >>> --------------------------------------------------------------------------
> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
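For completeness, regarding the "I have to oversubscribe, so what do I have to do to bypass this limitation?" question quoted above, the usual ways around the slot limit look roughly like the lines below. This is only a sketch: "./my_app" and "my_hostfile" are placeholders, and the exact option spellings should be double-checked against mpirun's help for the installed version.

  # declare slot counts explicitly when naming hosts on the command line
  # (as the allocation above shows, a bare --host entry is counted as a single slot)
  mpirun -np 4 --host dancer00:8,dancer01:8 ./my_app

  # or point mpirun at a hostfile that declares slots per node
  mpirun -np 8 --hostfile my_hostfile ./my_app

  # or explicitly allow oversubscription of the available slots
  mpirun -np 8 --oversubscribe echo "hello"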