Ok, this is a good / consistent output. That being said, I don't grok what is happening here: it says it finds 2 slots, but then it tells you it doesn't have enough slots.
Let me dig deeper and get back to you...

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: timesir <mrlong...@gmail.com>
Sent: Friday, November 18, 2022 10:20 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org <users@lists.open-mpi.org>; gilles.gouaillar...@gmail.com <gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1

(py3.9) ➜ /share ompi_info --version
Open MPI v5.0.0rc9
https://www.open-mpi.org/community/help/

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime
[computer01:53933] mca: base: component_find: searching NULL for plm components
[computer01:53933] mca: base: find_dyn_components: checking NULL for plm components
[computer01:53933] pmix:mca: base: components_register: registering framework plm components
[computer01:53933] pmix:mca: base: components_register: found loaded component slurm
[computer01:53933] pmix:mca: base: components_register: component slurm register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component ssh
[computer01:53933] pmix:mca: base: components_register: component ssh register function successful
[computer01:53933] mca: base: components_open: opening plm components
[computer01:53933] mca: base: components_open: found loaded component slurm
[computer01:53933] mca: base: components_open: component slurm open function successful
[computer01:53933] mca: base: components_open: found loaded component ssh
[computer01:53933] mca: base: components_open: component ssh open function successful
[computer01:53933] mca:base:select: Auto-selecting plm components
[computer01:53933] mca:base:select:( plm) Querying component [slurm]
[computer01:53933] mca:base:select:( plm) Querying component [ssh]
[computer01:53933] mca:base:select:( plm) Query of component [ssh] set priority to 10
[computer01:53933] mca:base:select:( plm) Selected component [ssh]
[computer01:53933] mca: base: close: component slurm closed
[computer01:53933] mca: base: close: unloading component slurm
[computer01:53933] mca: base: component_find: searching NULL for ras components
[computer01:53933] mca: base: find_dyn_components: checking NULL for ras components
[computer01:53933] pmix:mca: base: components_register: registering framework ras components
[computer01:53933] pmix:mca: base: components_register: found loaded component simulator
[computer01:53933] pmix:mca: base: components_register: component simulator register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component pbs
[computer01:53933] pmix:mca: base: components_register: component pbs register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component slurm
[computer01:53933] pmix:mca: base: components_register: component slurm register function successful
[computer01:53933] mca: base: components_open: opening ras components
[computer01:53933] mca: base: components_open: found loaded component simulator
[computer01:53933] mca: base: components_open: found loaded component pbs
[computer01:53933] mca: base: components_open: component pbs open function successful
[computer01:53933] mca: base: components_open: found loaded component slurm
[computer01:53933] mca: base: components_open: component slurm open function successful
[computer01:53933] mca:base:select: Auto-selecting ras components
[computer01:53933] mca:base:select:( ras) Querying component [simulator]
[computer01:53933] mca:base:select:( ras) Querying component [pbs]
[computer01:53933] mca:base:select:( ras) Querying component [slurm]
[computer01:53933] mca:base:select:( ras) No component selected!
[computer01:53933] mca: base: component_find: searching NULL for rmaps components
[computer01:53933] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:53933] pmix:mca: base: components_register: registering framework rmaps components
[computer01:53933] pmix:mca: base: components_register: found loaded component ppr
[computer01:53933] pmix:mca: base: components_register: component ppr register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component rank_file
[computer01:53933] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:53933] pmix:mca: base: components_register: found loaded component round_robin
[computer01:53933] pmix:mca: base: components_register: component round_robin register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component seq
[computer01:53933] pmix:mca: base: components_register: component seq register function successful
[computer01:53933] mca: base: components_open: opening rmaps components
[computer01:53933] mca: base: components_open: found loaded component ppr
[computer01:53933] mca: base: components_open: component ppr open function successful
[computer01:53933] mca: base: components_open: found loaded component rank_file
[computer01:53933] mca: base: components_open: found loaded component round_robin
[computer01:53933] mca: base: components_open: component round_robin open function successful
[computer01:53933] mca: base: components_open: found loaded component seq
[computer01:53933] mca: base: components_open: component seq open function successful
[computer01:53933] mca:rmaps:select: checking available component ppr
[computer01:53933] mca:rmaps:select: Querying component [ppr]
[computer01:53933] mca:rmaps:select: checking available component rank_file
[computer01:53933] mca:rmaps:select: Querying component [rank_file]
[computer01:53933] mca:rmaps:select: checking available component round_robin
[computer01:53933] mca:rmaps:select: Querying component [round_robin]
[computer01:53933] mca:rmaps:select: checking available component seq
[computer01:53933] mca:rmaps:select: Querying component [seq]
[computer01:53933] [prterun-computer01-53933@0,0]: Final mapper priorities
[computer01:53933]   Mapper: ppr Priority: 90
[computer01:53933]   Mapper: seq Priority: 60
[computer01:53933]   Mapper: round_robin Priority: 10
[computer01:53933]   Mapper: rank_file Priority: 0

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================
[computer01:53933] [prterun-computer01-53933@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-53933@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-53933@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:42567:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-53933@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:42567:24,16,24,24,24,24"
[computer01:53933] ALIASES FOR NODE computer02 (omputer02)
[computer01:53933]   ALIAS: 192.168.60.203
[computer01:53933]   ALIAS: computer02
[computer01:53933]   ALIAS: 172.17.180.203
[computer01:53933]   ALIAS: 172.168.10.23
[computer01:53933]   ALIAS: 172.168.10.143

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================

[computer01:53933] mca:rmaps: mapping job prterun-computer01-53933@1
[computer01:53933] mca:rmaps: setting mapping policies for job prterun-computer01-53933@1 inherit TRUE hwtcpus FALSE
[computer01:53933] mca:rmaps[358] mapping not given - using bycore
[computer01:53933] setdefaultbinding[365] binding not given - using bycore
[computer01:53933] mca:rmaps:ppr: job prterun-computer01-53933@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:53933] mca:rmaps:seq: job prterun-computer01-53933@1 not using seq mapper
[computer01:53933] mca:rmaps:rr: mapping job prterun-computer01-53933@1
[computer01:53933] AVAILABLE NODES FOR MAPPING:
[computer01:53933]   node: computer01 daemon: 0 slots_available: 1
[computer01:53933] mca:rmaps:rr: mapping by Core for job prterun-computer01-53933@1 slots 1 num_procs 2
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  uptime

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
[computer01:53933] mca: base: close: component ssh closed
[computer01:53933] mca: base: close: unloading component ssh
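For reference, the two remedies that help text describes can be written against the exact hostfile used in this thread. This is only a sketch of the syntax the message refers to; it does not explain why the second host is being ignored, which is the actual problem being debugged here:

    # option 1: advertise more slots per node in the hostfile ("slots=N" clause)
    192.168.180.48 slots=2
    192.168.60.203 slots=2

    # option 2: keep slots=1 per node and explicitly allow oversubscription for this run
    mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE uptime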
On 2022/11/18 22:48, Jeff Squyres (jsquyres) wrote:

Thanks for the output. I'm seeing inconsistent output between your different runs, however. For example, one of your outputs seems to ignore the hostfile and only shows slots on the local host, but another output shows 2 hosts with 1 slot each. But I don't know what was in the hosts file for that run. Also, I see a weird "state=UNKNOWN" in the output for the 2nd node. Not sure what that means; we might need to track that down.

Can you send the output from these commands, in a single session (I added another MCA verbose parameter in here, too):

ompi_info --version
cat hosts
mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime

Make sure to use "dash dash" before the CLI options; ensure that copy-and-paste from email doesn't replace the dashes with non-ASCII dashes, such as an "em dash" or somesuch.

--
Jeff Squyres
jsquy...@cisco.com
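A quick way to check for that copy-and-paste problem is to dump the bytes of the pasted option. This is an illustrative one-liner, not something from the thread:

    # ASCII '-' prints as '-'; an em dash pasted from email shows up as a multi-byte sequence (342 200 224)
    printf '%s' '--mca' | od -An -c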
________________________________
From: timesir <mrlong...@gmail.com>
Sent: Friday, November 18, 2022 8:59 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org <users@lists.open-mpi.org>; gilles.gouaillar...@gmail.com <gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1

The ompi_info -all output for both machines is attached.

On 2022/11/18 21:54, Jeff Squyres (jsquyres) wrote:

I see 2 config.log files -- can you also send the other information requested on that page? I.e., the version you're using (I think you said in a prior email that it was 5.0rc9, but I'm not 100% sure), and the output from ompi_info --all.

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: timesir <mrlong...@gmail.com>
Sent: Friday, November 18, 2022 8:49 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org <users@lists.open-mpi.org>; gilles.gouaillar...@gmail.com <gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1

The information you need is attached.

On 2022/11/18 21:08, Jeff Squyres (jsquyres) wrote:

Yes, Gilles responded within a few hours: https://www.mail-archive.com/users@lists.open-mpi.org/msg35057.html

Looking closer, we should still be seeing more output compared to what you posted. It's almost like you have a busted Open MPI installation -- perhaps it's missing the "hostfile" component altogether. How did you install Open MPI?

Can you send the information from "Run time problems" on https://docs.open-mpi.org/en/v5.0.x/getting-help.html#for-run-time-problems ?

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: timesir <mrlong...@gmail.com>
Sent: Monday, November 14, 2022 11:32 PM
To: users@lists.open-mpi.org <users@lists.open-mpi.org>; Jeff Squyres (jsquyres) <jsquy...@cisco.com>; gilles.gouaillar...@gmail.com <gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun
[computer01:39342] mca: base: component_find: searching NULL for ras components
[computer01:39342] mca: base: find_dyn_components: checking NULL for ras components
[computer01:39342] pmix:mca: base: components_register: registering framework ras components
[computer01:39342] pmix:mca: base: components_register: found loaded component simulator
[computer01:39342] pmix:mca: base: components_register: component simulator register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component pbs
[computer01:39342] pmix:mca: base: components_register: component pbs register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component slurm
[computer01:39342] pmix:mca: base: components_register: component slurm register function successful
[computer01:39342] mca: base: components_open: opening ras components
[computer01:39342] mca: base: components_open: found loaded component simulator
[computer01:39342] mca: base: components_open: found loaded component pbs
[computer01:39342] mca: base: components_open: component pbs open function successful
[computer01:39342] mca: base: components_open: found loaded component slurm
[computer01:39342] mca: base: components_open: component slurm open function successful
[computer01:39342] mca:base:select: Auto-selecting ras components
[computer01:39342] mca:base:select:( ras) Querying component [simulator]
[computer01:39342] mca:base:select:( ras) Querying component [pbs]
[computer01:39342] mca:base:select:( ras) Querying component [slurm]
[computer01:39342] mca:base:select:( ras) No component selected!
[computer01:39342] mca: base: component_find: searching NULL for rmaps components
[computer01:39342] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:39342] pmix:mca: base: components_register: registering framework rmaps components
[computer01:39342] pmix:mca: base: components_register: found loaded component ppr
[computer01:39342] pmix:mca: base: components_register: component ppr register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component rank_file
[computer01:39342] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:39342] pmix:mca: base: components_register: found loaded component round_robin
[computer01:39342] pmix:mca: base: components_register: component round_robin register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component seq
[computer01:39342] pmix:mca: base: components_register: component seq register function successful
[computer01:39342] mca: base: components_open: opening rmaps components
[computer01:39342] mca: base: components_open: found loaded component ppr
[computer01:39342] mca: base: components_open: component ppr open function successful
[computer01:39342] mca: base: components_open: found loaded component rank_file
[computer01:39342] mca: base: components_open: found loaded component round_robin
[computer01:39342] mca: base: components_open: component round_robin open function successful
[computer01:39342] mca: base: components_open: found loaded component seq
[computer01:39342] mca: base: components_open: component seq open function successful
[computer01:39342] mca:rmaps:select: checking available component ppr
[computer01:39342] mca:rmaps:select: Querying component [ppr]
[computer01:39342] mca:rmaps:select: checking available component rank_file
[computer01:39342] mca:rmaps:select: Querying component [rank_file]
[computer01:39342] mca:rmaps:select: checking available component round_robin
[computer01:39342] mca:rmaps:select: Querying component [round_robin]
[computer01:39342] mca:rmaps:select: checking available component seq
[computer01:39342] mca:rmaps:select: Querying component [seq]
[computer01:39342] [prterun-computer01-39342@0,0]: Final mapper priorities
[computer01:39342]   Mapper: ppr Priority: 90
[computer01:39342]   Mapper: seq Priority: 60
[computer01:39342]   Mapper: round_robin Priority: 10
[computer01:39342]   Mapper: rank_file Priority: 0

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.60.203,hepslustretest03.ihep.ac.cn,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================

[computer01:39342] mca:rmaps: mapping job prterun-computer01-39342@1
[computer01:39342] mca:rmaps: setting mapping policies for job prterun-computer01-39342@1 inherit TRUE hwtcpus FALSE
[computer01:39342] mca:rmaps[358] mapping not given - using bycore
[computer01:39342] setdefaultbinding[365] binding not given - using bycore
[computer01:39342] mca:rmaps:ppr: job prterun-computer01-39342@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:39342] mca:rmaps:seq: job prterun-computer01-39342@1 not using seq mapper
[computer01:39342] mca:rmaps:rr: mapping job prterun-computer01-39342@1
[computer01:39342] AVAILABLE NODES FOR MAPPING:
[computer01:39342]   node: computer01 daemon: 0 slots_available: 1
[computer01:39342] mca:rmaps:rr: mapping by Core for job prterun-computer01-39342@1 slots 1 num_procs 2
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  which

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
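In both runs above, only computer01 appears under "AVAILABLE NODES FOR MAPPING" even though two nodes are listed as allocated, and PRRTE launches its remote daemon over /usr/bin/ssh with PRTE_PREFIX=/usr/local/openmpi (per the plm:ssh template argv shown earlier in this thread). A couple of hedged sanity checks along those lines; the prefix is taken from that template argv and should be adjusted if the second host uses a different install path:

    # can the head node reach the second host without a password prompt?
    ssh 192.168.60.203 true && echo "ssh ok"
    # does the second host have the same Open MPI 5.0.0rc9 install in the same place?
    ssh 192.168.60.203 /usr/local/openmpi/bin/ompi_info --version
    ssh 192.168.60.203 ls /usr/local/openmpi/bin/prted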
On 2022/11/15 02:04, users-requ...@lists.open-mpi.org wrote:

------------------------------
Message: 1
Date: Mon, 14 Nov 2022 17:04:24 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

Yes, somehow I'm not seeing all the output that I expect to see. Can you ensure that if you're copy-and-pasting from the email, that it's actually using "dash dash" in front of "mca" and "machinefile" (vs. a copy-and-pasted "em dash")?

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet via users <users@lists.open-mpi.org>
Sent: Sunday, November 13, 2022 9:18 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

There is a typo in your command line. You should use --mca (minus minus) instead of -mca. Also, you can try --machinefile instead of -machinefile.

Cheers,

Gilles

There are not enough slots available in the system to satisfy the 2 slots that were requested by the application: ?mca

On Mon, Nov 14, 2022 at 11:04 AM timesir via users <users@lists.open-mpi.org> wrote:

(py3.9) ➜ /share mpirun -n 2 -machinefile hosts ?mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun
[computer01:04570] mca: base: component_find: searching NULL for ras components
[computer01:04570] mca: base: find_dyn_components: checking NULL for ras components
[computer01:04570] pmix:mca: base: components_register: registering framework ras components
[computer01:04570] pmix:mca: base: components_register: found loaded component simulator
[computer01:04570] pmix:mca: base: components_register: component simulator register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded component pbs
[computer01:04570] pmix:mca: base: components_register: component pbs register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded component slurm
[computer01:04570] pmix:mca: base: components_register: component slurm register function successful
[computer01:04570] mca: base: components_open: opening ras components
[computer01:04570] mca: base: components_open: found loaded component simulator
[computer01:04570] mca: base: components_open: found loaded component pbs
[computer01:04570] mca: base: components_open: component pbs open function successful
[computer01:04570] mca: base: components_open: found loaded component slurm
[computer01:04570] mca: base: components_open: component slurm open function successful
[computer01:04570] mca:base:select: Auto-selecting ras components
[computer01:04570] mca:base:select:( ras) Querying component [simulator]
[computer01:04570] mca:base:select:( ras) Querying component [pbs]
[computer01:04570] mca:base:select:( ras) Querying component [slurm]
[computer01:04570] mca:base:select:( ras) No component selected!
======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.60.203,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  ?mca

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

On 2022/11/13 23:42, Jeff Squyres (jsquyres) wrote:

Interesting. It says:

[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117]   node: computer01 daemon: 0 slots_available: 1

This is why it tells you you're out of slots: you're asking for 2, but it only found 1. This means it's not seeing your hostfile somehow.

I should have asked you to run with 2 variables last time -- can you re-run with "mpirun --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ..."? Turning on the RAS verbosity should show us what the hostfile component is doing.

--
Jeff Squyres
jsquy...@cisco.com
________________________________
From: mrlong <mrlong...@gmail.com>
Sent: Sunday, November 13, 2022 3:13 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

(py3.9) ➜ /share mpirun --version
mpirun (Open MPI) 5.0.0rc9
Report bugs to https://www.open-mpi.org/community/help/

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ /share mpirun -n 2 -machinefile hosts ?mca rmaps_base_verbose 100 which mpirun
[computer01:106117] mca: base: component_find: searching NULL for rmaps components
[computer01:106117] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:106117] pmix:mca: base: components_register: registering framework rmaps components
[computer01:106117] pmix:mca: base: components_register: found loaded component ppr
[computer01:106117] pmix:mca: base: components_register: component ppr register function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component rank_file
[computer01:106117] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:106117] pmix:mca: base: components_register: found loaded component round_robin
[computer01:106117] pmix:mca: base: components_register: component round_robin register function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component seq
[computer01:106117] pmix:mca: base: components_register: component seq register function successful
[computer01:106117] mca: base: components_open: opening rmaps components
[computer01:106117] mca: base: components_open: found loaded component ppr
[computer01:106117] mca: base: components_open: component ppr open function successful
[computer01:106117] mca: base: components_open: found loaded component rank_file
[computer01:106117] mca: base: components_open: found loaded component round_robin
[computer01:106117] mca: base: components_open: component round_robin open function successful
[computer01:106117] mca: base: components_open: found loaded component seq
[computer01:106117] mca: base: components_open: component seq open function successful
[computer01:106117] mca:rmaps:select: checking available component ppr
[computer01:106117] mca:rmaps:select: Querying component [ppr]
[computer01:106117] mca:rmaps:select: checking available component rank_file
[computer01:106117] mca:rmaps:select: Querying component [rank_file]
[computer01:106117] mca:rmaps:select: checking available component round_robin
[computer01:106117] mca:rmaps:select: Querying component [round_robin]
[computer01:106117] mca:rmaps:select: checking available component seq
[computer01:106117] mca:rmaps:select: Querying component [seq]
[computer01:106117] [prterun-computer01-106117@0,0]: Final mapper priorities
[computer01:106117]   Mapper: ppr Priority: 90
[computer01:106117]   Mapper: seq Priority: 60
[computer01:106117]   Mapper: round_robin Priority: 10
[computer01:106117]   Mapper: rank_file Priority: 0
[computer01:106117] mca:rmaps: mapping job prterun-computer01-106117@1
[computer01:106117] mca:rmaps: setting mapping policies for job prterun-computer01-106117@1 inherit TRUE hwtcpus FALSE
[computer01:106117] mca:rmaps[358] mapping not given - using bycore
[computer01:106117] setdefaultbinding[365] binding not given - using bycore
[computer01:106117] mca:rmaps:ppr: job prterun-computer01-106117@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:106117] mca:rmaps:seq: job prterun-computer01-106117@1 not using seq mapper
[computer01:106117] mca:rmaps:rr: mapping job prterun-computer01-106117@1
[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117]   node: computer01 daemon: 0 slots_available: 1
[computer01:106117] mca:rmaps:rr: mapping by Core for job prterun-computer01-106117@1 slots 1 num_procs 2
________________________________
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  which

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
________________________________

On 2022/11/8 05:46, Jeff Squyres (jsquyres) wrote:

In the future, can you please just mail one of the lists? This particular question is probably more of a users type of question (since we're not talking about the internals of Open MPI itself), so I'll reply just on the users list.

For what it's worth, I'm unable to replicate your error:

$ mpirun --version
mpirun (Open MPI) 5.0.0rc9
Report bugs to https://www.open-mpi.org/community/help/

$ cat hostfile
mpi002 slots=1
mpi005 slots=1

$ mpirun -n 2 --machinefile hostfile hostname
mpi002
mpi005

Can you try running with "--mca rmaps_base_verbose 100" so that we can get some debugging output and see why the slots aren't working for you? Show the full output, like I did above (e.g., cat the hostfile, and then mpirun with the MCA param and all the output). Thanks!

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: devel <devel-boun...@lists.open-mpi.org> on behalf of mrlong via devel <de...@lists.open-mpi.org>
Sent: Monday, November 7, 2022 3:37 AM
To: de...@lists.open-mpi.org <de...@lists.open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Cc: mrlong <mrlong...@gmail.com>
Subject: [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

Two machines, each with 64 cores. The contents of the hosts file are:

192.168.180.48 slots=1
192.168.60.203 slots=1

Why do you get the following error when running with openmpi 5.0.0rc9?
(py3.9) [user@machine01 share]$ mpirun -n 2 --machinefile hosts hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  hostname

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

------------------------------
Message: 2
Date: Mon, 14 Nov 2022 18:04:06 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: arun c <arun.edar...@gmail.com>
Subject: Re: [OMPI users] Tracing of openmpi internal functions

Open MPI uses plug-in modules for its implementations of the MPI collective algorithms. From that perspective, once you understand that infrastructure, it's exactly the same regardless of whether the MPI job is using intra-node or inter-node collectives.

We don't have much in the way of detailed internal function call tracing inside Open MPI itself, due to performance considerations. You might want to look into flamegraphs, or something similar...?

--
Jeff Squyres
jsquy...@cisco.com
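Since the collective algorithms live in "coll" plug-in components, one low-effort starting point (a sketch, not taken from this thread) is to ask an existing install which coll components it was built with before diving into the source:

    # list the collective (coll) components compiled into this Open MPI install
    ompi_info | grep "MCA coll"
    # show the tunable parameters of one component, e.g. the "tuned" algorithms
    ompi_info --param coll tuned --level 9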
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of arun c via users <users@lists.open-mpi.org>
Sent: Saturday, November 12, 2022 9:46 AM
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: arun c <arun.edar...@gmail.com>
Subject: [OMPI users] Tracing of openmpi internal functions

Hi All,

I am new to openmpi and trying to learn the internals (source code level) of data transfer during collective operations. At first, I will limit it to intra-node (between cpu cores and sockets) to minimize the scope of learning.

What are the best options (looking for only free and open methods) for tracing the openmpi code? (Say I want to execute an alltoall collective and trace all the function calls and event callbacks that happened inside libmpi.so on all the cores.) The Linux kernel has something called ftrace, which gives a neat call graph of all the internal functions inside the kernel with time; is something similar available?

--Arun
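Following up on the flamegraph suggestion above, here is a hedged sketch of one way to get an ftrace-style view per MPI rank. It assumes Linux perf and Brendan Gregg's FlameGraph scripts are installed and that the application was built with -g; the wrapper name and the test program name are illustrative, not from this thread:

    #!/bin/sh
    # profile-rank.sh (illustrative): run the real program under perf, one profile per rank.
    # OMPI_COMM_WORLD_RANK is set by Open MPI in each launched process's environment.
    exec perf record -g -o perf.rank${OMPI_COMM_WORLD_RANK}.data "$@"

    # usage (illustrative program name):
    #   mpirun -n 2 ./profile-rank.sh ./my_alltoall_test
    #   perf script -i perf.rank0.data | stackcollapse-perf.pl | flamegraph.pl > rank0.svg

This samples call stacks (including time spent inside libmpi.so) rather than logging every function entry and exit, which is consistent with the performance caveat mentioned above.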