Ok, this looks like the same type of output from ring_c as from your Python MPI app -- good. Using a C MPI program for testing just eliminates some possible variables/issues.
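(For anyone following along without the tarball handy: ring_c just passes an integer token around a ring of ranks -- rank 0 injects a value and decrements it on each lap until it reaches zero. A minimal sketch in that spirit, not the exact source shipped in the examples directory, looks roughly like this:)

/* ring.c -- illustrative sketch only; see examples/ring_c.c in the
 * tarball for the real program. Build with: mpicc ring.c -o ring_c */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* neighbors in the ring */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* rank 0 injects a countdown token into the ring */
    if (0 == rank) {
        message = 10;
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
    }

    /* everyone relays the token; rank 0 decrements it each lap */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (0 == rank) {
            --message;
            printf("Rank 0 decremented value to %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
        if (0 == message) {
            break;
        }
    }

    /* rank 0 still has the final 0 circling back to it; drain it */
    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}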
Ok, let's try running again, but add some more command line parameters:

mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 --prtemca grpcomm_base_verbose 5 --prtemca state_base_verbose 5 ./ring_c

And please send the output back here to the list.

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: timesir <mrlong...@gmail.com>
Sent: Tuesday, November 29, 2022 9:44 PM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Subject: Re: mpi program gets stuck

Do you think the information below is enough? If not, I will add more.

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ examples mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ./ring_c
[computer01:74388] mca: base: component_find: searching NULL for plm components
[computer01:74388] mca: base: find_dyn_components: checking NULL for plm components
[computer01:74388] pmix:mca: base: components_register: registering framework plm components
[computer01:74388] pmix:mca: base: components_register: found loaded component slurm
[computer01:74388] pmix:mca: base: components_register: component slurm register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component ssh
[computer01:74388] pmix:mca: base: components_register: component ssh register function successful
[computer01:74388] mca: base: components_open: opening plm components
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function successful
[computer01:74388] mca: base: components_open: found loaded component ssh
[computer01:74388] mca: base: components_open: component ssh open function successful
[computer01:74388] mca:base:select: Auto-selecting plm components
[computer01:74388] mca:base:select:( plm) Querying component [slurm]
[computer01:74388] mca:base:select:( plm) Querying component [ssh]
[computer01:74388] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:74388] mca:base:select:( plm) Query of component [ssh] set priority to 10
[computer01:74388] mca:base:select:( plm) Selected component [ssh]
[computer01:74388] mca: base: close: component slurm closed
[computer01:74388] mca: base: close: unloading component slurm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive start comm
[computer01:74388] mca: base: component_find: searching NULL for ras components
[computer01:74388] mca: base: find_dyn_components: checking NULL for ras components
[computer01:74388] pmix:mca: base: components_register: registering framework ras components
[computer01:74388] pmix:mca: base: components_register: found loaded component simulator
[computer01:74388] pmix:mca: base: components_register: component simulator register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component pbs
[computer01:74388] pmix:mca: base: components_register: component pbs register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component slurm
[computer01:74388] pmix:mca: base: components_register: component slurm register function successful
[computer01:74388] mca: base: components_open: opening ras components
[computer01:74388] mca: base: components_open: found loaded component simulator
[computer01:74388] mca: base: components_open: found loaded component pbs
[computer01:74388] mca: base: components_open: component pbs open function successful
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function successful
[computer01:74388] mca:base:select: Auto-selecting ras components
[computer01:74388] mca:base:select:( ras) Querying component [simulator]
[computer01:74388] mca:base:select:( ras) Querying component [pbs]
[computer01:74388] mca:base:select:( ras) Querying component [slurm]
[computer01:74388] mca:base:select:( ras) No component selected!
[computer01:74388] mca: base: component_find: searching NULL for rmaps components
[computer01:74388] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:74388] pmix:mca: base: components_register: registering framework rmaps components
[computer01:74388] pmix:mca: base: components_register: found loaded component ppr
[computer01:74388] pmix:mca: base: components_register: component ppr register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component rank_file
[computer01:74388] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:74388] pmix:mca: base: components_register: found loaded component round_robin
[computer01:74388] pmix:mca: base: components_register: component round_robin register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component seq
[computer01:74388] pmix:mca: base: components_register: component seq register function successful
[computer01:74388] mca: base: components_open: opening rmaps components
[computer01:74388] mca: base: components_open: found loaded component ppr
[computer01:74388] mca: base: components_open: component ppr open function successful
[computer01:74388] mca: base: components_open: found loaded component rank_file
[computer01:74388] mca: base: components_open: found loaded component round_robin
[computer01:74388] mca: base: components_open: component round_robin open function successful
[computer01:74388] mca: base: components_open: found loaded component seq
[computer01:74388] mca: base: components_open: component seq open function successful
[computer01:74388] mca:rmaps:select: checking available component ppr
[computer01:74388] mca:rmaps:select: Querying component [ppr]
[computer01:74388] mca:rmaps:select: checking available component rank_file
[computer01:74388] mca:rmaps:select: Querying component [rank_file]
[computer01:74388] mca:rmaps:select: checking available component round_robin
[computer01:74388] mca:rmaps:select: Querying component [round_robin]
[computer01:74388] mca:rmaps:select: checking available component seq
[computer01:74388] mca:rmaps:select: Querying component [seq]
[computer01:74388] [prterun-computer01-74388@0,0]: Final mapper priorities
[computer01:74388] Mapper: rank_file Priority: 100
[computer01:74388] Mapper: ppr Priority: 90
[computer01:74388] Mapper: seq Priority: 60
[computer01:74388] Mapper: round_robin Priority: 10
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate adding hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert inserting 2 nodes
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert node 192.168.60.203 slots 1

====================== ALLOCATED NODES ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================

[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm creating map
[computer01:74388] [prterun-computer01-74388@0,0] setup:vm: working unmanaged allocation
[computer01:74388] [prterun-computer01-74388@0,0] using hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] checking node 192.168.180.48
[computer01:74388] [prterun-computer01-74388@0,0] ignoring myself
[computer01:74388] [prterun-computer01-74388@0,0] checking node 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm add new daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-74388@0,1] to node 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: launching vm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: local shell: 0 (bash)
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: assuming same remote shell as local shell
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: remote shell: 0 (bash)
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: final template argv: /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-74388@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh:launch daemon 0 not a child of mine
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: adding node 192.168.60.203 to launch list
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: activating launch event
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: recording launch of daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-74388@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-74388@0,1] on node computer02
[computer01:74388] ALIASES FOR NODE computer02 (computer02)
[computer01:74388] ALIAS: 192.168.60.203
[computer01:74388] ALIAS: computer02
[computer01:74388] ALIAS: 172.17.180.203
[computer01:74388] ALIAS: 172.168.10.23
[computer01:74388] ALIAS: 172.168.10.143
[computer01:74388] [prterun-computer01-74388@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
[computer01:74388] [prterun-computer01-74388@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-74388@0,1] at contact prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch job prterun-computer01-74388@0 recvd 2 of 2 reported daemons
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive job launch command from [prterun-computer01-74388@0,0]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive adding hosts

====================== ALLOCATED NODES ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================

[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive calling spawn
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_job
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate allocation already read
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm
[computer01:74388] [prterun-computer01-74388@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm no new daemons required
[computer01:74388] mca:rmaps: mapping job prterun-computer01-74388@1
[computer01:74388] mca:rmaps: setting mapping policies for job prterun-computer01-74388@1 inherit TRUE hwtcpus FALSE
[computer01:74388] mca:rmaps[355] mapping not given - using bycore
[computer01:74388] setdefaultbinding[314] binding not given - using bycore
[computer01:74388] mca:rmaps:rf: job prterun-computer01-74388@1 not using rankfile policy
[computer01:74388] mca:rmaps:ppr: job prterun-computer01-74388@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:74388] [prterun-computer01-74388@0,0] rmaps:seq called on job prterun-computer01-74388@1
[computer01:74388] mca:rmaps:seq: job prterun-computer01-74388@1 not using seq mapper
[computer01:74388] mca:rmaps:rr: mapping job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] using hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:74388] NODE computer01 DOESNT MATCH NODE 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] node computer01 has 1 slots available
[computer01:74388] [prterun-computer01-74388@0,0] node computer02 has 1 slots available
[computer01:74388] AVAILABLE NODES FOR MAPPING:
[computer01:74388] node: computer01 daemon: 0 slots_available: 1
[computer01:74388] node: computer02 daemon: 1 slots_available: 1
[computer01:74388] mca:rmaps:rr: mapping by Core for job prterun-computer01-74388@1 slots 2 num_procs 2
[computer01:74388] mca:rmaps:rr: found 56 Core objects on node computer01
[computer01:74388] mca:rmaps:rr: assigning nprocs 1
[computer01:74388] mca:rmaps:rr: assigning proc to object 0
[computer01:74388] [prterun-computer01-74388@0,0] get_avail_ncpus: node computer01 has 0 procs on it
[computer01:74388] mca:rmaps: compute bindings for job prterun-computer01-74388@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:74388] mca:rmaps: bind [prterun-computer01-74388@1,INVALID] with policy CORE:IF-SUPPORTED
[computer01:74388] [prterun-computer01-74388@0,0] BOUND PROC [prterun-computer01-74388@1,INVALID][computer01] TO package[0][core:0]
[computer01:74388] mca:rmaps:rr: found 64 Core objects on node computer02
[computer01:74388] mca:rmaps:rr: assigning nprocs 1
[computer01:74388] mca:rmaps:rr: assigning proc to object 0
[computer01:74388] [prterun-computer01-74388@0,0] get_avail_ncpus: node computer02 has 0 procs on it
[computer01:74388] mca:rmaps: compute bindings for job prterun-computer01-74388@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:74388] mca:rmaps: bind [prterun-computer01-74388@1,INVALID] with policy CORE:IF-SUPPORTED
[computer01:74388] [prterun-computer01-74388@0,0] BOUND PROC [prterun-computer01-74388@1,INVALID][computer02] TO package[0][core:0]
[computer01:74388] [prterun-computer01-74388@0,0] complete_setup on job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch_apps for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:send launch msg for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive local launch complete command from [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local launch complete for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local launch complete for vpid 1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch wiring up iof for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive registered command from [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got registered for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got registered for vpid 1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch prterun-computer01-74388@1 registered
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:prted_cmd sending prted_exit commands
Abort is in progress...hit ctrl-c again to forcibly terminate

On 2022/11/30 00:08, Jeff Squyres (jsquyres) wrote:

(we've conversed a bit off-list; bringing this back to the list with a good subject to differentiate it from other digest threads)

I'm glad the tarball I provided (that included the PMIx fix) resolved running "uptime" for you.

Can you try running a plain C MPI program instead of a Python MPI program? That would just eliminate a few more variables from the troubleshooting process.

In the "examples" directory in the tarball I provided are trivial "hello world" and "ring" MPI programs. A "make" should build them all. Try running hello_c and ring_c.

--
Jeff Squyres
jsquy...@cisco.com

________________________________
From: timesir <mrlong...@gmail.com>
Sent: Tuesday, November 29, 2022 10:42 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: mpi program gets stuck

see also: https://pastebin.com/s5tjaUkF

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

1. This command now runs correctly using your openmpi-gitclone-pr11096.tar.bz2:

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime

2. But this command gets stuck. It seems to be the MPI program itself that gets stuck.

test.py:
import mpi4py
from mpi4py import MPI

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm components
[computer01:47982] pmix:mca: base: components_register: registering framework plm components
[computer01:47982] pmix:mca: base: components_register: found loaded component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component ssh
[computer01:47982] pmix:mca: base: components_register: component ssh register function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:( plm) Querying component [slurm]
[computer01:47982] mca:base:select:( plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:( plm) Query of component [ssh] set priority to 10
[computer01:47982] mca:base:select:( plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
[computer01:47982] mca: base: component_find: searching NULL for ras components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras components
[computer01:47982] pmix:mca: base: components_register: registering framework ras components
[computer01:47982] pmix:mca: base: components_register: found loaded component simulator
[computer01:47982] pmix:mca: base: components_register: component simulator register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component pbs
[computer01:47982] pmix:mca: base: components_register: component pbs register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:( ras) Querying component [simulator]
[computer01:47982] mca:base:select:( ras) Querying component [pbs]
[computer01:47982] mca:base:select:( ras) Querying component [slurm]
[computer01:47982] mca:base:select:( ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps components
[computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:47982] pmix:mca: base: components_register: registering framework rmaps components
[computer01:47982] pmix:mca: base: components_register: found loaded component ppr
[computer01:47982] pmix:mca: base: components_register: component ppr register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component rank_file
[computer01:47982] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:47982] pmix:mca: base: components_register: found loaded component round_robin
[computer01:47982] pmix:mca: base: components_register: component round_robin register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component seq
[computer01:47982] pmix:mca: base: components_register: component seq register function successful
[computer01:47982] mca: base: components_open: opening rmaps components
[computer01:47982] mca: base: components_open: found loaded component ppr
[computer01:47982] mca: base: components_open: component ppr open function successful
[computer01:47982] mca: base: components_open: found loaded component rank_file
[computer01:47982] mca: base: components_open: found loaded component round_robin
[computer01:47982] mca: base: components_open: component round_robin open function successful
[computer01:47982] mca: base: components_open: found loaded component seq
[computer01:47982] mca: base: components_open: component seq open function successful
[computer01:47982] mca:rmaps:select: checking available component ppr
[computer01:47982] mca:rmaps:select: Querying component [ppr]
[computer01:47982] mca:rmaps:select: checking available component rank_file
[computer01:47982] mca:rmaps:select: Querying component [rank_file]
[computer01:47982] mca:rmaps:select: checking available component round_robin
[computer01:47982] mca:rmaps:select: Querying component [round_robin]
[computer01:47982] mca:rmaps:select: checking available component seq
[computer01:47982] mca:rmaps:select: Querying component [seq]
[computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
[computer01:47982] Mapper: rank_file Priority: 100
[computer01:47982] Mapper: ppr Priority: 90
[computer01:47982] Mapper: seq Priority: 60
[computer01:47982] Mapper: round_robin Priority: 10
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert inserting 2 nodes
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node 192.168.60.203 slots 1

====================== ALLOCATED NODES ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        Flags: SLOTS_GIVEN
        aliases: NONE
=================================================================

[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm creating map
[computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working unmanaged allocation
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.180.48
[computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
[computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same remote shell as local shell
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 (bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template argv: /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0 not a child of mine
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node 192.168.60.203 to launch list
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating launch event
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch of daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on node computer02
[computer01:47982] ALIASES FOR NODE computer02 (computer02)
[computer01:47982] ALIAS: 192.168.60.203
[computer01:47982] ALIAS: computer02
[computer01:47982] ALIAS: 172.17.180.203
[computer01:47982] ALIAS: 172.168.10.23
[computer01:47982] ALIAS: 172.168.10.143
[computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
[computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-47982@0,1] at contact prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2 reported daemons
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job launch command from [prterun-computer01-47982@0,0]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding hosts

====================== ALLOCATED NODES ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================

[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling spawn
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate allocation already read
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new daemons required
[computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
[computer01:47982] mca:rmaps: setting mapping policies for job prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
[computer01:47982] mca:rmaps[355] mapping not given - using bycore
[computer01:47982] setdefaultbinding[314] binding not given - using bycore
[computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using rankfile policy
[computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job prterun-computer01-47982@1
[computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using seq mapper
[computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
[computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1 slots available
[computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1 slots available
[computer01:47982] AVAILABLE NODES FOR MAPPING:
[computer01:47982] node: computer01 daemon: 0 slots_available: 1
[computer01:47982] node: computer02 daemon: 1 slots_available: 1
[computer01:47982] mca:rmaps:rr: mapping by Core for job prterun-computer01-47982@1 slots 2 num_procs 2
[computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer01 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
[computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer02 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
[computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local launch complete command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up iof for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive registered command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch prterun-computer01-47982@1 registered
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd sending prted_exit commands
#### ctrl + c
Abort is in progress...hit ctrl-c again to forcibly terminate