Agree with Ralph.  Your next step is to try what is suggested in the FAQ: run 
hello_c and ring_c.

They are in the examples/ directory in the source tarball.  Once Open MPI is 
installed (and things like "mpicc" can be found in your $PATH), you can just cd 
in there and run "make" to build them.
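
For example, a minimal sketch (assuming the 1.10.7 source tarball is unpacked under ~/openmpi-1.10.7 and that directory is visible on all three nodes, e.g. over NFS -- adjust the path to wherever you unpacked it):

    cd ~/openmpi-1.10.7/examples
    make                    # builds hello_c, ring_c, and the other examples
    mpirun -npernode 1 -host srv-comp01,srv-comp02,srv-comp03 ./hello_c
    mpirun -npernode 1 -host srv-comp01,srv-comp02,srv-comp03 ./ring_c

hello_c only calls MPI_Init/MPI_Finalize and prints a line per rank, while ring_c actually passes a message between the ranks, so running both helps separate launcher problems from inter-node MPI communication problems.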



On Nov 13, 2019, at 8:58 PM, Ralph Castain via users <users@lists.open-mpi.org> wrote:

Difficult to know what to say here. I have no idea what your program does after 
validating the license. Does it execute some kind of MPI collective operation? 
Does only one proc validate the license and all others just use it?

All I can tell from your output is that the procs all launched okay.
Ralph


On Sep 27, 2019, at 4:32 PM, Steven Hill via users <users@lists.open-mpi.org> wrote:

Any assistance with this would be greatly appreciated. I’m running CentOS 7 with Open MPI 1.10.7. We are using a product called XFlow by 3DS. I have been going back and forth trying to figure out why my Open MPI job pauses when expanding across more than one machine.

I confirmed that the Open MPI environment variable paths to the libraries and bin files are correct on all machines (head node and 3 compute nodes).
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:

I can run an MPI Job to display the host name.
mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

If I run the command which normally pauses but just identify the same hostname twice, it works fine, i.e.:
mpirun -npernode 2 -host srv-comp01, srv-comp02 {command}

At the suggestion of the vendor I have tried “--mca btl tcp,self”; the job still pauses at the same spot.
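For reference, the flag was added to the same command line, roughly like this ({command} again being a placeholder for the actual XFlow invocation):

    # sketch only: same host list and options as the runs shown below
    mpirun --mca btl tcp,self -npernode 1 -host srv-comp01,srv-comp02,srv-comp03 {command}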

The firewall is turned off on all machines. Password-less SSH works without issue. I have tested this with another product we use called starccm (it has its own MPI provider).

I have not run hello_c or ring_c. I see them referenced in the FAQ item “11. How can I diagnose problems when running across multiple hosts?”, but I can’t see where to download them from.

Here is a verbose output of the command. It always pauses at “[ INFO  ] License 
validation OK” and goes no further. I am able to run the job without MPI on a 
single host. I’m not sure where to go from here.

[symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7

[symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host 
srv-comp01,srv-comp02,srv-comp03 
/mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component 
isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no 
register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register 
function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register 
function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open 
function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function 
successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function 
successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[srv-comp-hn:04909] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted --hnp-topo-sig 
0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" 
-mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca orte_hnp_uri 
"992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca 
plm_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" --tree-spawn
[srv-comp01:130272] mca: base: components_register: registering plm components
[srv-comp01:130272] mca: base: components_register: found loaded component rsh
[srv-comp01:130272] mca: base: components_register: component rsh register 
function successful
[srv-comp01:130272] mca: base: components_open: opening plm components
[srv-comp01:130272] mca: base: components_open: found loaded component rsh
[srv-comp01:130272] mca: base: components_open: component rsh open function 
successful
[srv-comp01:130272] mca:base:select: Auto-selecting plm components
[srv-comp01:130272] mca:base:select:(  plm) Querying component [rsh]
[srv-comp01:130272] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp01:130272] mca:base:select:(  plm) Selected component [rsh]
[srv-comp01:130272] [[15143,0],1] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted --hnp-topo-sig 
0N:35S:35L3:35L2:35L1:35C:35H:x86_64 -mca ess "env" -mca orte_ess_jobid 
"992411648" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca 
orte_parent_uri "992411648.1;tcp://10.1.28.50,192.168.122.1:34662" -mca 
orte_hnp_uri "992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --mca 
plm_base_verbose "10" -mca rmaps_ppr_n_pernode "1" -mca plm "rsh" --tree-spawn
[srv-comp02:33362] mca: base: components_register: registering plm components
[srv-comp02:33362] mca: base: components_register: found loaded component rsh
[srv-comp02:33362] mca: base: components_register: component rsh register 
function successful
[srv-comp02:33362] mca: base: components_open: opening plm components
[srv-comp02:33362] mca: base: components_open: found loaded component rsh
[srv-comp02:33362] mca: base: components_open: component rsh open function 
successful
[srv-comp02:33362] mca:base:select: Auto-selecting plm components
[srv-comp02:33362] mca:base:select:(  plm) Querying component [rsh]
[srv-comp02:33362] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp02:33362] mca:base:select:(  plm) Selected component [rsh]
[srv-comp03:89338] mca: base: components_register: registering plm components
[srv-comp03:89338] mca: base: components_register: found loaded component rsh
[srv-comp03:89338] mca: base: components_register: component rsh register 
function successful
[srv-comp03:89338] mca: base: components_open: opening plm components
[srv-comp03:89338] mca: base: components_open: found loaded component rsh
[srv-comp03:89338] mca: base: components_open: component rsh open function 
successful
[srv-comp03:89338] mca:base:select: Auto-selecting plm components
[srv-comp03:89338] mca:base:select:(  plm) Querying component [rsh]
[srv-comp03:89338] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp03:89338] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[ INFO  ] ## SIMULATION START ##
[ INFO  ] XFlow Build 106.00
[ INFO  ] Execution line: /mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[ INFO  ] Computation limited to: 1 cores per node.
[ INFO  ]
[ INFO  ] License validation OK
^C[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp02:33362] mca: base: close: component rsh closed
[srv-comp02:33362] mca: base: close: unloading component rsh
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp03:89338] mca: base: close: component rsh closed
[srv-comp03:89338] mca: base: close: unloading component rsh
[srv-comp-hn:04909] mca: base: close: component rsh closed
[srv-comp-hn:04909] mca: base: close: unloading component rsh
[srv-comp01:130272] mca: base: close: component rsh closed
[srv-comp01:130272] mca: base: close: unloading component rsh
[symapp@srv-comp-hn ~]$


[symapp@srv-comp-hn ~]$ mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

[symapp@srv-comp-hn ~]$ env | grep -i path
MANPATH=:/opt/pbs/share/man
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:/opt/CD-adapco/13.04.011/STAR-View+13.04.011/bin:/opt/CD-adapco/13.04.011/STAR-CCM+13.04.011/star/bin:/opt/CD-adapco/13.04.010/STAR-View+13.04.010/bin:/opt/CD-adapco/13.04.010/STAR-CCM+13.04.010/star/bin:/mntnfs/eng-nfs/Apps/Abaqus/Commands:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/symapp/.local/bin:/home/symapp/bin
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles



--
Jeff Squyres
jsquy...@cisco.com
