Any assistance with this would be greatly appreciated. I'm running CentOS 7 
with Open MPI 1.10.7, and we are using a product called XFlow by 3ds. I have 
been going back and forth trying to figure out why my Open MPI job pauses 
when expanding across more than one machine.

I confirmed that the Open MPI library and binary paths are set correctly on 
all machines (head node and three compute nodes):
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:
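For reference, this is roughly how I checked the paths on each node (a sketch; the hostnames are our own, and the point is to look at the NON-interactive ssh environment, since that is what orted inherits rather than the interactive login shell):

```shell
# Check PATH/LD_LIBRARY_PATH and orted visibility over a non-interactive
# ssh session on each compute node. BatchMode avoids password prompts;
# the trailing || keeps the loop going if a host doesn't answer.
HOSTS="srv-comp01 srv-comp02 srv-comp03"
for h in $HOSTS; do
    echo "== $h =="
    ssh -o BatchMode=yes -o ConnectTimeout=3 "$h" \
        'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; which orted' \
        || echo "(no response from $h)"
done
```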

I can run a simple MPI job that displays each hostname:
mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

If I run the command that normally pauses, but specify the same hostname 
twice so everything stays on one machine, it works fine, i.e.:
mpirun -npernode 2 -host srv-comp01,srv-comp01 {command}

At the vendor's suggestion I tried "--mca btl tcp,self", but the job still 
pauses at the same spot.

The firewall is turned off on all machines. Passwordless SSH works without 
issue; I have tested it with another product we use called STAR-CCM+ (which 
has its own MPI provider).

I have not run hello_c or ring_c. I see them referenced in the FAQ entry 
"11. How can I diagnose problems when running across multiple hosts?", but I 
can't see where to download them from.

Here is verbose output of the command. It always pauses at "[ INFO  ] License 
validation OK" and goes no further. I am able to run the job without MPI on a 
single host. I'm not sure where to go from here.

[symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7

[symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host 
srv-comp01,srv-comp02,srv-comp03 
/mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component 
isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no 
register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register 
function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register 
function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open 
function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function 
successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function 
successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[srv-comp-hn:04909] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted --hnp-topo-sig 
0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" 
-mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca orte_hnp_uri 
"992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca 
plm_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" --tree-spawn
[srv-comp01:130272] mca: base: components_register: registering plm components
[srv-comp01:130272] mca: base: components_register: found loaded component rsh
[srv-comp01:130272] mca: base: components_register: component rsh register 
function successful
[srv-comp01:130272] mca: base: components_open: opening plm components
[srv-comp01:130272] mca: base: components_open: found loaded component rsh
[srv-comp01:130272] mca: base: components_open: component rsh open function 
successful
[srv-comp01:130272] mca:base:select: Auto-selecting plm components
[srv-comp01:130272] mca:base:select:(  plm) Querying component [rsh]
[srv-comp01:130272] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp01:130272] mca:base:select:(  plm) Selected component [rsh]
[srv-comp01:130272] [[15143,0],1] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted --hnp-topo-sig 
0N:35S:35L3:35L2:35L1:35C:35H:x86_64 -mca ess "env" -mca orte_ess_jobid 
"992411648" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca 
orte_parent_uri "992411648.1;tcp://10.1.28.50,192.168.122.1:34662" -mca 
orte_hnp_uri "992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --mca 
plm_base_verbose "10" -mca rmaps_ppr_n_pernode "1" -mca plm "rsh" --tree-spawn
[srv-comp02:33362] mca: base: components_register: registering plm components
[srv-comp02:33362] mca: base: components_register: found loaded component rsh
[srv-comp02:33362] mca: base: components_register: component rsh register 
function successful
[srv-comp02:33362] mca: base: components_open: opening plm components
[srv-comp02:33362] mca: base: components_open: found loaded component rsh
[srv-comp02:33362] mca: base: components_open: component rsh open function 
successful
[srv-comp02:33362] mca:base:select: Auto-selecting plm components
[srv-comp02:33362] mca:base:select:(  plm) Querying component [rsh]
[srv-comp02:33362] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp02:33362] mca:base:select:(  plm) Selected component [rsh]
[srv-comp03:89338] mca: base: components_register: registering plm components
[srv-comp03:89338] mca: base: components_register: found loaded component rsh
[srv-comp03:89338] mca: base: components_register: component rsh register 
function successful
[srv-comp03:89338] mca: base: components_open: opening plm components
[srv-comp03:89338] mca: base: components_open: found loaded component rsh
[srv-comp03:89338] mca: base: components_open: component rsh open function 
successful
[srv-comp03:89338] mca:base:select: Auto-selecting plm components
[srv-comp03:89338] mca:base:select:(  plm) Querying component [rsh]
[srv-comp03:89338] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp03:89338] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[ INFO  ] ## SIMULATION START ##
[ INFO  ] XFlow Build 106.00
[ INFO  ] Execution line: /mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[ INFO  ] Computation limited to: 1 cores per node.
[ INFO  ]
[ INFO  ] License validation OK
^C[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp02:33362] mca: base: close: component rsh closed
[srv-comp02:33362] mca: base: close: unloading component rsh
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command 
from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for 
job [15143,1]
[srv-comp03:89338] mca: base: close: component rsh closed
[srv-comp03:89338] mca: base: close: unloading component rsh
[srv-comp-hn:04909] mca: base: close: component rsh closed
[srv-comp-hn:04909] mca: base: close: unloading component rsh
[srv-comp01:130272] mca: base: close: component rsh closed
[srv-comp01:130272] mca: base: close: unloading component rsh
[symapp@srv-comp-hn ~]$


[symapp@srv-comp-hn ~]$ mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

[symapp@srv-comp-hn ~]$ env | grep -i path
MANPATH=:/opt/pbs/share/man
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:/opt/CD-adapco/13.04.011/STAR-View+13.04.011/bin:/opt/CD-adapco/13.04.011/STAR-CCM+13.04.011/star/bin:/opt/CD-adapco/13.04.010/STAR-View+13.04.010/bin:/opt/CD-adapco/13.04.010/STAR-CCM+13.04.010/star/bin:/mntnfs/eng-nfs/Apps/Abaqus/Commands:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/symapp/.local/bin:/home/symapp/bin
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
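If stack traces from the hung ranks would help, I can collect them with something like the following (a sketch: gstack ships with the gdb package on CentOS 7, and pgrep -f matches the engine binary name from the mpirun command line above):

```shell
# Grab a stack trace from any hung engine process on each node, so we can
# see where it is blocked. Hostnames and the binary name are our own; the
# trailing || keeps the loop going if a host doesn't answer.
out=$(
for h in srv-comp01 srv-comp02 srv-comp03; do
    echo "== $h =="
    ssh -o BatchMode=yes -o ConnectTimeout=3 "$h" \
        'pgrep -f engine-3d-mpi-ompi10 | xargs -r gstack' \
        || echo "(no response from $h)"
done
)
echo "$out"
```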
