Any assistance with this would be greatly appreciated. I'm running CentOS 7 with Open MPI 1.10.7. We are using a product called XFlow by 3ds. I have been going back and forth trying to figure out why my Open MPI job pauses when expanding across more than one machine.

I confirmed that the Open MPI environment variable paths to the libraries and bin files are correct on all machines (head node and 3 compute nodes):

LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:

I can run an MPI job to display the host names:

mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

If I run the command that normally pauses but list the same hostname twice, it works fine, i.e.:

mpirun -npernode 2 -host srv-comp01,srv-comp01 {command}

At the vendor's suggestion I have tried "--mca btl tcp,self"; the job still pauses at the same spot. The firewall is turned off on all machines. Password-less SSH works without issue. I have tested this with another product we use called starccm (it has its own MPI provider). I have not run hello_c or ring_c; I see them referenced in FAQ item 11, "How can I diagnose problems when running across multiple hosts?", but I can't see where to download them from. Below is verbose output of the command. It always pauses at "[ INFO ] License validation OK" and goes no further. I am able to run the job without MPI on a single host. I'm not sure where to go from here.
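For what it's worth, here is roughly how I checked the per-node environment. This is only a sketch: `check_node` and `check_all` are helper names I made up, and it assumes password-less SSH to each compute node (which works here). Nothing runs until `check_all` is called.

```shell
#!/bin/sh
# Sketch of the per-node sanity checks described above.
# check_node: print the MPI-relevant environment and binary locations
# on one remote node, via a non-interactive SSH session.
check_node() {
    ssh "$1" 'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; command -v mpirun orted'
}

# check_all: run the check against each compute node in this cluster.
check_all() {
    for h in srv-comp01 srv-comp02 srv-comp03; do
        echo "== $h =="
        check_node "$h"
    done
}
```

The reason for going through `ssh` rather than an interactive login is that Open MPI's rsh launcher uses a non-interactive SSH session, so it is the non-interactive PATH/LD_LIBRARY_PATH that matters for finding `orted` on the remote nodes.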
[symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7
[symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host srv-comp01,srv-comp02,srv-comp03 /mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 /mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:( plm) Query of component [isolated] set priority to 0
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:( plm) Query of component [rsh] set priority to 10
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[srv-comp-hn:04909] mca:base:select:( plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv: /usr/bin/ssh <template> orted --hnp-topo-sig 0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca orte_hnp_uri "992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca plm_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" --tree-spawn
[srv-comp01:130272] mca: base: components_register: registering plm components
[srv-comp01:130272] mca: base: components_register: found loaded component rsh
[srv-comp01:130272] mca: base: components_register: component rsh register function successful
[srv-comp01:130272] mca: base: components_open: opening plm components
[srv-comp01:130272] mca: base: components_open: found loaded component rsh
[srv-comp01:130272] mca: base: components_open: component rsh open function successful
[srv-comp01:130272] mca:base:select: Auto-selecting plm components
[srv-comp01:130272] mca:base:select:( plm) Querying component [rsh]
[srv-comp01:130272] mca:base:select:( plm) Query of component [rsh] set priority to 10
[srv-comp01:130272] mca:base:select:( plm) Selected component [rsh]
[srv-comp01:130272] [[15143,0],1] plm:rsh: final template argv: /usr/bin/ssh <template> orted --hnp-topo-sig 0N:35S:35L3:35L2:35L1:35C:35H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4" -mca orte_parent_uri "992411648.1;tcp://10.1.28.50,192.168.122.1:34662" -mca orte_hnp_uri "992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --mca plm_base_verbose "10" -mca rmaps_ppr_n_pernode "1" -mca plm "rsh" --tree-spawn
[srv-comp02:33362] mca: base: components_register: registering plm components
[srv-comp02:33362] mca: base: components_register: found loaded component rsh
[srv-comp02:33362] mca: base: components_register: component rsh register function successful
[srv-comp02:33362] mca: base: components_open: opening plm components
[srv-comp02:33362] mca: base: components_open: found loaded component rsh
[srv-comp02:33362] mca: base: components_open: component rsh open function successful
[srv-comp02:33362] mca:base:select: Auto-selecting plm components
[srv-comp02:33362] mca:base:select:( plm) Querying component [rsh]
[srv-comp02:33362] mca:base:select:( plm) Query of component [rsh] set priority to 10
[srv-comp02:33362] mca:base:select:( plm) Selected component [rsh]
[srv-comp03:89338] mca: base: components_register: registering plm components
[srv-comp03:89338] mca: base: components_register: found loaded component rsh
[srv-comp03:89338] mca: base: components_register: component rsh register function successful
[srv-comp03:89338] mca: base: components_open: opening plm components
[srv-comp03:89338] mca: base: components_open: found loaded component rsh
[srv-comp03:89338] mca: base: components_open: component rsh open function successful
[srv-comp03:89338] mca:base:select: Auto-selecting plm components
[srv-comp03:89338] mca:base:select:( plm) Querying component [rsh]
[srv-comp03:89338] mca:base:select:( plm) Query of component [rsh] set priority to 10
[srv-comp03:89338] mca:base:select:( plm) Selected component [rsh]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[ INFO ] ## SIMULATION START ##
[ INFO ] XFlow Build 106.00
[ INFO ] Execution line: /mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 /mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[ INFO ] Computation limited to: 1 cores per node.
[ INFO ]
[ INFO ] License validation OK
^C[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],2]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[srv-comp02:33362] mca: base: close: component rsh closed
[srv-comp02:33362] mca: base: close: unloading component rsh
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive update proc state command from [[15143,0],3]
[srv-comp-hn:04909] [[15143,0],0] plm:base:receive got update_proc_state for job [15143,1]
[srv-comp03:89338] mca: base: close: component rsh closed
[srv-comp03:89338] mca: base: close: unloading component rsh
[srv-comp-hn:04909] mca: base: close: component rsh closed
[srv-comp-hn:04909] mca: base: close: unloading component rsh
[srv-comp01:130272] mca: base: close: component rsh closed
[srv-comp01:130272] mca: base: close: unloading component rsh
[symapp@srv-comp-hn ~]$
[symapp@srv-comp-hn ~]$ mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03
[symapp@srv-comp-hn ~]$ env | grep -i path
MANPATH=:/opt/pbs/share/man
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:/opt/CD-adapco/13.04.011/STAR-View+13.04.011/bin:/opt/CD-adapco/13.04.011/STAR-CCM+13.04.011/star/bin:/opt/CD-adapco/13.04.010/STAR-View+13.04.010/bin:/opt/CD-adapco/13.04.010/STAR-CCM+13.04.010/star/bin:/mntnfs/eng-nfs/Apps/Abaqus/Commands:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/symapp/.local/bin:/home/symapp/bin
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles