I'm a newbie to Open MPI.
We have Open MPI 1.10.2 running on a RHEL 7 server. When we submit a job with "mpirun --mca oob_tcp_if_include ib0 -np 48 ./testjob" via Slurm 16.05.2, we get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

Interestingly, when we run the version 2.0.0 "mpirun" (without --mca oob_tcp_if_include ib0) via Slurm, the error goes away. Do you know whether this problem comes from Open MPI itself or from the combination of Slurm and Open MPI?

Thanks,
Steven