I'm new to Open MPI.

We have Open MPI 1.10.2 running on a RHEL 7 server. When we submit a job using
"mpirun --mca oob_tcp_if_include ib0 -np 48 ./testjob" via Slurm version
16.05.2, we get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------



Interestingly, when we run the version 2.0.0 "mpirun"
(without --mca oob_tcp_if_include ib0) via Slurm, the error
is gone.


Do you know whether this problem comes from Open MPI itself or from
the combination of Slurm and Open MPI?


Thanks

Steven.
