Hi,
common errors are:
- you are mixing different openmpi versions (e.g. mpirun 1.10.2 and orted
2.0.0)
=> check that your PATH and LD_LIBRARY_PATH do not contain openmpi 2.0.0
directories
=> if not already done, configure openmpi with
--enable-mpirun-prefix-by-default
=> try to invoke mpirun by its full path, i.e. run `which mpirun` instead
of plain mpirun
(mpirun started with an absolute path behaves as if --prefix was given)
=> check that your .bashrc or equivalent does not set anything related to
openmpi 2.0.0
(either directly or via 'module' if you are using modules on your
system)
- some libraries required by openmpi 1.10.2 are missing
this typically happens when openmpi is built with the intel compilers
and the intel runtime cannot be found on the compute nodes
=> if not already done, configure openmpi with
--enable-mpirun-prefix-by-default
=> check that your LD_LIBRARY_PATH points to the intel runtime on the
compute nodes
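to make the PATH / LD_LIBRARY_PATH checks above a bit more concrete, here is a minimal sketch. it assumes the unwanted installation has "2.0.0" somewhere in its directory name (that is only a guess about your layout), and list_matching_entries is a helper name made up for this example, not something that ships with openmpi.

```shell
#!/bin/sh
# Print each entry of a colon-separated list that contains a given substring.
# list_matching_entries is a hypothetical helper, not part of Open MPI.
list_matching_entries() {
    # $1 = colon-separated list (e.g. "$PATH"), $2 = substring to look for
    printf '%s\n' "$1" | tr ':' '\n' | grep -F -- "$2"
}

# Assumption: the unwanted installation has "2.0.0" in its directory name.
pattern="2.0.0"

echo "PATH entries containing $pattern:"
list_matching_entries "${PATH:-}" "$pattern" || echo "  (none)"

echo "LD_LIBRARY_PATH entries containing $pattern:"
list_matching_entries "${LD_LIBRARY_PATH:-}" "$pattern" || echo "  (none)"

# Show which mpirun and orted the shell would actually pick up.
command -v mpirun || echo "mpirun not found in PATH"
command -v orted || echo "orted not found in PATH"
```

the environment seen on a compute node can differ from your login shell, so it is probably worth running the same script on a compute node too (via ssh or srun).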
generally speaking, you might want to run an openmpi 1.10.2 app without
slurm
(so you can tell whether slurm is to blame or not)
you can also run a simple debug job with
srun -N $SLURM_NNODES -n $SLURM_NNODES ldd /.../orted
there should be *no* 'not found' libraries in the output
you should run the previous two tests on at least 2 nodes, since launching
on a remote node via ssh or srun might behave differently from a local
launch
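the ldd debug job above can be wrapped in a small script that only prints the unresolved libraries. this is a rough sketch: ORTED is a variable you have to set yourself to the full path of your orted (the path is deliberately not guessed here), and missing_libs is a helper name made up for this example.

```shell
#!/bin/sh
# Reduce ldd output (read from stdin) to the unresolved libraries.
# missing_libs is a hypothetical helper name.
missing_libs() {
    grep -F 'not found'
}

# ORTED must hold the full path to the orted binary; no default is assumed.
if [ -n "${ORTED:-}" ] && [ -n "${SLURM_NNODES:-}" ] && command -v srun >/dev/null 2>&1; then
    # One task per node, same srun options as in the command above.
    out=$(srun -N "$SLURM_NNODES" -n "$SLURM_NNODES" ldd "$ORTED" 2>&1) \
        || echo "warning: srun or ldd exited with an error"
    if printf '%s\n' "$out" | missing_libs; then
        echo "at least one node reports unresolved libraries (listed above)"
    else
        echo "no 'not found' libraries reported"
    fi
else
    echo "set ORTED and run this inside a slurm allocation to do the check"
fi
```

run it from within an allocation of at least 2 nodes, for example as `ORTED=/.../orted sh check_orted.sh` (the script name is arbitrary).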
Cheers,
Gilles
On 7/22/2016 8:34 AM, Steven Lo wrote:
I'm a newbie to openmpi.
We have openmpi 1.10.2 running on a RHEL 7 server. When we submit a job
using
"mpirun --mca oob_tcp_if_include ib0 -np 48 ./testjob" via slurm version
16.05.2, we get the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
The interesting thing is that when we run the version 2.0.0 "mpirun"
(without --mca oob_tcp_if_include ib0) via slurm, the error
is gone.
Do you know if this problem comes from openmpi or from the combination
of slurm and openmpi?
Thanks
Steven.
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/07/29700.php