Hi,

common causes of this error are:

- you are mixing different openmpi versions (e.g. mpirun 1.10.2 and orted 2.0.0)

=> check your PATH and LD_LIBRARY_PATH do not contain openmpi 2.0.0 directories

=> if not already done, configure openmpi with --enable-mpirun-prefix-by-default

  => try to invoke mpirun by its absolute path, i.e. `which mpirun` (in backticks, so the shell expands it), instead of plain mpirun

=> check that your .bashrc or equivalent does not set anything related to openmpi 2.0.0

(either directly or via 'module', if you are using environment modules on your system)


- some libraries required by openmpi 1.10.2 are missing

this typically happens when openmpi is built with the intel compilers and the intel runtime libraries cannot be found on the compute nodes

=> if not already done, configure openmpi with --enable-mpirun-prefix-by-default

=> check your LD_LIBRARY_PATH points to the intel runtime on compute nodes
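
a quick way to do the PATH / LD_LIBRARY_PATH checks above is something like the sketch below. it assumes the install directories contain the string "openmpi", which may not be true on your system, and the helper name list_matching_dirs is made up for this example:

```shell
#!/bin/sh
# Sketch: look for stray Open MPI installs in the search paths.
# The pattern "openmpi" is an assumption about how the install
# directories are named on your system; adjust it as needed.

# Print every entry of a colon-separated list that matches a pattern.
# Prints nothing (and still succeeds) when there is no match.
list_matching_dirs() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i -- "$2" || true
}

echo "PATH entries:"
list_matching_dirs "$PATH" openmpi
echo "LD_LIBRARY_PATH entries:"
list_matching_dirs "${LD_LIBRARY_PATH:-}" openmpi

# Show which mpirun and orted would actually be picked up, if any.
for tool in mpirun orted; do
    command -v "$tool" || echo "$tool: not found in PATH"
done
```

if more than one openmpi version shows up, or mpirun and orted resolve to different install trees, that is a strong hint for the first case. run it both on the node where you call mpirun and on a compute node, since the environments can differ.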


generally speaking, you might want to run an openmpi 1.10.2 app without slurm

(so you can tell whether slurm is to be blamed or not)


you can also run a simple debug job with

srun -N $SLURM_NNODES -n $SLURM_NNODES ldd /.../orted

there should be *no* 'not found' libraries
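
if the ldd output is long, a small filter makes the unresolved entries easier to spot. this is only a sketch: check_ldd_output is a made-up helper name, and the orted path stays elided exactly as in the srun command above, so you have to substitute the one from your install:

```shell
#!/bin/sh
# Sketch: summarize ldd output so unresolved libraries stand out.

# Read ldd output on stdin and report the unresolved entries, if any.
check_ldd_output() {
    # grep exits non-zero when nothing matches, hence the "|| true"
    missing=$(grep 'not found' || true)
    if [ -n "$missing" ]; then
        printf 'missing libraries:\n%s\n' "$missing"
    else
        echo "all libraries resolved"
    fi
}

# Intended use inside a Slurm allocation (path elided as in the
# original command; -l makes srun prefix each line with its task number):
#   srun -l -N $SLURM_NNODES -n $SLURM_NNODES ldd /.../orted | check_ldd_output
```

with -l you can tell which task (and hence which node) reported a missing library.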


you should run the previous two tests with at least 2 nodes, since launching on a remote node (via ssh or srun) might behave differently from a single-node run



Cheers,


Gilles


On 7/22/2016 8:34 AM, Steven Lo wrote:
I'm a newbie to openmpi.


We have openmpi 1.10.2 running on a RHEL 7 server. When we submit a job using
"mpirun --mca oob_tcp_if_include ib0 -np 48 ./testjob" via slurm version
16.05.2, we get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------



Interesting thing is that when we run version 2.0.0 "mpirun"
(without --mca oob_tcp_if_include ib0) via slurm, the error
is gone.


Do you know if this problem is from openmpi or the combination
of slurm and openmpi?


Thanks

Steven.

Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29700.php

