Stefan,

Which version of Open MPI are you using?

When does the error occur?
Is it before MPI_Init() completes?
Is it in the middle of the job? If so, are you sure no task invoked MPI_Abort()?

Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory) event. A possible explanation is that some tasks caused an OOM condition and the OOM killer chose to kill orted instead of a.out.
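If the system log is accessible, a quick check for OOM-killer activity could look like this (a sketch only: the exact message wording varies between kernel versions, so the patterns below are a best guess):

```shell
# Hedged sketch: scan the kernel log for signs of the OOM killer.
# Message formats differ across kernels; adjust the patterns as needed.
OOM_PATTERN='out of memory|oom-killer|killed process'
dmesg 2>/dev/null | grep -i -E "$OOM_PATTERN" || echo "no OOM messages found"
```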

If you cannot access your system log, you can try running with a large number of nodes and one MPI task per node, then increase the number of tasks per node and see when the problem starts happening.
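That scale-up test could be scripted roughly like this (a sketch: `./a.out` stands in for the real binary, and the `ppr` mapping syntax assumes Open MPI 1.8 or newer):

```shell
# Sketch: run the job with an increasing number of tasks per node and
# stop at the first failure, to find the per-node density where it breaks.
scale_test() {
    for ppn in 1 2 4 8 16; do
        echo "=== ${ppn} task(s) per node ==="
        mpirun --map-by ppr:${ppn}:node ./a.out || return 1
    done
}
```

Calling `scale_test` inside the allocation then shows at which per-node task count the failure first appears.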

Of course, to be on the safe side, you can try
mpirun --mca oob_tcp_if_include eth0 ...

You can also try running your application over TCP and see if it helps
(note that the issue might stay hidden, since TCP is much slower than native PSM):

mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
or
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...

/* feel free to replace vader with sm if vader is not available on your system */

Cheers,

Gilles

On 4/12/2016 4:37 PM, Stefan Friedel wrote:
Good Morning List,
we have a problem on our cluster with larger jobs (more than ~200 nodes):
almost every job ends with a message like:

###################
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

 hostname:  stek346

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##########################

I found a similar question on the list from Emyr James (2015-10-01), but
it was never answered.

Cluster: Dual-Intel Xeon E5-2630 v3 Haswell, Intel/Qlogic Truescale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi messages
over psm/IB + 1G Ethernet (Mgmt, pxe boot, ssh, openmpi tcp network etc.)

Jobs are started via slurm sbatch/script (mpirun --mca mtl psm ~/path/to/binary)

Already tested:
* several MCA settings (in ...many... combinations):
mtl_psm_connect_timeout 600
oob_tcp_keepalive_time 600
oob_tcp_if_include eth0
oob_tcp_listen_mode listen_thread

* several network/sysctl settings (in ...many... combinations):
/sbin/sysctl -w net.core.somaxconn=20000
/sbin/sysctl -w net.core.netdev_max_backlog=200000
/sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
/sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
/sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
/sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
echo 20000500 > /proc/sys/fs/nr_open

* ulimit settings

Routing on the nodes: two private networks 10.203.0.0/22 eth0 and 10.203.40.0/22
ib0, both with their routes, no default route.
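A quick sanity check of that two-network routing could look like this (a sketch; the subnet values are taken from the description above):

```shell
# Hedged check: the routing table should contain both cluster subnets
# (10.203.0.0/22 on eth0, 10.203.40.0/22 on ib0) and no default route.
routes=$(ip route show 2>/dev/null || true)
echo "$routes" | grep -E '10\.203\.(0|40)\.0/22' || echo "expected subnet routes not found"
echo "$routes" | grep -q '^default' && echo "warning: unexpected default route" || true
```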

If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca
oob_base_verbose 8), it takes much longer until the error occurs: the job
starts on the nodes (producing some timesteps of output) but still fails
at some later point.

Any hints? PSM? Does some kernel limit need to be increased? Wrong network/routing
(which should not happen with --mca oob_tcp_if_include eth0)?

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/04/28922.php
