Good morning, list,

we have a problem on our cluster with bigger jobs (> ~200 nodes): almost every job ends with a message like this:
################### Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  stek346

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##########################

I found a similar question on the list from Emyr James (2015-10-01), but it has gone unanswered so far.

Cluster: dual Intel Xeon E5-2630 v3 (Haswell), Intel/QLogic TrueScale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; home directories mounted via NFS/RDMA over
IPoIB; MPI messages over PSM/IB, plus 1G Ethernet for management, PXE boot,
ssh, the Open MPI TCP network, etc.

Jobs are started via a slurm sbatch script (mpirun --mca mtl psm
~/path/to/binary); a minimal sketch follows at the end of this mail.

Already tested:

* several MCA settings, in ...many... combinations (example invocation below):

    mtl_psm_connect_timeout 600
    oob_tcp_keepalive_time 600
    oob_tcp_if_include eth0
    oob_tcp_listen_mode listen_thread

* several network/sysctl settings, in ...many... combinations:

    /sbin/sysctl -w net.core.somaxconn=20000
    /sbin/sysctl -w net.core.netdev_max_backlog=200000
    /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
    /sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
    /sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
    /sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
    /sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
    /sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
    echo 20000500 > /proc/sys/fs/nr_open

* ulimit stuff (illustrative limits below)

Routing on the nodes: two private networks, 10.203.0.0/22 on eth0 and
10.203.40.0/22 on ib0, each with its own route and no default route.

If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca
oob_base_verbose 8), it takes much longer until the error occurs and the
job actually starts on the nodes (producing some timesteps of output), but
it still fails at some later point.

Any hint? PSM? Some kernel limit that must be increased? Wrong
network/routing (which should not happen with --mca oob_tcp_if_include
eth0)?
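To make the setup more concrete, a few snippets. First, a minimal version
of our batch script; node/task counts and the binary path are placeholders:

    #!/bin/bash
    #SBATCH --nodes=350              # placeholder; failures start at roughly >200 nodes
    #SBATCH --ntasks-per-node=16     # dual 8-core E5-2630 v3
    #SBATCH --time=01:00:00

    # MPI traffic is forced onto PSM/IB; ORTE's out-of-band (oob)
    # wireup still runs over TCP on the 1G Ethernet (eth0).
    mpirun --mca mtl psm ~/path/to/binary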
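One example of how an MCA combination was passed on the mpirun command
line (just one of the many combinations tried; the same parameters can
equivalently go into $HOME/.openmpi/mca-params.conf):

    mpirun --mca mtl psm \
           --mca mtl_psm_connect_timeout 600 \
           --mca oob_tcp_keepalive_time 600 \
           --mca oob_tcp_if_include eth0 \
           ~/path/to/binary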
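The "ulimit stuff" was roughly along these lines; the values below are
illustrative only (nofile and memlock being the usual suspects for MPI):

    # /etc/security/limits.d/mpi.conf - illustrative values
    *  soft  nofile   1048576
    *  hard  nofile   1048576
    *  soft  memlock  unlimited
    *  hard  memlock  unlimited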
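The routing table on a node looks roughly like this (host addresses are
made-up examples; the point is the two link routes and no default route):

    $ ip route show
    10.203.0.0/22  dev eth0  proto kernel  scope link  src 10.203.1.34
    10.203.40.0/22 dev ib0   proto kernel  scope link  src 10.203.41.34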
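And the debug run mentioned above was started like this (the log file name
is arbitrary):

    mpirun --mca mtl psm \
           --mca oob_tcp_debug 5 \
           --mca oob_base_verbose 8 \
           ~/path/to/binary 2>&1 | tee oob-debug.log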
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de