Good morning, list,

we have a problem on our cluster with bigger jobs (> ~200 nodes): almost every job ends with a message like this:
################### Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  stek346

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##########################

I found a similar question on the list from Emyr James (2015-10-01), but it has gone unanswered so far.

Cluster: dual Intel Xeon E5-2630 v3 (Haswell), Intel/QLogic TrueScale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; home directories mounted via NFS/RDMA over
IPoIB; MPI messages over PSM/IB, plus 1G Ethernet for management, PXE boot,
ssh, the Open MPI TCP network, etc.

Jobs are started via a slurm sbatch script (mpirun --mca mtl psm
~/path/to/binary); a minimal sketch follows at the end of this mail.

Already tested:

* several MCA settings, in ...many... combinations (example invocation below):

    mtl_psm_connect_timeout 600
    oob_tcp_keepalive_time 600
    oob_tcp_if_include eth0
    oob_tcp_listen_mode listen_thread

* several network/sysctl settings, in ...many... combinations:

    /sbin/sysctl -w net.core.somaxconn=20000
    /sbin/sysctl -w net.core.netdev_max_backlog=200000
    /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
    /sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
    /sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
    /sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
    /sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
    /sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
    echo 20000500 > /proc/sys/fs/nr_open

* ulimit stuff (illustrative limits below)

Routing on the nodes: two private networks, 10.203.0.0/22 on eth0 and
10.203.40.0/22 on ib0, each with its own route and no default route.

If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca
oob_base_verbose 8), it takes much longer until the error occurs and the
job actually starts on the nodes (producing some timesteps of output), but
it still fails at some later point.

Any hint? PSM? Some kernel limit that must be increased? Wrong
network/routing (which should not happen with --mca oob_tcp_if_include
eth0)?
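To make the setup more concrete, a few snippets. First, a minimal version
of our batch script; node/task counts and the binary path are placeholders:

    #!/bin/bash
    #SBATCH --nodes=350              # placeholder; failures start at roughly >200 nodes
    #SBATCH --ntasks-per-node=16     # dual 8-core E5-2630 v3
    #SBATCH --time=01:00:00

    # MPI traffic is forced onto PSM/IB; ORTE's out-of-band (oob)
    # wireup still runs over TCP on the 1G Ethernet (eth0).
    mpirun --mca mtl psm ~/path/to/binary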
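One example of how an MCA combination was passed on the mpirun command
line (just one of the many combinations tried; the same parameters can
equivalently go into $HOME/.openmpi/mca-params.conf):

    mpirun --mca mtl psm \
           --mca mtl_psm_connect_timeout 600 \
           --mca oob_tcp_keepalive_time 600 \
           --mca oob_tcp_if_include eth0 \
           ~/path/to/binary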
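The "ulimit stuff" was roughly along these lines; the values below are
illustrative only (nofile and memlock being the usual suspects for MPI):

    # /etc/security/limits.d/mpi.conf - illustrative values
    *  soft  nofile   1048576
    *  hard  nofile   1048576
    *  soft  memlock  unlimited
    *  hard  memlock  unlimited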
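The routing table on a node looks roughly like this (host addresses are
made-up examples; the point is the two link routes and no default route):

    $ ip route show
    10.203.0.0/22  dev eth0  proto kernel  scope link  src 10.203.1.34
    10.203.40.0/22 dev ib0   proto kernel  scope link  src 10.203.41.34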
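And the debug run mentioned above was started like this (the log file name
is arbitrary):

    mpirun --mca mtl psm \
           --mca oob_tcp_debug 5 \
           --mca oob_base_verbose 8 \
           ~/path/to/binary 2>&1 | tee oob-debug.log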
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de