Stefan,

Which version of Open MPI are you using?

When does the error occur?
Is it before MPI_Init() completes?
Is it in the middle of the job? If so, are you sure no task invoked MPI_Abort()?

Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory) event. A possible explanation is that some tasks caused an OOM condition and the OOM killer chose to kill orted instead of a.out.
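If the system log is accessible, a quick check for OOM-killer activity could look like this (a sketch only: the exact message wording varies between kernel versions, so the patterns below are a best guess):

```shell
# Hedged sketch: scan the kernel log for signs of the OOM killer.
# Message formats differ across kernels; adjust the patterns as needed.
OOM_PATTERN='out of memory|oom-killer|killed process'
dmesg 2>/dev/null | grep -i -E "$OOM_PATTERN" || echo "no OOM messages found"
```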

If you cannot access your system log, you can try running with a large number of nodes and one MPI task per node, then increase the number of tasks per node and see when the problem starts happening.
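That scale-up test could be scripted roughly like this (a sketch: `./a.out` stands in for the real binary, and the `ppr` mapping syntax assumes Open MPI 1.8 or newer):

```shell
# Sketch: run the job with an increasing number of tasks per node and
# stop at the first failure, to find the per-node density where it breaks.
scale_test() {
    for ppn in 1 2 4 8 16; do
        echo "=== ${ppn} task(s) per node ==="
        mpirun --map-by ppr:${ppn}:node ./a.out || return 1
    done
}
```

Calling `scale_test` inside the allocation then shows at which per-node task count the failure first appears.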

Of course, to be on the safe side, you can try
mpirun --mca oob_tcp_if_include eth0 ...

You can also try running your application over TCP and see if it helps
(note that the issue might stay hidden, since TCP is much slower than native PSM):

mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
or
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...

/* feel free to replace vader with sm if vader is not available on your system */

Cheers,

Gilles

On 4/12/2016 4:37 PM, Stefan Friedel wrote:
Good Morning List,
we have a problem on our cluster with larger jobs (more than ~200 nodes):
almost every job ends with a message like:

###################
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

 hostname:  stek346

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##########################

I found a similar question on the list from Emyr James (2015-10-01), but
it was never answered.

Cluster: Dual-Intel Xeon E5-2630 v3 Haswell, Intel/Qlogic Truescale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi messages
over psm/IB + 1G Ethernet (Mgmt, pxe boot, ssh, openmpi tcp network etc.)

Jobs are started via slurm sbatch/script (mpirun --mca mtl psm ~/path/to/binary)

Already tested:
* several MCA settings (in ...many... combinations):
mtl_psm_connect_timeout 600
oob_tcp_keepalive_time 600
oob_tcp_if_include eth0
oob_tcp_listen_mode listen_thread

* several network/sysctl settings (in ...many... combinations):
/sbin/sysctl -w net.core.somaxconn=20000
/sbin/sysctl -w net.core.netdev_max_backlog=200000
/sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
/sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
/sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
/sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
echo 20000500 > /proc/sys/fs/nr_open

* ulimit settings

Routing on the nodes: two private networks 10.203.0.0/22 eth0 and 10.203.40.0/22
ib0, both with their routes, no default route.
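A quick sanity check of that two-network routing could look like this (a sketch; the subnet values are taken from the description above):

```shell
# Hedged check: the routing table should contain both cluster subnets
# (10.203.0.0/22 on eth0, 10.203.40.0/22 on ib0) and no default route.
routes=$(ip route show 2>/dev/null || true)
echo "$routes" | grep -E '10\.203\.(0|40)\.0/22' || echo "expected subnet routes not found"
echo "$routes" | grep -q '^default' && echo "warning: unexpected default route" || true
```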

If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca
oob_base_verbose 8), it takes much longer until the error occurs: the job
starts on the nodes (producing some timesteps of output) but still fails
at some later point.

Any hints? PSM? Does some kernel limit need to be increased? Wrong network/routing
(which should not happen with --mca oob_tcp_if_include eth0)?

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/04/28922.php
