Stefan,
which version of Open MPI are you using?
When does the error occur? Is it before MPI_Init() completes?
Is it in the middle of the job? If so, are you sure no task invoked MPI_Abort()?
Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory) condition.
A possible explanation is that some tasks caused an OOM, and the OOM killer chose to kill orted instead of a.out.
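A quick way to check (assuming you can read the kernel log on the compute nodes; the paths below are Debian defaults and might differ on your setup):
dmesg -T | grep -i -E 'out of memory|oom'
grep -i oom /var/log/kern.log
If the OOM killer fired, you will see lines naming the killed process (orted, a.out, ...).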
If you cannot access your system logs, you can try a large number of nodes with one MPI task per node, then increase the number of tasks per node and see when the problem starts happening.
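For example (a minimal sketch; a.out stands in for your real binary, and the node count is simply whatever your allocation provides):
mpirun -npernode 1 ./a.out
mpirun -npernode 2 ./a.out
and so on, until the failure shows up.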
Of course, to be on the safe side, you can try
mpirun --mca oob_tcp_if_include eth0 ...
You can also try to run your application over TCP and see if it helps
(note the issue might simply be hidden, since TCP is much slower than native PSM):
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
or
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
/* feel free to replace vader with sm, if vader is not available on your
system */
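Note also (this depends on how your Open MPI was built, so take it as an assumption): if it was built with PSM support, the cm PML may still pick the PSM MTL even when you restrict the BTL list, so to make sure the run really goes over TCP you can additionally force the ob1 PML:
mpirun --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
Adding --mca btl_base_verbose 10 will show which BTL components are actually selected.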
Cheers,
Gilles
On 4/12/2016 4:37 PM, Stefan Friedel wrote:
Good Morning List,
we have a problem on our cluster with larger jobs (more than ~200 nodes):
almost every job ends with a message like:
###################
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:
hostname: stek346
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##########################
I found a similar question on the list from Emyr James (2015-10-01), but it has not been answered so far.
Cluster: dual Intel Xeon E5-2630 v3 (Haswell), Intel/QLogic TrueScale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/IPoIB, MPI messages
over PSM/IB, plus 1G Ethernet (management, PXE boot, ssh, Open MPI TCP network, etc.).
Jobs are started via a slurm sbatch script (mpirun --mca mtl psm ~/path/to/binary).
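A minimal sketch of such a batch script (node and task counts are made up; the real script was not posted):
#!/bin/bash
#SBATCH --nodes=350
# the tasks-per-node value below is an assumption
#SBATCH --ntasks-per-node=16
mpirun --mca mtl psm ~/path/to/binary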
Already tested:
*several MCA settings (in ...many... combinations; one way such parameters are passed is sketched after this list)
mtl_psm_connect_timeout 600
oob_tcp_keepalive_time 600
oob_tcp_if_include eth0
oob_tcp_listen_mode listen_thread
*several network/sysctl settings (in ...many... combinations)
/sbin/sysctl -w net.core.somaxconn=20000
/sbin/sysctl -w net.core.netdev_max_backlog=200000
/sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
/sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
/sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
/sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
echo 20000500 > /proc/sys/fs/nr_open
*ulimit stuff
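For reference, such MCA settings can be passed on the mpirun command line, e.g.
mpirun --mca oob_tcp_if_include eth0 --mca mtl_psm_connect_timeout 600 ~/path/to/binary
or collected in a per-user parameter file (a sketch; the exact mechanism used here may have differed):
# $HOME/.openmpi/mca-params.conf
oob_tcp_if_include = eth0
mtl_psm_connect_timeout = 600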
Routing on the nodes: two private networks, 10.203.0.0/22 on eth0 and 10.203.40.0/22 on ib0, each with its own route and no default route.
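For illustration, the routing table on a node then looks roughly like this (the source addresses below are made up):
$ ip route show
10.203.0.0/22 dev eth0  proto kernel  scope link  src 10.203.0.34
10.203.40.0/22 dev ib0  proto kernel  scope link  src 10.203.40.34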
If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca oob_base_verbose 8), it takes much longer until the error occurs: the job actually starts on the nodes (producing some timesteps of output) but fails at some later point.
Any hints? PSM? Some kernel parameter that must be increased? Wrong network/routing (which should not happen with --mca oob_tcp_if_include eth0)?
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de