Hi,

This is very puzzling ...


Is your application using MPI_Comm_spawn and friends?

If not, is orted on node k2n01 *really* dead? Or does the head node incorrectly believe orted died?
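
One quick way to check, assuming you can still ssh into the node, is something like:

    ssh k2n01 pgrep -lf orted

If that prints nothing, the daemon really is gone.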


You might want to add the following configuration to your ~/.ssh/config:

    TCPKeepAlive=yes
    ServerAliveInterval=60


You might also want to use a lower value for the kernel parameter net.ipv4.tcp_keepalive_time (the default is 7200 seconds).
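
For example (600 is just an illustrative value; sysctl -w only lasts until reboot, so also add the same setting to /etc/sysctl.conf if you want it to persist):

    sysctl -w net.ipv4.tcp_keepalive_time=600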


Also, which interconnect are you using?

If mxm is available on your system, it will be used. If you do not want to use mxm, then you can run:

    mpirun --mca pml ob1



Also, did you run dmesg on k2n01? A common and hard-to-troubleshoot issue is that the oom-killer killed orted (!)
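
For example, something like:

    ssh k2n01 dmesg | grep -i -E 'out of memory|oom|killed process'

should show traces on that node if the oom-killer was involved.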


Cheers,


Gilles

On 8/12/2016 9:33 AM, Blosch, Edwin L wrote:
I had another observation of the problem, with a little more insight.  I can
confirm that the job had been running several hours before dying with the 'ORTE
was unable to reliably start' message.  Somehow it is possible.  I had used
the following options to try to get some more diagnostics:

    --output-filename mpirun-stdio -mca btl ^tcp --mca plm_base_verbose 10 --mca btl_base_verbose 30

In the stack traces of each process, I saw roughly half of them reported dying 
at an MPI_BARRIER() call.  The rest had progressed further, and they were at an 
MPI_WAITALL command.  It is implemented like this:  Every process posts 
non-blocking receives (IRECV), hits an MPI_BARRIER, then everybody posts 
non-blocking sends (ISEND), then MPI_WAITALL.  This entire exchange process 
happens twice in a row, sending different sets of variables.  The application 
type is unstructured CFD, so any given process is talking to 10 to 15 other 
processes exchanging data across domain boundaries.  There are a range of 
message sizes flying around, some as small as 500 bytes, others as large as 1 
MB.  I'm using 480 processes.
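
In case it's useful, the pattern described above is roughly the following (a minimal C sketch; the names exchange, nbr, recvbuf, etc. are made up for illustration, and the real routine runs this sequence twice with different sets of variables):

#include <mpi.h>

/* Illustrative sketch only: post all non-blocking receives, hit a barrier,
 * post all non-blocking sends, then wait on everything. Names and types
 * are hypothetical, not taken from the actual application. */
void exchange(MPI_Comm comm, int nnbr, const int *nbr,
              double **recvbuf, const int *recvcnt,
              double **sendbuf, const int *sendcnt)
{
    MPI_Request req[2 * nnbr];      /* nnbr is roughly 10-15 neighbors here */
    int i;

    for (i = 0; i < nnbr; i++)      /* every process posts its receives first */
        MPI_Irecv(recvbuf[i], recvcnt[i], MPI_DOUBLE, nbr[i], 0, comm,
                  &req[i]);

    MPI_Barrier(comm);              /* the barrier added after the MxM errors */

    for (i = 0; i < nnbr; i++)      /* then the non-blocking sends */
        MPI_Isend(sendbuf[i], sendcnt[i], MPI_DOUBLE, nbr[i], 0, comm,
                  &req[nnbr + i]);

    MPI_Waitall(2 * nnbr, req, MPI_STATUSES_IGNORE);   /* wait on all requests */
}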

I'm wondering if I'm kicking off too many of these non-blocking messages and 
some network resource is getting exhausted.  Perhaps orted does some kind of 
'ping' to make sure everyone is still alive, can't reach some process, and so 
the error suggests a startup problem.  Wild guesses, no idea really.

For what it's worth, the barrier wasn't in an earlier implementation of this 
routine.  I was seeing some jobs dying suddenly with MxM library errors, and I 
put this barrier in place, and those problems seemed to go away.  So it just 
got committed and forgotten a couple years ago.  I thought (still think) the 
code is correct without the barrier.

Also, I am running under MVAPICH at the moment and not having the same problems 
yet.

Finally, using the same exact model and application, I had a failure that left 
a different message:
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

   hostname:  k2n01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------



-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, July 29, 2016 7:38 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] EXTERNAL: Re: Question on run-time error "ORTE was unable 
to reliably start"

Really scratching my head over this one. The app won’t start running until 
after all the daemons have been launched, so this doesn’t seem possible at 
first glance. I’m wondering if something else is going on that might lead to a 
similar error? Does the application call comm_spawn, for example? Or is it a 
script that eventually attempts to launch another job?


On Jul 28, 2016, at 6:24 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:

Cray CS400, RedHat 6.5, PBS Pro (but OpenMPI is built --without-tm),
OpenMPI 1.8.8, ssh

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
Ralph Castain
Sent: Thursday, July 28, 2016 4:07 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: EXTERNAL: Re: [OMPI users] Question on run-time error "ORTE was unable to 
reliably start"

What kind of system was this on? ssh, slurm, ...?


On Jul 28, 2016, at 1:55 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:

I am running cases that are starting just fine and running for a few hours, 
then they die with a message that seems like a startup type of failure.  
Message shown below.  The message appears in standard output from rank 0 
process.  I'm assuming there is a failing card or port or something.

What diagnostic flags can I add to mpirun to help shed light on the problem?

What kinds of problems could cause this kind of message, which looks start-up 
related, after the job has already been running many hours?

Ed

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
This usually is caused by:

* not finding the required libraries and/or binaries on one or more
  nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
  configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are
  required (e.g., on Cray). Please check your configure cmd line and
  consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack
  of common network interfaces and/or no route found between them.
  Please check network connectivity (including firewalls and network
  routing requirements).
--------------------------------------------------------------------------