I had another observation of the problem, with a little more insight. I can confirm that the job had been running for several hours before dying with the 'ORTE was unable to reliably start' message. Somehow that is possible.

I had added the following options to try to get some more diagnostics:

  --output-filename mpirun-stdio -mca btl ^tcp --mca plm_base_verbose 10 --mca btl_base_verbose 30
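In full, the launch line was something like this (the hostfile and executable names below are placeholders rather than the real ones, and the solver's own arguments are left off):

  mpirun -np 480 --hostfile hosts.txt \
         --output-filename mpirun-stdio \
         -mca btl ^tcp \
         --mca plm_base_verbose 10 \
         --mca btl_base_verbose 30 \
         ./solver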
In the stack traces, I saw that roughly half of the processes reported dying at an MPI_BARRIER() call. The rest had progressed further and were at an MPI_WAITALL call. The routine is implemented like this: every process posts non-blocking receives (IRECV), hits an MPI_BARRIER, then everybody posts non-blocking sends (ISEND), then MPI_WAITALL. This entire exchange happens twice in a row, sending different sets of variables. (A bare-bones sketch of the pattern is appended at the end of this message.)

The application is unstructured CFD, so any given process talks to 10 to 15 other processes, exchanging data across domain boundaries. There is a range of message sizes flying around, some as small as 500 bytes, others as large as 1 MB. I'm using 480 processes. I'm wondering if I'm kicking off too many of these non-blocking messages and some network resource is getting exhausted, and that perhaps orted does some kind of 'ping' to make sure everyone is still alive, can't reach some process, and so reports an error that suggests a startup problem. Wild guesses, no idea really.

For what it's worth, the barrier wasn't in an earlier implementation of this routine. I was seeing some jobs die suddenly with MxM library errors; I put the barrier in, and those problems seemed to go away. So it just got committed and forgotten a couple of years ago. I thought (and still think) the code is correct without the barrier. Also, I am running under MVAPICH at the moment and have not seen the same problems yet.

Finally, using the exact same model and application, I had a failure that left a different message:

--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  k2n01

This is usually due to either a failure of the TCP network connection
to the node, or possibly an internal failure of the daemon itself. We
cannot recover from this failure, and therefore will terminate the job.
--------------------------------------------------------------------------

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, July 29, 2016 7:38 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] EXTERNAL: Re: Question on run-time error "ORTE was unable to reliably start"

Really scratching my head over this one. The app won’t start running until after all the daemons have been launched, so this doesn’t seem possible at first glance.

I’m wondering if something else is going on that might lead to a similar error? Does the application call comm_spawn, for example? Or is it a script that eventually attempts to launch another job?

> On Jul 28, 2016, at 6:24 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:
>
> Cray CS400, RedHat 6.5, PBS Pro (but OpenMPI is built --without-tm), OpenMPI 1.8.8, ssh
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph Castain
> Sent: Thursday, July 28, 2016 4:07 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: EXTERNAL: Re: [OMPI users] Question on run-time error "ORTE was unable to reliably start"
>
> What kind of system was this on? ssh, slurm, ...?
>
>
>> On Jul 28, 2016, at 1:55 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:
>>
>> I am running cases that are starting just fine and running for a few hours,
>> then they die with a message that seems like a startup type of failure.
>> Message shown below. The message appears in standard output from the rank 0
>> process.
>> I'm assuming there is a failing card or port or something.
>>
>> What diagnostic flags can I add to mpirun to help shed light on the problem?
>>
>> What kinds of problems could cause this kind of message, which looks
>> start-up related, after the job has already been running many hours?
>>
>> Ed
>>
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on one or more
>>   nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
>>   configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are
>>   required (e.g., on Cray). Please check your configure cmd line and
>>   consider using one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a lack
>>   of common network interfaces and/or no route found between them.
>>   Please check network connectivity (including firewalls and network
>>   routing requirements).
>> --------------------------------------------------------------------------
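P.S. Since the exchange pattern is central to this, here is a bare-bones sketch of it in C. Everything in it (names, datatypes, buffer sizes, neighbor lists) is placeholder illustration rather than the actual solver code; it just shows the post-receives / barrier / post-sends / waitall shape I described above.

/* Minimal sketch of the halo-exchange pattern described above:
 * post all IRECVs, hit a barrier, post all ISENDs, then WAITALL.
 * Neighbor lists, counts, and buffers are placeholders. */
#include <mpi.h>
#include <stdlib.h>

void exchange(double **sendbuf, double **recvbuf, int *counts,
              int *neighbors, int nneigh, MPI_Comm comm)
{
    /* One receive and one send per neighbor (typically 10-15 of them). */
    MPI_Request *reqs = malloc(2 * nneigh * sizeof(MPI_Request));

    /* Post all non-blocking receives first. */
    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE, neighbors[i],
                  0, comm, &reqs[i]);

    /* The barrier that was added to work around the MxM failures;
       the exchange should be correct without it. */
    MPI_Barrier(comm);

    /* Then post all non-blocking sends. */
    for (int i = 0; i < nneigh; i++)
        MPI_Isend(sendbuf[i], counts[i], MPI_DOUBLE, neighbors[i],
                  0, comm, &reqs[nneigh + i]);

    /* Wait for everything; the hung ranks were split between the
       barrier above and this waitall. */
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);

    free(reqs);
}

/* Called twice in a row per step, with different sets of variables:
 *   exchange(send_a, recv_a, counts, neighbors, nneigh, MPI_COMM_WORLD);
 *   exchange(send_b, recv_b, counts, neighbors, nneigh, MPI_COMM_WORLD);
 */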