Really scratching my head over this one. The application doesn’t start running until all the daemons have been launched, so a daemon-launch failure hours into the run doesn’t seem possible at first glance. I’m wondering whether something else is going on that leads to a similar error. Does the application call MPI_Comm_spawn, for example? Or is it a script that eventually attempts to launch another job?
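If it helps, here is a rough sketch of one way to check for a second launch: scan the application sources and job scripts for spawn calls or launcher commands. The helper name `find_launch_sites` and the pattern list are my own assumptions for illustration, not anything Open MPI provides, and a text match only points at places worth reading, not at proof of a launch.

```python
import re
import sys

# Names that suggest a second launch happens after startup.
# MPI_Comm_spawn and MPI_Comm_spawn_multiple are standard MPI calls;
# mpirun, mpiexec and orterun are launcher commands shipped with Open MPI.
# Matching is case-insensitive so Fortran spellings are caught as well.
LAUNCH_PATTERN = re.compile(
    r"\b(MPI_Comm_spawn(?:_multiple)?|mpirun|mpiexec|orterun)\b",
    re.IGNORECASE,
)


def find_launch_sites(text):
    """Return (line_number, matched_name) pairs for each hit in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LAUNCH_PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: pass source files and job scripts as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for lineno, name in find_launch_sites(handle.read()):
                print(f"{path}:{lineno}: {name}")
```

Any hit inside the solver or a wrapper script would be a candidate for what triggers the daemon launch hours into the run.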

> On Jul 28, 2016, at 6:24 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:
>
> Cray CS400, RedHat 6.5, PBS Pro (but OpenMPI is built --without-tm), OpenMPI
> 1.8.8, ssh
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph Castain
> Sent: Thursday, July 28, 2016 4:07 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: EXTERNAL: Re: [OMPI users] Question on run-time error "ORTE was unable to reliably start"
>
> What kind of system was this on? ssh, slurm, ...?
>
>> On Jul 28, 2016, at 1:55 PM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:
>>
>> I am running cases that are starting just fine and running for a few hours,
>> then they die with a message that seems like a startup type of failure.
>> Message shown below. The message appears in standard output from rank 0
>> process. I'm assuming there is a failing card or port or something.
>>
>> What diagnostic flags can I add to mpirun to help shed light on the problem?
>>
>> What kinds of problems could cause this kind of message, which looks
>> start-up related, after the job has already been running many hours?
>>
>> Ed
>>
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on one or more
>>   nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
>>   configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are
>>   required (e.g., on Cray). Please check your configure cmd line and
>>   consider using one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a lack of
>>   common network interfaces and/or no route found between them. Please
>>   check network connectivity (including firewalls and network routing
>>   requirements).
>> --------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users