Sadly, no. There was some possibility of using a file monitor we had for a while, but that isn’t in the 1.6 series. So I fear your best bet is to have the code periodically output some kind of marker, and run a separate process that checks whether the marker is still being updated. Either way would require modifying code, and that seems to be outside the desired scope of the solution.
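For what it's worth, here is a minimal sketch of that marker-plus-monitor idea in Python. It is not an Open MPI feature. The function names, the marker file and the timeout are made up for illustration. It assumes the application rewrites or touches the marker file at every time step, and that terminating the launcher process is enough to bring down all the ranks.

```python
import os
import subprocess
import time


def heartbeat_age(path, started):
    """Seconds since the marker file was last updated.

    A missing file, or one left over from an earlier run, is
    treated as if it had been written at launch time.
    """
    try:
        last = os.path.getmtime(path)
    except OSError:
        last = started
    return time.time() - max(last, started)


def run_with_watchdog(cmd, heartbeat, timeout, poll=1.0):
    """Run cmd and stop it if the marker goes stale.

    Returns ("finished", returncode) when the command exits on
    its own, or ("hung", None) when the marker file was not
    updated for more than `timeout` seconds and the command
    had to be terminated.
    """
    started = time.time()
    proc = subprocess.Popen(cmd)
    while True:
        code = proc.poll()
        if code is not None:
            return ("finished", code)
        if heartbeat_age(heartbeat, started) > timeout:
            proc.terminate()          # polite request first
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()           # then force it
                proc.wait()
            return ("hung", None)
        time.sleep(poll)
```

One could call something like run_with_watchdog(["mpirun", "-np", "4", "./my_app"], "heartbeat.txt", timeout=600) in a loop (./my_app and heartbeat.txt being placeholders) and restart from the last saved configuration whenever it reports "hung". The application still has to be changed to write the marker, so this does not get around the code modification.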
Afraid I don’t know how to accomplish what you seek without code modification.

> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>
> Dear Dr. Castain,
>
> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other info
> which would be helpful? Partial output follows.
>
> Thanks,
> Alex
>
> -bash-4.1$ ompi_info
> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
> Open MPI: 1.6.5
> ...
> C compiler family name: GNU
> C compiler version: 4.8.2
>
> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *****************************
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>     ... march in time ...
>     time_step = time_step + 1
> end
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> ************************************************
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
>
> If a job fails in the middle, you don't lose all work,
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/ queue systems can also email you in
> case the job fails, warning you to do manual intervention.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of job queues
> (in a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational Chemistry, etc, programs also do.
>
> I hope this helps,
> Gus Correa
>
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
> Hello,
>
> I have an MPI code which sometimes hangs, simply stops running. It is
> not clear why and it uses many large third party libraries so I do not
> want to try to fix it. The code is easy to restart, but then it needs to
> be monitored closely by me, and I'd prefer to do it automatically.
>
> Is there a way, within an MPI process, to detect the hang and abort if so?
>
> In pseudocode, I would like to do something like
>
> for (all time steps)
>     step
>     if (nothing has happened for x minutes)
>         call mpi abort to return control to the shell
>     endif
> endfor
>
> This structure does not work however, as it can no longer do anything,
> including check itself, when it is stuck.
>
> Thank you,
> Alex
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29471.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29474.php