An outside monitor should work. My outline of the monitor script (with advice from the sys admin) has opportunities for bugs with environment variables and such.
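A minimal sketch of what such an outside monitor could look like, purely for illustration (the heartbeat file name, the 600-second stale threshold, and passing mpirun's PID on the command line are all assumptions, not details from this thread):

/* watchdog.c: sketch of an external monitor (illustrative only).
 * It assumes the MPI job touches a heartbeat file once per time step;
 * if the file's mtime goes stale, the watchdog sends SIGTERM to mpirun
 * so the batch script can restart the run. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s heartbeat_file mpirun_pid stale_seconds\n",
                argv[0]);
        return 1;
    }
    const char *hb_file = argv[1];
    pid_t mpirun_pid    = (pid_t) atol(argv[2]);
    long  stale_seconds = atol(argv[3]);

    for (;;) {
        struct stat st;
        /* If the file is missing, the job may not have written it yet; keep waiting. */
        if (stat(hb_file, &st) == 0 &&
            time(NULL) - st.st_mtime > stale_seconds) {
            fprintf(stderr, "heartbeat stale for over %ld s, killing mpirun (pid %ld)\n",
                    stale_seconds, (long) mpirun_pid);
            kill(mpirun_pid, SIGTERM);   /* mpirun tears down its ranks */
            return 0;
        }
        /* Signal 0 only checks existence: stop once mpirun has exited on its own. */
        if (kill(mpirun_pid, 0) != 0)
            return 0;
        sleep(30);                       /* polling interval, also an assumption */
    }
}

It could be started from the same batch script, for example by launching mpirun in the background, capturing its PID with $!, and handing that PID to the watchdog.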
I wanted to make sure there was not a simpler solution, or one that is less
error-prone. Modifying the main routine which calls the library or external
scripts is no problem; I only meant that I did not want to debug the library
internals, which are huge and complicated! Appreciate the advice.

Thank you,
Alex

On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Sadly, no - there was some possibility of using a file monitor we had for
> a while, but that isn't in the 1.6 series. So I fear your best bet is to
> periodically output some kind of marker, and have a separate process that
> monitors to see if it is being updated. Either way would require modifying
> code, and that seems to be outside the desired scope of the solution.
>
> Afraid I don't know how to accomplish what you seek without code
> modification.
>
> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>
> Dear Dr. Castain,
>
> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
> info which would be helpful? Partial output follows.
>
> Thanks,
> Alex
>
> -bash-4.1$ ompi_info
> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
> Open MPI: 1.6.5
> ...
> C compiler family name: GNU
> C compiler version: 4.8.2
>
>
> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Alex
>>
>> You know all this, but just in case ...
>>
>> Restartable code goes like this:
>>
>> *****************************
>> .start
>>
>> read the initial/previous configuration from a file
>> ...
>> final_step = first_step + nsteps
>> time_step = first_step
>> while ( time_step .le. final_step )
>>     ... march in time ...
>>     time_step = time_step + 1
>> end
>>
>> save the final_step configuration (or phase space) to a file
>> [depending on the algorithm you may need to save the
>> previous config also, or perhaps a few more]
>>
>> .end
>> ************************************************
>>
>> Then restart the job time and again, until the desired
>> number of time steps is completed.
>>
>> Job queue systems/resource managers allow a job to resubmit itself,
>> so unless a job fails it feels like a single time integration.
>>
>> If a job fails in the middle, you don't lose all the work;
>> just restart from the previous saved configuration.
>> That is the only situation where you need to "monitor" the code.
>> Resource managers/queue systems can also email you in
>> case the job fails, warning you to do manual intervention.
>>
>> The time granularity per job (nsteps) is up to you.
>> Normally it is adjusted to the max walltime of the job queues
>> (in a shared computer/cluster),
>> but in your case it can be adjusted to how often the program fails.
>>
>> All atmosphere/ocean/climate/weather_forecast models work
>> this way (that's what we mostly run here).
>> I guess most CFD, computational chemistry, etc., programs also do.
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>>
>>> Hello,
>>>
>>> I have an MPI code which sometimes hangs; it simply stops running. It is
>>> not clear why, and it uses many large third-party libraries, so I do not
>>> want to try to fix it. The code is easy to restart, but then it needs to
>>> be monitored closely by me, and I'd prefer to do it automatically.
>>>
>>> Is there a way, within an MPI process, to detect the hang and abort if
>>> so?
>>>
>>> In pseudocode, I would like to do something like
>>>
>>> for (all time steps)
>>>     step
>>>     if (nothing has happened for x minutes)
>>>         call mpi abort to return control to the shell
>>>     endif
>>> endfor
>>>
>>> This structure does not work, however, as it can no longer do anything,
>>> including check itself, when it is stuck.
>>>
>>> Thank you,
>>> Alex
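For the marker-writing side of the approach Ralph suggests above, the only change on the application side is to touch the heartbeat file once per time step from inside the existing loop. A sketch under the same assumptions as the watchdog earlier in the thread (the file name heartbeat.dat is hypothetical, and advance_one_step() stands in for the existing call into the third-party library):

/* Sketch of the heartbeat write, added to the code that calls the library.
 * Same hypothetical file name as the watchdog above; the library internals
 * are untouched. */
#include <mpi.h>
#include <stdio.h>
#include <utime.h>

/* Touch the heartbeat file so the external monitor sees a fresh mtime. */
static void touch_heartbeat(const char *path)
{
    FILE *f = fopen(path, "a");   /* creates the file on the first call */
    if (f) fclose(f);
    utime(path, NULL);            /* set the modification time to "now" */
}

void march_in_time(int first_step, int nsteps)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int time_step = first_step; time_step < first_step + nsteps; time_step++) {
        /* advance_one_step();  -- the existing call into the third-party library */
        if (rank == 0)
            touch_heartbeat("heartbeat.dat");   /* one rank is enough */
    }
}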