Hello Alex,

At LLNL, we use io-watchdog for this kind of capability. https://github.com/grondo/io-watchdog
It's a library that you LD_PRELOAD, and it intercepts write calls on a particular rank. Whenever rank 0 issues a write() call, it updates a timer value that is also accessed by a watchdog thread. If the thread finds that the last write happened beyond some user-defined time limit, it invokes a series of actions defined by the user. The common actions we use are 1) collect a stack trace using STAT, 2) email the user, and 3) kill the job. This automates the process of restarting a job once it hangs. (A minimal sketch of this interception pattern is appended at the end of this thread.)

We've integrated io-watchdog into SLURM, which makes it very easy to use.

-Adam

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Cihan Altinay [c.alti...@uq.edu.au]
Sent: Saturday, June 18, 2016 1:26 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Restart after code hangs

How about periodically sending a 'ping' to a socket that is monitored by an auxiliary program running where the master process runs?

Also, I know you don't want to delve into the third-party libs, but have you actually tried to get to the bottom of the hang? E.g. run strace, attach a debugger, or, if you have the Intel tools available, run the MPI profiling tool or similar. Maybe it's something more fundamental?!

Good luck,
Cihan

On 18/06/16 01:58, Alex Kaiser wrote:
> An outside monitor should work. My outline of the monitor script (with
> advice from the sys admin) has opportunities for bugs with environment
> variables and such.
>
> I wanted to make sure there was not a simpler solution, or one that is
> less error-prone. Modifying the main routine that calls the library, or
> adding external scripts, is no problem; I only meant that I did not want
> to debug the library internals, which are huge and complicated!
>
> I appreciate the advice. Thank you,
> Alex
>
> On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>
> Sadly, no - there was some possibility of using a file monitor we
> had for a while, but that isn't in the 1.6 series. So I fear your
> best bet is to periodically output some kind of marker, and have a
> separate process that monitors to see if it is being updated (see
> the marker/monitor sketch appended at the end of this thread).
> Either way would require modifying code, and that seems to be
> outside the desired scope of the solution.
>
> Afraid I don't know how to accomplish what you seek without code
> modification.
>
>> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>>
>> Dear Dr. Castain,
>>
>> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any
>> other info which would be helpful? Partial output follows.
>>
>> Thanks,
>> Alex
>>
>> -bash-4.1$ ompi_info
>> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
>> Open MPI: 1.6.5
>> ...
>> C compiler family name: GNU
>> C compiler version: 4.8.2
>>
>> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>> Hi Alex
>>
>> You know all this, but just in case ...
>>
>> Restartable code goes like this:
>>
>> *****************************
>> .start
>>
>> read the initial/previous configuration from a file
>> ...
>> final_step = first_step + nsteps
>> time_step = first_step
>> while ( time_step .le. final_step )
>>     ... march in time ...
>>     time_step = time_step + 1
>> end
>>
>> save the final_step configuration (or phase space) to a file
>> [depending on the algorithm you may need to save the
>> previous config also, or perhaps a few more]
>>
>> .end
>> ************************************************
>>
>> Then restart the job time and again, until the desired number of
>> time steps is completed.
>>
>> Job queue systems/resource managers allow a job to resubmit itself,
>> so unless a job fails it feels like a single time integration.
>>
>> If a job fails in the middle, you don't lose all the work; you just
>> restart from the previously saved configuration. That is the only
>> situation where you need to "monitor" the code. Resource
>> managers/queue systems can also email you in case the job fails,
>> warning you to intervene manually.
>>
>> The time granularity per job (nsteps) is up to you. Normally it is
>> adjusted to the max walltime of the job queues (on a shared
>> computer/cluster), but in your case it can be adjusted to how often
>> the program fails.
>>
>> All atmosphere/ocean/climate/weather-forecast models work this way
>> (that's what we mostly run here). I guess most CFD, computational
>> chemistry, etc. programs do as well.
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>>
>>     Hello,
>>
>>     I have an MPI code which sometimes hangs; it simply stops
>>     running. It is not clear why, and it uses many large third-party
>>     libraries, so I do not want to try to fix it. The code is easy
>>     to restart, but then it needs to be monitored closely by me, and
>>     I'd prefer to do it automatically.
>>
>>     Is there a way, within an MPI process, to detect the hang and
>>     abort if so?
>>
>>     In pseudocode, I would like to do something like
>>
>>     for (all time steps)
>>         step
>>         if (nothing has happened for x minutes)
>>             call MPI_Abort to return control to the shell
>>         endif
>>     endfor
>>
>>     This structure does not work, however, because once the process
>>     is stuck it can no longer do anything, including check on itself.
>>
>>     Thank you,
>>     Alex
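________________________________________

Appended sketch 1: for readers unfamiliar with the interception approach Adam describes, below is a minimal, hypothetical LD_PRELOAD shim in C. It only illustrates the pattern (refresh a timestamp on every write(), and let a watchdog thread act when the timestamp goes stale); it is not io-watchdog's actual implementation. The rank-0-only logic, STAT stack traces, and email actions are omitted, and the names WATCHDOG_TIMEOUT and libwatchdog_shim are made up for this example.

/* watchdog_shim.c -- illustrative sketch only; NOT the io-watchdog code.
 * Build: gcc -shared -fPIC -o libwatchdog_shim.so watchdog_shim.c -ldl -lpthread
 * Run:   LD_PRELOAD=./libwatchdog_shim.so WATCHDOG_TIMEOUT=600 ./my_mpi_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int, const void *, size_t);

static write_fn real_write;             /* the libc write(), resolved via dlsym */
static volatile time_t last_write;      /* refreshed by write(), read by the watchdog thread */
static long timeout = 600;              /* seconds; overridden by WATCHDOG_TIMEOUT (made-up name) */

static void *watchdog(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(30);                      /* poll interval */
        if (time(NULL) - last_write > timeout) {
            fprintf(stderr, "watchdog: no write() for %ld s, killing process\n", timeout);
            /* io-watchdog would run user-configured actions here (STAT, email, ...);
             * the simplest "kill the job" stand-in is to kill this process and let
             * the MPI runtime / batch system clean up the rest. */
            kill(getpid(), SIGKILL);
        }
    }
    return NULL;
}

__attribute__((constructor)) static void shim_init(void)
{
    const char *env = getenv("WATCHDOG_TIMEOUT");
    pthread_t tid;

    real_write = (write_fn)dlsym(RTLD_NEXT, "write");
    last_write = time(NULL);
    if (env) timeout = atol(env);
    pthread_create(&tid, NULL, watchdog, NULL);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)                    /* in case write() is called before the constructor */
        real_write = (write_fn)dlsym(RTLD_NEXT, "write");
    last_write = time(NULL);            /* every write refreshes the deadline */
    return real_write(fd, buf, count);
}

In practice you would, as Adam says, use io-watchdog itself (and its SLURM integration) rather than rolling your own; the sketch is only meant to show why no change to the application source is needed.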
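________________________________________

Appended sketch 2: a minimal sketch of the "periodically output a marker, and have a separate process check that it is being updated" approach that Ralph suggests, which fits naturally into Gus's restart loop. Everything concrete here is an assumption for illustration: the file name heartbeat.marker, the hypothetical touch_heartbeat helper, the 600-second timeout, and the scancel command (which presumes a SLURM site; substitute whatever kills a job on your cluster).

/* ---- Part 1: inside the MPI application, rank 0 touches a marker file each step ---- */
#include <stdio.h>
#include <utime.h>

/* call once per time step; rank comes from MPI_Comm_rank */
static void touch_heartbeat(int rank, const char *path)   /* hypothetical helper */
{
    if (rank != 0) return;
    FILE *f = fopen(path, "a");   /* create the file if it does not exist yet */
    if (f) fclose(f);
    utime(path, NULL);            /* bump the modification time to "now" */
}

/* in the existing loop:
 *     while ( time_step .le. final_step )
 *         ... march in time ...
 *         touch_heartbeat(rank, "heartbeat.marker");
 *         time_step = time_step + 1
 *     end
 */

/* ---- Part 2: monitor.c, a separate program started from the job script ---- */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "heartbeat.marker";
    long timeout = (argc > 2) ? atol(argv[2]) : 600;   /* seconds without progress */
    struct stat st;

    for (;;) {
        sleep(60);
        if (stat(path, &st) == 0 && time(NULL) - st.st_mtime > timeout) {
            fprintf(stderr, "monitor: no heartbeat for %ld s, cancelling job\n", timeout);
            system("scancel $SLURM_JOB_ID");   /* site-specific: could also kill the mpirun PID */
            return 1;
        }
    }
}

The job script runs the monitor in the background alongside mpirun and, following Gus's pattern, resubmits itself, so a hung run that the monitor kills simply restarts from the last saved configuration.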