Hello Alex,
At LLNL, we use io-watchdog for this kind of capability.

https://github.com/grondo/io-watchdog

It's a library that you LD_PRELOAD, and it intercepts write() calls on a 
particular rank (typically rank 0).  Whenever that rank issues a write() call, 
it updates a timestamp that is also read by a watchdog thread.  If the thread 
finds that the last write happened longer ago than some user-defined time 
limit, it invokes a series of actions defined by the user.  The common actions 
we use are 1) collect a stack trace using STAT, 2) email the user, 3) kill the 
job.  This automates the process of restarting a job once it hangs.
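
The underlying mechanism is plain LD_PRELOAD interposition.  A stripped-down
sketch of the pattern (illustrative only; this is not io-watchdog's actual
code, and the 600-second timeout and file names are placeholders) looks
roughly like:

    /* watchdog_preload.c
     * Build: gcc -shared -fPIC -o watchdog_preload.so watchdog_preload.c -ldl -lpthread
     * Run:   LD_PRELOAD=./watchdog_preload.so ./your_app
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define TIMEOUT_SECONDS 600          /* placeholder for the user-defined limit */

    static volatile time_t last_write;   /* updated on every intercepted write() */
    static ssize_t (*real_write)(int, const void *, size_t);

    /* Background thread: if no write() has been seen for too long, act. */
    static void *watchdog(void *unused)
    {
        (void)unused;
        for (;;) {
            sleep(10);
            if (time(NULL) - last_write > TIMEOUT_SECONDS) {
                fprintf(stderr, "watchdog: no write() for %d s, aborting\n",
                        TIMEOUT_SECONDS);
                abort();                 /* io-watchdog runs the user's actions
                                            (STAT, email, kill, ...) instead */
            }
        }
        return NULL;
    }

    /* Interposed write(): record the time, then forward to the real write(). */
    ssize_t write(int fd, const void *buf, size_t count)
    {
        if (!real_write)
            real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");
        last_write = time(NULL);
        return real_write(fd, buf, count);
    }

    __attribute__((constructor))
    static void start_watchdog(void)
    {
        pthread_t tid;
        last_write = time(NULL);
        pthread_create(&tid, NULL, watchdog, NULL);
    }

io-watchdog wraps this idea with configuration of the timeout, the rank to
watch, and the list of actions to run.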

We've integrated io-watchdog into SLURM, which makes it very easy to use.
-Adam

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Cihan Altinay 
[c.alti...@uq.edu.au]
Sent: Saturday, June 18, 2016 1:26 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Restart after code hangs

How about periodically sending a 'ping' to a socket that is monitored
by an auxiliary program running on the same node as the master process?
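
Something along these lines, for example (just a sketch; the port and address
are placeholders): the master calls send_ping() once per time step, and the
auxiliary program sits in recvfrom() with a receive timeout (SO_RCVTIMEO) and
kills/restarts the job when no ping arrives within the limit.

    /* heartbeat.c: one-byte UDP "ping" to a local monitor (sketch only). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    static int ping_fd = -1;
    static struct sockaddr_in ping_addr;

    void send_ping(void)
    {
        if (ping_fd < 0) {
            ping_fd = socket(AF_INET, SOCK_DGRAM, 0);
            memset(&ping_addr, 0, sizeof(ping_addr));
            ping_addr.sin_family = AF_INET;
            ping_addr.sin_port = htons(5555);               /* arbitrary port */
            ping_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
        }
        sendto(ping_fd, "x", 1, 0,
               (struct sockaddr *)&ping_addr, sizeof(ping_addr));
    }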

Also, I know you don't want to delve into the third-party libs, but have
you actually tried to get to the bottom of the hang, e.g. by running strace,
attaching a debugger, or, if you have the Intel tools available, running
their MPI profiling tool or similar? Maybe it's something more fundamental?!

Good luck,
Cihan

On 18/06/16 01:58, Alex Kaiser wrote:
> An outside monitor should work. My outline of the monitor script (with
> advice from the sys admin) leaves room for bugs with environment
> variables and such.
>
> I wanted to make sure there was not a simpler solution, or one that is
> less error prone. Modifying the main routine which calls the library or
> external scripts is no problem, I only meant that I did not want to
> debug the library internals, which are huge and complicated!
>
> Appreciate the advice. Thank you,
> Alex
>
> On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org
> <mailto:r...@open-mpi.org>> wrote:
>
>     Sadly, no - there was some possibility of using a file monitor we
>     had for a while, but that isn’t in the 1.6 series. So I fear your
>     best bet is to periodically output some kind of marker, and have a
>     separate process that monitors to see if it is being updated. Either
>     way would require modifying code and that seems to be outside the
>     desired scope of the solution.
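>
>     For the monitoring side, something like this would do (a sketch only;
>     the marker file name, the timeout, and the idea of signalling the
>     launcher's pid are all placeholders), with the application simply
>     rewriting or touching the marker file once per step:
>
>         /* monitor.c: if the marker file goes stale, assume a hang. */
>         #include <signal.h>
>         #include <stdio.h>
>         #include <stdlib.h>
>         #include <sys/stat.h>
>         #include <sys/types.h>
>         #include <time.h>
>         #include <unistd.h>
>
>         int main(int argc, char **argv)
>         {
>             const char *marker = (argc > 1) ? argv[1] : "progress.marker";
>             pid_t job_pid = (argc > 2) ? (pid_t)atoi(argv[2]) : 0;
>             const time_t timeout = 600;      /* seconds without an update */
>
>             for (;;) {
>                 sleep(60);
>                 struct stat st;
>                 if (stat(marker, &st) == 0 &&
>                     time(NULL) - st.st_mtime > timeout) {
>                     fprintf(stderr, "monitor: %s is stale, killing job\n",
>                             marker);
>                     if (job_pid > 0)
>                         kill(job_pid, SIGTERM);  /* batch script resubmits */
>                     return 1;
>                 }
>             }
>         }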
>
>     Afraid I don’t know how to accomplish what you seek without code
>     modification.
>
>>     On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>>
>>     Dear Dr. Castain,
>>
>>     I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any
>>     other info which would be helpful? Partial output follows.
>>
>>     Thanks,
>>     Alex
>>
>>     -bash-4.1$ ompi_info
>>
>>         Package: Open MPI l...@soho.es.its.nyu.edu Distribution
>>         Open MPI: 1.6.5
>>         ...
>>         C compiler family name: GNU
>>         C compiler version: 4.8.2
>>
>>
>>     On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa
>>     <g...@ldeo.columbia.edu> wrote:
>>
>>         Hi Alex
>>
>>         You know all this, but just in case ...
>>
>>         Restartable code goes like this:
>>
>>         *****************************
>>         .start
>>
>>         read the initial/previous configuration from a file
>>         ...
>>         final_step = first_step + nsteps
>>         time_step = first_step
>>         while ( time_step .le. final_step )
>>           ... march in time ...
>>           time_step = time_step +1
>>         end
>>
>>         save the final_step configuration (or phase space) to a file
>>         [depending on the algorithm you may need to save the
>>         previous config also, or perhaps a few more]
>>
>>         .end
>>         ************************************************
>>
>>         Then restart the job time and again, until the desired
>>         number of time steps is completed.
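>>
>>         In C with MPI it could look something like this (schematic only;
>>         the checkpoint file name and format are made up):
>>
>>         /* restartable main loop: each job run does nsteps_per_job steps,
>>          * saves where it got to, and exits; the batch script resubmits. */
>>         #include <mpi.h>
>>         #include <stdio.h>
>>
>>         int main(int argc, char **argv)
>>         {
>>             MPI_Init(&argc, &argv);
>>
>>             int rank;
>>             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>             long first_step = 0;
>>             const long nsteps_per_job = 1000;
>>
>>             /* read the initial/previous configuration, if any */
>>             FILE *f = fopen("checkpoint.dat", "r");
>>             if (f) { fscanf(f, "%ld", &first_step); fclose(f); }
>>             /* ... also read the saved model state ... */
>>
>>             long final_step = first_step + nsteps_per_job;
>>             for (long step = first_step; step < final_step; step++) {
>>                 /* ... march in time ... */
>>             }
>>
>>             /* save the final configuration for the next job to pick up */
>>             if (rank == 0) {
>>                 f = fopen("checkpoint.dat", "w");
>>                 fprintf(f, "%ld\n", final_step);
>>                 fclose(f);
>>             }
>>
>>             MPI_Finalize();
>>             return 0;
>>         }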
>>
>>         Job queue systems/resource managers allow a job to resubmit
>>         itself,
>>         so unless a job fails it feels like a single time integration.
>>
>>         If a job fails in the middle, you don't lose all work,
>>         just restart from the previous saved configuration.
>>         That is the only situation where you need to "monitor" the code.
>>         Resource managers/ queue systems can also email you in
>>         case the job fails, warning you to do manual intervention.
>>
>>         The time granularity per job (nsteps) is up to you.
>>         Normally it is adjusted to the max walltime of job queues
>>         (in a shared computer/cluster),
>>         but in your case it can be adjusted to how often the program
>>         fails.
>>
>>         All atmosphere/ocean/climate/weather_forecast models work
>>         this way (that's what we mostly run here).
>>         I guess most CFD, computational chemistry, etc., programs also do.
>>
>>         I hope this helps,
>>         Gus Correa
>>
>>
>>
>>         On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>>
>>             Hello,
>>
>>             I have an MPI code which sometimes hangs (it simply stops
>>             running). It is not clear why, and it uses many large
>>             third-party libraries, so I do not want to try to fix it.
>>             The code is easy to restart, but then it needs to be
>>             monitored closely by me, and I'd prefer to do it
>>             automatically.
>>
>>             Is there a way, within an MPI process, to detect the hang
>>             and abort if so?
>>
>>             In pseudocode, I would like to do something like
>>
>>                 for (all time steps)
>>                      step
>>                      if (nothing has happened for x minutes)
>>
>>                          call mpi abort to return control to the shell
>>
>>                      endif
>>
>>                 endfor
>>
>>             This structure does not work, however, because once the
>>             code is stuck it can no longer do anything, including
>>             checking on itself.
>>
>>
>>             Thank you,
>>             Alex
>>