Sadly, no. For a while we had a file monitor that might have been usable for 
this, but it isn’t in the 1.6 series. So I fear your best bet is to have the 
code periodically output some kind of marker, and to have a separate process 
check that the marker is still being updated. Either way would require modifying 
the code, and that seems to be outside the desired scope of the solution.

Afraid I don’t know how to accomplish what you seek without code modification.
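
For what it's worth, the marker approach could be sketched roughly as below. 
This is only an illustration under assumptions, not anything Open MPI provides: 
the function names (touch_heartbeat, is_stale, run_with_watchdog) and the 
heartbeat file are hypothetical, and it assumes the application can be changed 
to call touch_heartbeat once per time step. The watchdog launches the job, 
polls the marker's modification time, and stops the job if the marker has not 
been refreshed within the timeout.

```python
import os
import subprocess
import sys
import time


def touch_heartbeat(path):
    """Called by the application once per time step: refresh the marker's mtime."""
    with open(path, "a"):
        os.utime(path, None)


def is_stale(path, timeout_s, now=None):
    """True if the marker is missing or was last updated more than timeout_s ago."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > timeout_s
    except OSError:
        return True


def run_with_watchdog(cmd, heartbeat, timeout_s, poll_s=1.0):
    """Run cmd and stop it if the heartbeat file goes stale.

    Returns the job's exit code, or None if it was stopped as hung.
    """
    touch_heartbeat(heartbeat)  # start the clock at launch
    proc = subprocess.Popen(cmd)
    while True:
        try:
            return proc.wait(timeout=poll_s)  # job finished on its own
        except subprocess.TimeoutExpired:
            pass  # still running; check the marker below
        if is_stale(heartbeat, timeout_s):
            # SIGTERM first so the launcher gets a chance to clean up,
            # then SIGKILL if it does not exit within 10 seconds.
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
            return None
```

A wrapper script could call run_with_watchdog with the usual mpirun command 
line as the cmd list and a timeout of, say, 600 seconds, then restart from the 
last saved state whenever it returns None. The timeout needs to be comfortably 
longer than the slowest legitimate time step, and the marker should live on a 
filesystem both the job and the watchdog can see.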

> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
> 
> Dear Dr. Castain, 
> 
> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other info 
> which would be helpful? Partial output follows.
> 
> Thanks, 
> Alex 
> 
> -bash-4.1$ ompi_info
> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
> Open MPI: 1.6.5
> ...
> C compiler family name: GNU
> C compiler version: 4.8.2
> 
> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Alex
> 
> You know all this, but just in case ...
> 
> Restartable code goes like this:
> 
> *****************************
> .start
> 
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>   ... march in time ...
>   time_step = time_step +1
> end
> 
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
> 
> .end
> ************************************************
> 
> Then restart the job time and again, until the desired
> number of time steps is completed.
> 
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
> 
> If a job fails in the middle, you don't lose all work,
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you in
> case the job fails, so you know to intervene manually.
> 
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of job queues
> (in a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
> 
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational chemistry, etc. programs also do.
> 
> I hope this helps,
> Gus Correa
> 
> 
> 
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
> Hello,
> 
> I have an MPI code which sometimes hangs: it simply stops running. It is
> not clear why, and it uses many large third-party libraries, so I do not
> want to try to fix it. The code is easy to restart, but then it needs to
> be monitored closely by me, and I'd prefer to do it automatically.
> 
> Is there a way, within an MPI process, to detect the hang and abort if so?
> 
> In pseudocode, I would like to do something like
> 
>     for (all time steps)
>          step
>          if (nothing has happened for x minutes)
>              call mpi abort to return control to the shell
>          endif
>     endfor
> 
> This structure does not work, however, as the code can no longer do anything,
> including check itself, when it is stuck.
> 
> 
> Thank you,
> Alex
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29471.php
