An outside monitor should work. My current outline of the monitor script
(written with advice from the sys admin) leaves room for bugs around
environment variables and the like.
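
For concreteness, here is a minimal sketch of the watchdog side in Python.
It is only an illustration; the heartbeat file name, job command, and
timeout are placeholders, not the actual script:

#!/usr/bin/env python
# Watchdog sketch: launch the MPI job, then restart it whenever the
# heartbeat file (touched by the application each time step) goes stale.
import os
import signal
import subprocess
import time

HEARTBEAT = "heartbeat.txt"                  # file the job touches each step
TIMEOUT = 10 * 60                            # seconds of silence = assumed hang
CMD = ["mpirun", "-np", "16", "./my_solver"] # hypothetical job command

while True:
    # Start (or restart) the job in its own session so the whole
    # process tree can be killed at once.
    proc = subprocess.Popen(CMD, preexec_fn=os.setsid)
    while proc.poll() is None:
        time.sleep(30)
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
        except OSError:
            continue                         # heartbeat not written yet
        if age > TIMEOUT:
            print("No progress for %d s, restarting" % int(age))
            os.killpg(proc.pid, signal.SIGTERM)
            proc.wait()
            break
    if proc.returncode == 0:
        break                                # clean exit: the run is done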

I wanted to make sure there was not a simpler solution, or one that is less
error-prone. Modifying the main routine that calls the library or external
scripts is no problem; I only meant that I did not want to debug the
library internals, which are huge and complicated!

Appreciate the advice. Thank you,
Alex

On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Sadly, no - there was some possibility of using a file monitor we had for
> a while, but that isn’t in the 1.6 series. So I fear your best bet is to
> periodically output some kind of marker, and have a separate process that
> monitors to see if it is being updated. Either way would require modifying
> code and that seems to be outside the desired scope of the solution.
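>
> A minimal illustration of the marker side (illustrative only; the file
> name is a placeholder, and in C or Fortran the equivalent is simply
> rewriting a small file once per time step):
>
> import time
>
> def touch_heartbeat(path="heartbeat.txt"):
>     # Rewrite a tiny marker file; its modification time is the
>     # "still alive" signal a separate monitor process can watch.
>     with open(path, "w") as f:
>         f.write("step finished at %s\n" % time.ctime())
>
> # In the time-stepping loop, one rank (e.g. rank 0) calls this each step:
> #     if rank == 0:
> #         touch_heartbeat()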
>
> Afraid I don’t know how to accomplish what you seek without code
> modification.
>
> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>
> Dear Dr. Castain,
>
> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
> info which would be helpful? Partial output follows.
>
> Thanks,
> Alex
>
> -bash-4.1$ ompi_info
>
> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
> Open MPI: 1.6.5
> ...
> C compiler family name: GNU
> C compiler version: 4.8.2
>
>
> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Alex
>>
>> You know all this, but just in case ...
>>
>> Restartable code goes like this:
>>
>> *****************************
>> .start
>>
>> read the initial/previous configuration from a file
>> ...
>> final_step = first_step + nsteps
>> time_step = first_step
>> while ( time_step .le. final_step )
>>   ... march in time ...
>>   time_step = time_step +1
>> end
>>
>> save the final_step configuration (or phase space) to a file
>> [depending on the algorithm you may need to save the
>> previous config also, or perhaps a few more]
>>
>> .end
>> ************************************************
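>>
>> The same pattern as a minimal Python sketch (the file name, step counts,
>> and the initialize/march helpers are all placeholders):
>>
>> import os
>> import pickle
>>
>> CHECKPOINT = "state.pkl"      # placeholder checkpoint file name
>> nsteps = 1000                 # time steps per job submission
>>
>> def initialize():             # hypothetical: build the starting state
>>     return 0.0
>>
>> def march(state):             # hypothetical: advance one time step
>>     return state + 1.0
>>
>> # read the initial/previous configuration, if one was saved
>> if os.path.exists(CHECKPOINT):
>>     with open(CHECKPOINT, "rb") as f:
>>         state, first_step = pickle.load(f)
>> else:
>>     state, first_step = initialize(), 0
>>
>> for step in range(first_step, first_step + nsteps):
>>     state = march(state)
>>
>> # save the final configuration for the next submission
>> with open(CHECKPOINT, "wb") as f:
>>     pickle.dump((state, first_step + nsteps), f)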
>>
>> Then restart the job time and again, until the desired
>> number of time steps is completed.
>>
>> Job queue systems/resource managers allow a job to resubmit itself,
>> so unless a job fails it feels like a single time integration.
>>
>> If a job fails in the middle, you don't lose all work,
>> just restart from the previous saved configuration.
>> That is the only situation where you need to "monitor" the code.
>> Resource managers/queue systems can also email you if
>> the job fails, alerting you to intervene manually.
>>
>> The time granularity per job (nsteps) is up to you.
>> Normally it is adjusted to the max walltime of job queues
>> (in a shared computer/cluster),
>> but in your case it can be adjusted to how often the program fails.
>>
>> All atmosphere/ocean/climate/weather_forecast models work
>> this way (that's what we mostly run here).
>> I guess most CFD, computational chemistry, etc., programs also do.
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>>
>> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>>
>>> Hello,
>>>
>>> I have an MPI code which sometimes hangs: it simply stops running. It is
>>> not clear why, and it uses many large third-party libraries, so I do not
>>> want to try to fix it. The code is easy to restart, but then I need to
>>> monitor it closely, and I'd prefer to do that automatically.
>>>
>>> Is there a way, within an MPI process, to detect the hang and abort
>>> when it occurs?
>>>
>>> In pseudocode, I would like to do something like
>>>
>>>     for (all time steps)
>>>          step
>>>          if (nothing has happened for x minutes)
>>>
>>>              call MPI_Abort to return control to the shell
>>>
>>>          endif
>>>
>>>     endfor
>>>
>>> This structure does not work, however, because once the code is stuck it
>>> can no longer do anything, including check on itself.
>>>
>>>
>>> Thank you,
>>> Alex
>>>
>>>
>>>