Dear Dr. Correa,

This is indeed the structure; it is a CFD program. Most of what you suggest is already my current workflow, including saving checkpoints, sending an email upon a crash, and restarting.
The problem is that the code does not crash but hangs. If it deadlocks, it just sits there spinning cycles until I happen to check on it. Monitoring the code like this has become inefficient: sometimes an overnight run makes progress for only half an hour and I don't notice until the morning. Restarting after a hang also means sitting in the queue again. I will try to better understand the job system's automatic resubmission, but for now I do not see how to use it to fix the deadlock problem.

After thinking about your email, perhaps I can phrase my question more precisely: how can I return control to the shell if the MPI processes have deadlocked?

Thank you,
Alex

On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *****************************
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>    ... march in time ...
>    time_step = time_step + 1
> end
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> ************************************************
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
>
> If a job fails in the middle, you don't lose all the work;
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you in
> case the job fails, warning you to do manual intervention.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of the job queues
> (on a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational chemistry, etc., programs do too.
>
> I hope this helps,
> Gus Correa
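For concreteness, the restartable structure Gus describes above might look roughly like this in C. This is only an illustrative sketch, not code from this thread; the checkpoint file name, the number of steps per job, and the stand-in state array are made up, and error handling is mostly omitted.

    /* Restartable time-stepping skeleton: read the last checkpoint if one
     * exists, march NSTEPS_PER_JOB further, write a new checkpoint, and exit
     * so the next job submission can continue from there. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTEPS_PER_JOB 1000              /* time granularity per job */

    int main(void)
    {
        long   first_step = 0;
        double config[3]  = {0.0, 0.0, 0.0}; /* stand-in for the real state */

        /* Resume from the previous configuration if a checkpoint exists. */
        FILE *fp = fopen("checkpoint.dat", "rb");
        if (fp != NULL) {
            fread(&first_step, sizeof first_step, 1, fp);
            fread(config, sizeof config[0], 3, fp);
            fclose(fp);
        }

        long final_step = first_step + NSTEPS_PER_JOB;
        for (long step = first_step; step < final_step; step++) {
            /* ... march in time ... */
            config[0] += 1.0;                /* placeholder update */
        }

        /* Save the configuration; the next job starts at final_step. */
        fp = fopen("checkpoint.dat", "wb");
        if (fp == NULL) {
            perror("checkpoint.dat");
            return EXIT_FAILURE;
        }
        fwrite(&final_step, sizeof final_step, 1, fp);
        fwrite(config, sizeof config[0], 3, fp);
        fclose(fp);
        return EXIT_SUCCESS;
    }

A batch script can then resubmit the same job until the desired total number of steps is reached, which is the self-resubmission Gus mentions.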
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>
>> Hello,
>>
>> I have an MPI code which sometimes hangs; it simply stops running. It is
>> not clear why, and it uses many large third-party libraries, so I do not
>> want to try to fix it. The code is easy to restart, but then it needs to
>> be monitored closely by me, and I'd prefer to do it automatically.
>>
>> Is there a way, within an MPI process, to detect the hang and abort if so?
>>
>> In pseudocode, I would like to do something like
>>
>> for (all time steps)
>>     step
>>     if (nothing has happened for x minutes)
>>         call mpi abort to return control to the shell
>>     endif
>> endfor
>>
>> This structure does not work, however, as the process can no longer do
>> anything, including check itself, once it is stuck.
>>
>> Thank you,
>> Alex
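One way around the self-check problem in the quoted question (a rank cannot check itself once it is blocked inside an MPI call) is to move the check onto a helper thread that never touches MPI. The sketch below is only an illustration under assumptions not in this thread: POSIX threads, a hypothetical per-step heartbeat counter, and made-up timeout values. The helper calls plain abort() rather than MPI_Abort so that no MPI calls are made from a second thread; once the rank dies, mpirun should tear down the remaining processes and return control to the shell or batch script, which can then resubmit from the last checkpoint.

    /* Watchdog sketch: a helper thread keeps running even when the main
     * thread is stuck in a blocking MPI call, and kills the rank if the
     * heartbeat counter stops advancing for TIMEOUT_SECONDS. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define POLL_SECONDS    60            /* how often the watchdog wakes  */
    #define TIMEOUT_SECONDS (30 * 60)     /* "nothing happened for x min"  */

    static atomic_long heartbeat;         /* bumped once per finished step */

    static void *watchdog(void *arg)
    {
        (void)arg;
        long last = atomic_load(&heartbeat);
        int  idle = 0;
        for (;;) {
            sleep(POLL_SECONDS);
            long now = atomic_load(&heartbeat);
            idle = (now == last) ? idle + POLL_SECONDS : 0;
            last = now;
            if (idle >= TIMEOUT_SECONDS)
                abort();                  /* dead rank -> mpirun ends the job */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        pthread_t tid;
        pthread_create(&tid, NULL, watchdog, NULL);
        pthread_detach(tid);

        for (long step = 0; step < 1000000; step++) {
            /* ... one time step; collective MPI calls may hang in here ... */
            atomic_fetch_add(&heartbeat, 1);
        }

        MPI_Finalize();
        return 0;
    }

An equivalent check can also live entirely outside the program, for example a script that watches the modification time of a log or heartbeat file and kills the mpirun when it goes stale; either way, the point is that the thing doing the checking is not the thing that can hang.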