Dear Dr. Correa,

This is indeed the structure; it is a CFD program. Most of what you suggest is already my current workflow, including saving checkpoints, sending an email upon a crash, and restarting.
The problem is that the code does not crash but hangs. If it deadlocks, it just sits there spinning cycles until I happen to check on it. Monitoring the code like this has become inefficient: sometimes an overnight run makes progress for only half an hour and I don't notice until the morning. Restarting after a hang also means sitting in the queue again. I will try to better understand the job system's automatic resubmission, but for now I do not see how to use it to fix the deadlock problem.

After thinking about your email, perhaps I can phrase my question more precisely: how can I return control to the shell if the MPI processes have deadlocked?

Thank you,
Alex

On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *****************************
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>    ... march in time ...
>    time_step = time_step + 1
> end
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> ************************************************
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
>
> If a job fails in the middle, you don't lose all the work;
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you in
> case the job fails, warning you to do manual intervention.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of the job queues
> (on a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational chemistry, etc., programs do too.
>
> I hope this helps,
> Gus Correa
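For concreteness, the restartable structure Gus describes above might look roughly like this in C. This is only an illustrative sketch, not code from this thread; the checkpoint file name, the number of steps per job, and the stand-in state array are made up, and error handling is mostly omitted.

    /* Restartable time-stepping skeleton: read the last checkpoint if one
     * exists, march NSTEPS_PER_JOB further, write a new checkpoint, and exit
     * so the next job submission can continue from there. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTEPS_PER_JOB 1000              /* time granularity per job */

    int main(void)
    {
        long   first_step = 0;
        double config[3]  = {0.0, 0.0, 0.0}; /* stand-in for the real state */

        /* Resume from the previous configuration if a checkpoint exists. */
        FILE *fp = fopen("checkpoint.dat", "rb");
        if (fp != NULL) {
            fread(&first_step, sizeof first_step, 1, fp);
            fread(config, sizeof config[0], 3, fp);
            fclose(fp);
        }

        long final_step = first_step + NSTEPS_PER_JOB;
        for (long step = first_step; step < final_step; step++) {
            /* ... march in time ... */
            config[0] += 1.0;                /* placeholder update */
        }

        /* Save the configuration; the next job starts at final_step. */
        fp = fopen("checkpoint.dat", "wb");
        if (fp == NULL) {
            perror("checkpoint.dat");
            return EXIT_FAILURE;
        }
        fwrite(&final_step, sizeof final_step, 1, fp);
        fwrite(config, sizeof config[0], 3, fp);
        fclose(fp);
        return EXIT_SUCCESS;
    }

A batch script can then resubmit the same job until the desired total number of steps is reached, which is the self-resubmission Gus mentions.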
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>
>> Hello,
>>
>> I have an MPI code which sometimes hangs; it simply stops running. It is
>> not clear why, and it uses many large third-party libraries, so I do not
>> want to try to fix it. The code is easy to restart, but then it needs to
>> be monitored closely by me, and I'd prefer to do it automatically.
>>
>> Is there a way, within an MPI process, to detect the hang and abort if so?
>>
>> In pseudocode, I would like to do something like
>>
>> for (all time steps)
>>     step
>>     if (nothing has happened for x minutes)
>>         call mpi abort to return control to the shell
>>     endif
>> endfor
>>
>> This structure does not work, however, as the process can no longer do
>> anything, including check itself, once it is stuck.
>>
>> Thank you,
>> Alex
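One way around the self-check problem in the quoted question (a rank cannot check itself once it is blocked inside an MPI call) is to move the check onto a helper thread that never touches MPI. The sketch below is only an illustration under assumptions not in this thread: POSIX threads, a hypothetical per-step heartbeat counter, and made-up timeout values. The helper calls plain abort() rather than MPI_Abort so that no MPI calls are made from a second thread; once the rank dies, mpirun should tear down the remaining processes and return control to the shell or batch script, which can then resubmit from the last checkpoint.

    /* Watchdog sketch: a helper thread keeps running even when the main
     * thread is stuck in a blocking MPI call, and kills the rank if the
     * heartbeat counter stops advancing for TIMEOUT_SECONDS. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define POLL_SECONDS    60            /* how often the watchdog wakes  */
    #define TIMEOUT_SECONDS (30 * 60)     /* "nothing happened for x min"  */

    static atomic_long heartbeat;         /* bumped once per finished step */

    static void *watchdog(void *arg)
    {
        (void)arg;
        long last = atomic_load(&heartbeat);
        int  idle = 0;
        for (;;) {
            sleep(POLL_SECONDS);
            long now = atomic_load(&heartbeat);
            idle = (now == last) ? idle + POLL_SECONDS : 0;
            last = now;
            if (idle >= TIMEOUT_SECONDS)
                abort();                  /* dead rank -> mpirun ends the job */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        pthread_t tid;
        pthread_create(&tid, NULL, watchdog, NULL);
        pthread_detach(tid);

        for (long step = 0; step < 1000000; step++) {
            /* ... one time step; collective MPI calls may hang in here ... */
            atomic_fetch_add(&heartbeat, 1);
        }

        MPI_Finalize();
        return 0;
    }

An equivalent check can also live entirely outside the program, for example a script that watches the modification time of a log or heartbeat file and kills the mpirun when it goes stale; either way, the point is that the thing doing the checking is not the thing that can hang.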