Re: [OMPI users] Restart after code hangs
How about sending a 'ping' to a socket periodically which is monitored
by an auxiliary program that runs where the master process runs?
Also, I know you don't want to delve into the third-party libs but have
you actually tried to get to the bottom of the hang, e.g. run an strace,
attach a debu
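
A minimal sketch of the 'ping' idea, assuming UDP on the loopback interface since the monitor runs on the same node as the master process. The names send_ping, open_monitor and wait_until_silent are hypothetical, not part of Open MPI or of the code discussed in this thread. The master would call send_ping every so often (for example once per time step), and the auxiliary program treats a long enough silence as a hang.

```python
import socket


def send_ping(address, payload=b"ping"):
    """Called from the master process, e.g. once per time step (hypothetical hook)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, address)


def open_monitor(host="127.0.0.1", port=0):
    """Open the monitor's UDP socket; port=0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def wait_until_silent(sock, timeout):
    """Block until no ping arrives for `timeout` seconds.

    Returns the number of pings received before the silence, so the caller
    can tell "never started" (0) apart from "ran for a while, then stopped".
    """
    sock.settimeout(timeout)
    seen = 0
    while True:
        try:
            sock.recvfrom(64)
        except socket.timeout:
            return seen
        seen += 1
```

The auxiliary program would call wait_until_silent with a timeout comfortably longer than the slowest expected step, and kill and restart the job when it returns. The timeout value has to be chosen per application; nothing here detects why the code stopped.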
An outside monitor should work. My outline of the monitor script (with
advice from the sys admin) leaves room for bugs involving environment
variables and the like.
I wanted to make sure there was not a simpler solution, or one that is less
error prone. Modifying the main routine which calls the libr
Sadly, no - there was some possibility of using a file monitor we had for
a while, but that isn’t in the 1.6 series. So I fear your best bet is to
periodically output some kind of marker, and have a separate process that
monitors to see if it is being updated. Either way would require modifying c
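
The marker-plus-monitor approach described above could be sketched roughly as follows, assuming the marker is a file whose modification time the application refreshes as it makes progress. The names is_stale and run_with_watchdog, the marker path, and the restart limit are all hypothetical. The sketch sends SIGTERM to the launched command first and falls back to SIGKILL; whether the launcher then cleans up every MPI rank is not something this code checks.

```python
import os
import subprocess
import time


def is_stale(marker, max_age, now=None):
    """True if `marker` is missing or was last modified over `max_age` seconds ago."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(marker) > max_age
    except OSError:
        return True


def run_with_watchdog(cmd, marker, max_age, poll=5.0, max_restarts=3):
    """Run `cmd`; stop and relaunch it whenever the marker stops being updated.

    Returns the exit code of a run that finished on its own, or None if the
    restart budget was used up.
    """
    for _ in range(max_restarts + 1):
        # Refresh the marker so each launch starts with a full grace period.
        with open(marker, "a"):
            os.utime(marker, None)
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            if is_stale(marker, max_age):
                proc.terminate()              # ask politely first
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()               # then force it
                    proc.wait()
                break                         # relaunch in the outer loop
            time.sleep(poll)
        else:
            return proc.returncode            # the run ended by itself
    return None
```

A call might look like run_with_watchdog(["mpirun", "-np", "64", "./solver"], "heartbeat", max_age=600), where the command line and the ten-minute limit are placeholders. The application still has to touch the marker itself, so this does not avoid modifying the code.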
Dear Dr. Correa,
This is indeed the structure; it is a CFD program. Most of what you are
suggesting is my current workflow, including saving, sending emails upon a
crash and restarting.
The problem is that the code does not crash but hangs. If it is deadlocked
then it sits there spinning cycles u
Dear Dr. Castain,
I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
info which would be helpful? Partial output follows.
Thanks,
Alex
-bash-4.1$ ompi_info
Package: Open MPI l...@soho.es.its.nyu.edu Distribution
Open MPI: 1.6.5
...
C compiler family name: GNU
C compiler ve
Hi Alex
You know all this, but just in case ...
Restartable code goes like this:
*
.start
read the initial/previous configuration from a file
...
final_step = first_step + nsteps
time_step = first_step
while ( time_step .le. final_step )
... march in time ...
ti
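
For readers who want something runnable, the restart pattern outlined above might look like the sketch below. It is an illustration only: the JSON restart file, the save_every interval, and the one-line stand-in for the time march are assumptions, and the loop bound is strict so that exactly nsteps steps run per invocation.

```python
import json
import os


def load_state(path):
    """Read the previous configuration, or start fresh if there is none."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"time_step": 0, "value": 0.0}


def save_state(path, state):
    """Write the restart file atomically so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, nsteps, save_every=10):
    state = load_state(path)
    first_step = state["time_step"]
    final_step = first_step + nsteps
    time_step = first_step
    while time_step < final_step:
        state["value"] += 1.0          # stand-in for "march in time"
        time_step += 1
        state["time_step"] = time_step
        # Save periodically and always at the end of the run.
        if time_step % save_every == 0 or time_step == final_step:
            save_state(path, state)
    return state
```

Running run("restart.json", 25) and then run("restart.json", 5) continues from step 25 to step 30. If the job is killed between saves, at most save_every steps are repeated on the next launch.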
Which version of OMPI are you using?
> On Jun 16, 2016, at 2:25 PM, Alex Kaiser wrote:
>
> Hello,
>
> I have an MPI code which sometimes hangs, simply stops running. It is not
> clear why and it uses many large third party libraries so I do not want to
> try to fix it. The code is easy to re
Hello,
I have an MPI code which sometimes hangs, simply stops running. It is not
clear why and it uses many large third party libraries so I do not want to
try to fix it. The code is easy to restart, but then it needs to be
monitored closely by me, and I'd prefer to do it automatically.
Is there