On Apr 20, 2009, at 9:29 PM, ESTEBAN MENESES ROJAS wrote:

   Hello.
Is there any way to automatically checkpoint/restart an application in Open MPI? That is, checkpointing the application without using the ompi-checkpoint command, perhaps via a function call in the application's own code. The same goes for restarting after a failure.

Currently, Open MPI supports checkpointing an application only through the ompi-checkpoint command, and restarting it only through the ompi-restart command. We do not expose a function call that lets the application start a checkpoint operation itself.
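
To make the command-driven workflow concrete, here is a minimal Python sketch of an external helper that builds the two command lines. The helper names `checkpoint_cmd` and `restart_cmd` are hypothetical. The sketch assumes that ompi-checkpoint is given the PID of the running mpirun process and that ompi-restart is given the snapshot reference reported by ompi-checkpoint; check the man pages of your installation before relying on either.

```python
def checkpoint_cmd(mpirun_pid):
    """Build the command line that checkpoints a running job.

    Assumption: ompi-checkpoint identifies the job by the PID of mpirun.
    """
    return ["ompi-checkpoint", str(mpirun_pid)]


def restart_cmd(snapshot_ref):
    """Build the command line that restarts a job from a snapshot.

    Assumption: snapshot_ref is the reference reported by ompi-checkpoint.
    """
    return ["ompi-restart", str(snapshot_ref)]
```

A supervisor script could pass these lists to `subprocess.run(..., check=True)` on a timer, which automates the checkpoint from outside the application rather than from within it.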

On a temporary branch, I developed an interface as part of a proposal to the MPI Forum. It works for a coordinated checkpoint (all processes must call the function, much like a barrier). It is not yet ready to move to the trunk because some of the supporting infrastructure is still missing, and I am still working on that.
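
The barrier-like semantics of a coordinated checkpoint can be illustrated with a small simulation. This is only a sketch: Python threads stand in for MPI processes, and `collective_checkpoint` is a hypothetical name, not the function on the branch. It shows that no participant saves its state until every participant has reached the call, and none continues until all have saved.

```python
import threading


def run_coordinated_checkpoint(num_ranks):
    """Simulate num_ranks processes that all call a collective checkpoint."""
    saved = {}  # rank -> state captured at the checkpoint
    barrier = threading.Barrier(num_ranks)

    def collective_checkpoint(rank, state):
        # Hypothetical collective call: wait until every rank has arrived.
        barrier.wait()
        saved[rank] = state
        # Wait again so that nobody continues before all ranks have saved.
        barrier.wait()

    def rank_main(rank):
        state = rank * 10  # stand-in for real application state
        collective_checkpoint(rank, state)

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved
```

If any one rank skipped the call, the others would block at the first wait, which is why such an interface has to be called by all processes.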

This branch does not expose an interface to restart a process. What that interface should look like quickly becomes a much more difficult question. If you have ideas on the interface signature and semantics I would be interested in hearing about them.


On a related note, what is the default behavior of an OpenMPI application after one process fails? Does the runtime shut down the whole application?

If a process fails, Open MPI will, by default, terminate the whole application. A couple of the core development teams are working on alternative failure modes, but I do not think any of that work has made it to the development trunk yet.

Best,
Josh


   Thanks.