Though I would not recommend your technique for initiating a checkpoint from an application, it may work. Since ompi-checkpoint needs to contact and interact with every MPI process, problems can arise if the application is blocked in system() while ompi-checkpoint is trying to interact with that process. Additionally, if you are using any fork()-sensitive software or hardware (some high-speed interconnects fall into this category), then calling system() (which uses fork() on the back end) may cause a variety of problems, including memory corruption.

That being said, if you have configured Open MPI to use the C/R Fault Tolerance thread, then this may work. You will want to make sure that only one MPI process in the entire job calls ompi-checkpoint (having every process call it is probably the cause of the problem you mention below). The rest of the processes can sit in an MPI_Barrier on the other side of the mychkpt() operation if you want your processes to wait for the checkpoint to finish before proceeding (though this is not required). Additionally, the MPI process that calls ompi-checkpoint will always need to be on the same node as the mpirun process in order for the ompi-checkpoint command to work.
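
The arrangement described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe: the function name checkpoint_from_one_rank, the two callback types, and the choice of initiating rank are all hypothetical, and the code assumes the initiating rank runs on the same node as mpirun. The command runner and the barrier are passed in as callbacks so that the "one rank only" logic does not depend on MPI itself; in a real job you would pass system() and a thin wrapper around MPI_Barrier().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback types: a command runner (for example system())
 * and a barrier (for example a wrapper around MPI_Barrier()). */
typedef int (*run_cmd_fn)(const char *cmd);
typedef void (*barrier_fn)(void);

/*
 * Run `cmd` on exactly one rank (`initiator`), then let every rank wait
 * at the barrier if one is supplied. Returns the runner's result on the
 * initiating rank and 0 on every other rank.
 *
 * Assumption: `initiator` is a rank located on the same node as mpirun.
 */
int checkpoint_from_one_rank(int rank, int initiator, const char *cmd,
                             run_cmd_fn run_cmd, barrier_fn barrier)
{
    int status = 0;

    assert(cmd != NULL && run_cmd != NULL);

    if (rank == initiator) {
        /* Only this rank launches the checkpoint command. */
        status = run_cmd(cmd);
    }
    if (barrier != NULL) {
        /* Optional: all ranks wait here until the initiator returns. */
        barrier();
    }
    return status;
}

/*
 * Possible use inside an MPI program (needs <mpi.h> and MPI_Init):
 *
 *   static void world_barrier(void) { MPI_Barrier(MPI_COMM_WORLD); }
 *   ...
 *   int rank;
 *   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *   checkpoint_from_one_rank(rank, 0,
 *                            "ompi-checkpoint -v `pidof mpirun`",
 *                            system, world_barrier);
 */
```

Passing NULL for the barrier skips the wait, which matches the point that waiting is optional. The caveats about blocking in system() and about fork()-sensitive interconnects still apply to the rank that launches the command.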

Give that a try and let me know if it helps.


As a side note, I have an API for initiating a checkpoint operation through Open MPI's Extensions interface. It is nearly ready, and will probably be available on the Open MPI trunk in the next couple of months. I'll post to the list when it is available if you want to give that a try.

-- Josh

On Aug 27, 2009, at 10:24 PM, Jean Potsam wrote:

Dear all,
I am trying to checkpoint an MPI application at specific points in my program. So, I created a small function as follows:

void mychkpt()
{
    system("ompi-checkpoint -v `pidof mpirun`");
}

and I am calling it in my MPI application at specific points, e.g.:

##############
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 6");
mychkpt();
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 4");
mychkpt();
#############

If I do:

mpirun -am ft-enable-cr -np 1 mpisleepts0

it works fine. But if I use more than 1 node there is a problem, e.g.:

mpirun -am ft-enable-cr -np 2 mpisleepts0

I get

################
I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
[jean:13673] orte_checkpoint: Checkpointing...
[jean:13673]      PID 13647
[jean:13673]      Connected to Mpirun [[28355,0],0]
[jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Requested - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Pending - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Running - Global Snapshot Reference: (null)
[jean:13672] orte_checkpoint: Checkpointing...
[jean:13672]      PID 13647
[jean:13672]      Connected to Mpirun [[28355,0],0]
[jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] File Transfer - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Finished - Global Snapshot Reference: ompi_global_snapshot_13647.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_13647.ckpt
^Xmpirun: killing job...
#################

It runs the function twice simultaneously, which tries to start the checkpointing process twice, thus causing problems.

How can I ensure that the checkpointing process is called only once when more than one process is running?

Please give me some ideas on it.

Thank you

Jean

