Dear All,

I am trying to checkpoint a test application using openmpi-1.3.1, but checkpointing fails when the processes run on different nodes.
Checkpointing works fine if the processes run on the same node as the mpirun process. But the moment I launch the MPI processes on a different node, it hangs. For example,

    mpirun -np 2 ./test

checkpoints fine using ompi-checkpoint -v <mpirun_pid>, but

    mpirun -np 2 -H host1 ./test

hangs while checkpointing. Similarly,

    mpirun -np 2 -H localhost,host1 ./test

still hangs while checkpointing.

Here is the output produced while checkpointing:

----------------------------------------
[n0:01596] orte_checkpoint: Checkpointing...
[n0:01596] PID 1514
[n0:01596] Connected to Mpirun [[11946,0],0]
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Requested - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Pending - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Running - Global Snapshot Reference: (null)
Note: It hangs here
----------------------------------------

The command used to launch the program is:

    /usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out

The dummy program is pretty simple:

    #include <time.h>
    #include <stdio.h>
    #include <mpi.h>

    #define LIMIT 10000000

    int main(int argc, char **argv)
    {
        int i;
        int my_rank;   /* Rank of process */
        int np;        /* Number of process */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        for (i = 0; i <= LIMIT; i++) {
            printf("\n HELLO %d", i);
            //sleep(10);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Please let me know what the error could be. I suspect the problem is in the MPI process coordination.

Regards,
Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757