Dear All,

   I am trying to checkpoint a test application using openmpi-1.3.1, but 
it fails when the processes run on different nodes.

 Checkpointing works fine if the processes run on the same node as the 
mpirun process. But the moment I launch the MPI processes on a different 
node, the checkpoint hangs.

 For example,
   mpirun -np 2 ./test
 checkpoints fine using ompi-checkpoint -v <mpirun_pid>, but
   mpirun -np 2 -H host1 ./test
 hangs while checkpointing.

Similarly,
   mpirun -np 2 -H localhost,host1 ./test
still hangs while checkpointing.

Below is the output printed while checkpointing:

--------------xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx----------------------------
[n0:01596] orte_checkpoint: Checkpointing... 
[n0:01596]       PID 1514 
[n0:01596]       Connected to Mpirun [[11946,0],0] 
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514 
[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID] 
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message. 
[n0:01596] orte_checkpoint: hnp_receiver: Status Update. 
[n0:01596]                 Requested - Global Snapshot Reference: (null) 
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message. 
[n0:01596] orte_checkpoint: hnp_receiver: Status Update. 
[n0:01596]                   Pending - Global Snapshot Reference: (null) 
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message. 
[n0:01596] orte_checkpoint: hnp_receiver: Status Update. 
[n0:01596]                   Running - Global Snapshot Reference: (null) 

Note: It hangs here 

------------------------------*******************************---------------------

The command used to launch the program is:

/usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
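
The launch and checkpoint steps described above can be sketched as a single script. This is only a rough sketch: the variable names (TARGET_HOST, DRY_RUN, LAUNCH), the dry-run default, and the 5-second wait are illustrative assumptions, while the mpirun and ompi-checkpoint invocations are the ones quoted in this mail.

```shell
#!/bin/sh
# Sketch only: by default it just prints the commands.
# Set DRY_RUN=0 to actually run them (needs Open MPI built with C/R support).
# TARGET_HOST, DRY_RUN and the 5-second wait are illustrative assumptions.
TARGET_HOST="${TARGET_HOST:-n5}"
DRY_RUN="${DRY_RUN:-1}"

# Launch line as used in this report (checkpoint/restart enabled).
LAUNCH="mpirun -np 2 -H $TARGET_HOST -am ft-enable-cr --mca btl tcp,self a.out"

if [ "$DRY_RUN" = "1" ]; then
    echo "$LAUNCH &"
    echo "ompi-checkpoint -v <mpirun_pid>"
else
    $LAUNCH &                          # start the job in the background
    MPIRUN_PID=$!                      # PID of mpirun, passed to ompi-checkpoint
    sleep 5                            # arbitrary wait so the job is running
    ompi-checkpoint -v "$MPIRUN_PID"   # this is the step that hangs for me
fi
```

Run as is, it only prints the two commands; with DRY_RUN=0 it performs the same steps I ran by hand.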

The test program is quite simple:

#include <time.h>
#include <stdio.h>
#include <mpi.h>

#define LIMIT 10000000

int main(int argc, char **argv)
{
    int i;
    int my_rank;   /* Rank of this process */
    int np;        /* Number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    for (i = 0; i <= LIMIT; i++)
    {
        printf("\n HELLO %d", i);
        /* sleep(10); */
        MPI_Barrier(MPI_COMM_WORLD);   /* all ranks synchronize every iteration */
    }

    MPI_Finalize();
    return 0;
}



Please let me know what the error could be. I suspect a problem in the 
coordination between the MPI processes.

Regards


Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757

