I still have not been able to reproduce the hang, but I'm still looking into it.

I did commit a fix for the datatype copy error that I mentioned (r21080 in the Open MPI trunk, and it is in the pipeline for v1.3).

Can you put in a print statement before MPI_Finalize, then try the program again? I am wondering whether the problem is with MPI_Finalize rather than MPI_Barrier: one (or both) of the processes may be entering MPI_Finalize while a checkpoint is occurring. Unfortunately, I have not tested the MPI_Finalize scenario in a long time, but I will put that on my to-do list.
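A minimal sketch of such a print statement, in case it helps. The helper names format_trace and trace_point are hypothetical (they are not part of Open MPI). The sketch assumes the output should go to stderr and be flushed right away, so that the message is not left sitting in a stdio buffer if the process hangs immediately afterwards.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): build a trace line that
 * identifies the rank and the point the program has reached.
 * Returns the number of characters snprintf would have written. */
static int format_trace(char *buf, size_t len, int rank, const char *where)
{
    return snprintf(buf, len, "[rank %d] reached: %s\n", rank, where);
}

/* Write the trace line to stderr and flush it immediately, so the
 * message is visible even if the process hangs right after this call. */
static void trace_point(int rank, const char *where)
{
    char line[128];

    format_trace(line, sizeof line, rank, where);
    fputs(line, stderr);
    fflush(stderr);
}
```

In the test program this would be called as trace_point(my_rank, "before MPI_Finalize"); on the line just before MPI_Finalize(). If that line shows up for every rank and the job still hangs, that would point at MPI_Finalize rather than MPI_Barrier.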

Cheers,
Josh

On Apr 27, 2009, at 9:48 AM, Josh Hursey wrote:

Sorry for the long delay to respond.

It is a bit odd that the hang does not occur when running on only one host. I suspect that is more due to timing than anything else.

I am not able to reproduce the hang at the moment, but I do get an occasional datatype copy error which could be symptomatic of a related problem. I'll dig into this a bit more this week and let you know when I have a fix and if I can reproduce the hang.

Thanks for the bug report.

Cheers,
Josh

On Apr 10, 2009, at 4:56 AM, nee...@crlindia.com wrote:


Dear All,

I am trying to checkpoint a test application using openmpi-1.3.1, but it fails when I run multiple processes on different nodes.

Checkpointing works fine if the processes run on the same node as the mpirun process. But the moment I launch the MPI processes on a different node, it hangs.

For example:
mpirun -np 2 ./test (checkpoints fine using ompi-checkpoint -v <mpirun_pid>)
but
mpirun -np 2 -H host1 ./test (checkpointing hangs)

Similarly,
mpirun -np 2 -H localhost,host1 ./test still hangs while checkpointing.

Here is the output produced while checkpointing:

-------------- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx----------------------------
[n0:01596] orte_checkpoint: Checkpointing...
[n0:01596]       PID 1514
[n0:01596]       Connected to Mpirun [[11946,0],0]
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Requested - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Pending - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Running - Global Snapshot Reference: (null)

Note: It hangs here

------------------------------ *******************************---------------------

The command used to launch the program is:

/usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out

The dummy program is quite simple:

#include <time.h>
#include <stdio.h>
#include <mpi.h>

#define LIMIT 10000000

int main(int argc, char **argv)
{
    int i;
    int my_rank; /* Rank of process */
    int np;      /* Number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Print a counter and synchronize all ranks on every iteration. */
    for (i = 0; i <= LIMIT; i++)
    {
        printf("\n HELLO %d", i);
        //sleep(10);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}



Please let me know what the error could be. I suspect the problem is in MPI process coordination.

Regards


Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
