Sorry for the long delay in responding.
It is a bit odd that the hang does not occur when running on only one
host. I suspect that is more due to timing than anything else.
I am not able to reproduce the hang at the moment, but I do get an
occasional datatype copy error which could be symptomatic of a related
problem. I'll dig into this a bit more this week and let you know when
I have a fix and if I can reproduce the hang.
Thanks for the bug report.
Cheers,
Josh
On Apr 10, 2009, at 4:56 AM, nee...@crlindia.com wrote:
Dear All,
I am trying to checkpoint a test application using openmpi-1.3.1,
but it fails when the processes run on different nodes.
Checkpointing works fine if the processes run on the same node
as the mpirun process. But the moment I launch an MPI process on a
different node, it hangs.
For example:
mpirun -np 2 ./test (checkpoints fine using ompi-checkpoint -v <mpirun_pid>)
but
mpirun -np 2 -H host1 ./test (checkpointing hangs)
Similarly,
mpirun -np 2 -H localhost,host1 ./test still hangs while checkpointing.
Here is the output produced while checkpointing:
------------------------------------------------------------
[n0:01596] orte_checkpoint: Checkpointing...
[n0:01596] PID 1514
[n0:01596] Connected to Mpirun [[11946,0],0]
[n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
[n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Requested - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Pending - Global Snapshot Reference: (null)
[n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
[n0:01596] orte_checkpoint: hnp_receiver: Status Update.
[n0:01596] Running - Global Snapshot Reference: (null)
Note: It hangs here
------------------------------------------------------------
The command used to launch the program is:
/usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
The dummy program is pretty simple:
#include <time.h>
#include <stdio.h>
#include <unistd.h> /* needed only if the sleep() call below is re-enabled */
#include <mpi.h>

#define LIMIT 10000000

int main(int argc, char **argv)
{
    int i;
    int my_rank; /* Rank of process */
    int np;      /* Number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Print and synchronize at a barrier on every iteration. */
    for (i = 0; i <= LIMIT; i++)
    {
        printf("\n HELLO %d", i);
        //sleep(10);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Please let me know what the error could be. I suspect the problem is in
MPI process coordination.
Regards
Neeraj Chourasia
Member of Technical Staff
Computational Research Laboratories Limited
(A wholly Owned Subsidiary of TATA SONS Ltd)
P: +91.9890003757