For anyone following this thread: I am following up with Hiep
offline. I'll reply to the list once the issue is resolved.
-- Josh
On Oct 3, 2007, at 11:11 AM, Hiep Bui Hoang wrote:
Hi,
I found that the problem was the firewall on one of my
computers. After I set the firewall to allow TCP connections from
the other computers on ports 1024 to 4999, the connection errors
went away. But I still cannot checkpoint and restart my program.
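A quick way to confirm that a firewall change like the one above took effect is to try a plain TCP connection to the port in question from each of the other nodes. The following is only a minimal sketch under stated assumptions: `tcp_port_reachable` is a hypothetical helper, not part of Open MPI, and it only shows whether a TCP handshake to a given host and port completes within a timeout. It says nothing about whether checkpointing will work.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): returns 1 if a TCP
 * connection to host:port completes within timeout_ms, 0 otherwise. */
int tcp_port_reachable(const char *host, int port, int timeout_ms)
{
    assert(host != NULL && port > 0 && port < 65536);

    char service[16];
    snprintf(service, sizeof service, "%d", port);

    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* assumption: IPv4, as in this thread */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return 0;

    int ok = 0;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0) {
        /* Non-blocking connect, so a firewall that silently drops
         * packets cannot make the probe hang. */
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        if (connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
            ok = 1;                  /* connected immediately */
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, timeout_ms) == 1) {
                /* The socket is ready: SO_ERROR tells success from failure. */
                int err = 0;
                socklen_t len = sizeof err;
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0
                    && err == 0)
                    ok = 1;
            }
        }
        close(fd);
    }
    freeaddrinfo(res);
    return ok;
}
```

Calling `tcp_port_reachable("172.28.11.40", port, 2000)` from each of the other nodes, with `port` set to a port on which a process on that host is actually listening, would show whether the firewall still blocks it. A return value of 0 can also mean that nothing is listening on that port, so a failed probe does not by itself prove a firewall problem.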
The error is:
$ mpirun -np 3 -host 172.28.11.40, 172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
Returned -1.
--------------------------------------------------------------------------
Only one local snapshot is created, on the computer where I run
mpirun and ompi-checkpoint. After that local snapshot is created,
the checkpoint terminates with the above error.
Can somebody help me solve this error?
Thanks.
On 10/2/07, Hiep Bui Hoang <bhoangh...@gmail.com> wrote:
Hi,
I set up Open MPI "trunk_16171" on 3 computers connected over a
LAN, set the environment parameters, and configured passwordless
ssh for each node. I use Red Hat Enterprise Linux 5. The program
I tried is 'send_recv'. It runs successfully on those 3 nodes, and
checkpoint/restart succeeds on 1 node. But I get an error when I
try to checkpoint/restart the program on 3 nodes.
$ mpirun -np 4 -host 172.28.11.40, 172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
....
Send 32 from rank 0
Receive 32 at rank 1
Send 33 from rank 0
Receive 33 at rank 1
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
Receive 34 at rank 1
Send 34 from rank 0
.....
The PID of the above mpirun is 5693.
$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
Returned -1.
--------------------------------------------------------------------------
Does anybody know about this error?
Thanks.
This is my 'send_recv' program:
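#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(int argc, char **argv)
{
    int node;
    int MAX = 1000;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    int i = 0;
    while (i <= MAX) {
        if (0 == node) {
            /* Rank 0 sends the counter to rank 1 (tag 10). */
            MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
            printf("Send %d from rank %d \n", i, node);
            sleep(1);
        }
        if (1 == node) {
            /* Rank 1 receives the counter from rank 0. */
            MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
            printf(" Receive %d at rank %d \n", i, node);
            sleep(1);
        }
        /* Ranks other than 0 and 1 do no communication and run
           through the loop without sleeping. */
        i++;
    }

    MPI_Finalize();
    return 0;
}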
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users