Hi,

I found that the problem was the firewall on one of my computers. After I configured the firewall to allow TCP connections from the other computers on ports 1024 to 4999, the connection errors went away. But I still cannot checkpoint and restart my program.
The error is:

$ mpirun -np 3 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
Returned -1.
--------------------------------------------------------------------------

Only one local snapshot is created, on the computer where I run mpirun and ompi-checkpoint. After that local snapshot is created, the checkpoint terminates with the error above.

Could somebody help me solve this error?
Thanks.

On 10/2/07, Hiep Bui Hoang <bhoangh...@gmail.com> wrote:
>
> Hi,
> I set up Open MPI "trunk_16171" on 3 computers connected over a LAN,
> set the environment parameters, and configured password-less ssh for
> each node. I use Red Hat Enterprise Linux 5. The program I tried is
> 'send_recv'. It runs successfully on those 3 nodes, and
> checkpoint/restart works on 1 node. But I get an error when I try to
> checkpoint/restart the program on 3 nodes.
>
> $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
>
> ....
> Send 32 from rank 0
> Receive 32 at rank 1
> Send 33 from rank 0
> Receive 33 at rank 1
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
> Receive 34 at rank 1
> Send 34 from rank 0
> .....
>
> The PID of the above mpirun is 5693.
> $ ompi-checkpoint 5693
> --------------------------------------------------------------------------
> Error: The application (PID = 5693) failed to checkpoint properly.
> Returned -1.
> --------------------------------------------------------------------------
>
> Does anybody know about this error?
> Thanks.
>
> This is my 'send_recv' program:
>
> #include <stdio.h>
> #include <unistd.h>   /* sleep() */
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int node;
>     int MAX = 1000;
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &node);
>
>     int i = 0;
>     while (i <= MAX) {
>         /* Rank 0 sends the counter to rank 1 once per second. */
>         if (0 == node) {
>             MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
>             printf("Send %d from rank %d \n", i, node);
>             sleep(1);
>         }
>         /* Rank 1 receives the counter; other ranks only count. */
>         if (1 == node) {
>             MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
>             printf(" Receive %d at rank %d \n", i, node);
>             sleep(1);
>         }
>         i++;
>     }
>     MPI_Finalize();
>     return 0;
> }
>
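Regarding the firewall rule mentioned at the top of this message (TCP ports 1024 to 4999): a small probe like the one below might help confirm from each node whether a given port, such as 3680 from the log, is reachable. This is only a sketch under assumptions. `port_in_range` and `probe_tcp` are hypothetical helper names, not part of Open MPI, and the probe uses a plain blocking `connect()`, so it can take a long time to return if the firewall silently drops packets.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if port lies in [lo, hi], else 0. */
int port_in_range(int port, int lo, int hi)
{
    return (port >= lo && port <= hi) ? 1 : 0;
}

/*
 * Hypothetical helper: attempt one blocking TCP connect to ip:port.
 * Returns  0 if the connection was established,
 *          1 if connect() failed (errno is left set to the reason),
 *         -1 if the arguments are invalid or no socket could be created.
 */
int probe_tcp(const char *ip, int port)
{
    struct sockaddr_in addr;
    int fd, rc, saved_errno;

    if (ip == NULL || port < 1 || port > 65535)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);

    /* Only dotted-quad IPv4 addresses are accepted in this sketch. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    saved_errno = errno;   /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return (rc == 0) ? 0 : 1;
}
```

Calling `probe_tcp("172.28.11.40", 3680)` from another node would return 0 only if something is listening on that port and the firewall lets the connection through. A return of 1 with `errno == ECONNREFUSED` usually means the host was reached but nothing is listening there, which is a different situation from a blocked port.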