Hi,
I found that the problem was the firewall on one of my computers. After I
configured the firewall to allow TCP connections from the other computers on
ports 1024 to 4999, the connection errors went away. But I still cannot
checkpoint and restart my program.
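
A minimal sketch, assuming IPv4 and a POSIX system, of how one could check
from each node whether a given TCP port on another node is reachable through
the firewall. The helper name tcp_probe and its timeout parameter are
hypothetical; they are not part of Open MPI.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI API): try a TCP connect to ip:port.
 * Returns 0 if the connection succeeded, otherwise an errno value:
 * ETIMEDOUT if nothing answered within timeout_ms (typical when a firewall
 * silently drops packets), ECONNREFUSED if the host answered but nothing
 * is listening, EINVAL for a malformed address or port. */
int tcp_probe(const char *ip, int port, int timeout_ms)
{
    assert(ip != NULL);
    if (port < 1 || port > 65535)
        return EINVAL;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    /* Non-blocking connect, so a dropped SYN cannot hang the probe. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        int saved = errno;
        close(fd);
        return saved;
    }

    int err = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (errno != EINPROGRESS) {
            err = errno;
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int rc = poll(&pfd, 1, timeout_ms);
            if (rc == 0) {
                err = ETIMEDOUT;
            } else if (rc < 0) {
                err = errno;
            } else {
                /* The connect finished; fetch its final status. */
                socklen_t len = sizeof(err);
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                    err = errno;
            }
        }
    }
    close(fd);
    return err;
}
```

For example, calling tcp_probe("172.28.11.40", 3680, 2000) from each of the
other nodes while the job is running would test the address and port that
appear in the quoted log. That port may differ from run to run, so it should
be read from the current error output.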

The error is:
$ mpirun -np 3 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
       Returned -1.

--------------------------------------------------------------------------

Only one local snapshot is created, on the computer where I run the mpirun
and ompi-checkpoint commands. After that local snapshot is created, the
checkpoint terminates with the error above.
Could somebody help me solve this error?
Thanks.

On 10/2/07, Hiep Bui Hoang <bhoangh...@gmail.com> wrote:
>
>
> Hi,
> I set up Open MPI "trunk_16171" on 3 computers connected over a LAN, set the
> environment parameters, and configured passwordless ssh for each node. I use
> Red Hat Enterprise Linux 5. The program I tried is 'send_recv'. It runs
> successfully on those 3 nodes, and checkpoint/restart works on 1 node. But I
> get an error when I try to checkpoint/restart the program on 3 nodes.
>
>     $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
>
> ....
> Send 32 from rank 0
>  Receive 32 at rank 1
> Send 33 from rank 0
>  Receive 33 at rank 1
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed: Software caused connection abort (103)
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed, connecting over all interfaces failed!
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed: Software caused connection abort (103)
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed, connecting over all interfaces failed!
>  Receive 34 at rank 1
> Send 34 from rank 0
> .....
>
> The PID of the above mpirun is 5693.
>     $ ompi-checkpoint 5693
> --------------------------------------------------------------------------
> Error: The application (PID = 5693) failed to checkpoint properly.
>        Returned -1.
>
> --------------------------------------------------------------------------
>
>
> Does anybody know about this error?
> Thanks.
>
> This is my 'send_recv' program:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
>    int node;
>    int MAX = 1000;
>    MPI_Status status;
>    MPI_Init(&argc,&argv);
>    MPI_Comm_rank(MPI_COMM_WORLD, &node);
>
>    int    i = 0;
>    while( i <= MAX){
>     if( 0 == node){
>         MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
>         printf("Send %d from rank %d \n",i, node);
>         sleep(1);
>     }
>     if( 1 == node ){
>         MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD,
>                      &status);
>         printf(" Receive %d at rank %d \n",i,node);
>         sleep(1);
>     }
>     i++;
>    }
>    MPI_Finalize();
>    return 0;
> }
>
>
