Hi,
I have set up Open MPI "trunk_16171" on 3 computers connected over a LAN,
set the environment variables, and configured ssh so that no password is
needed for any node. I use Red Hat Enterprise Linux 5. The program I tried
is 'send_recv'. It runs successfully on those 3 nodes, and checkpoint/restart
works on 1 node. However, I get an error when I try to checkpoint/restart
the program on 3 nodes.

    $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv

....
Send 32 from rank 0
 Receive 32 at rank 1
Send 33 from rank 0
 Receive 33 at rank 1
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
172.28.11.40:3680 failed: Software caused connection abort (103)
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
172.28.11.40:3680 failed, connecting over all interfaces failed!
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
172.28.11.40:3680 failed: Software caused connection abort (103)
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
172.28.11.40:3680 failed, connecting over all interfaces failed!
 Receive 34 at rank 1
Send 34 from rank 0
.....

The PID of the above mpirun is 5693.
    $ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
       Returned -1.

--------------------------------------------------------------------------

Does anybody know what causes this error?
Thanks.

This is my 'send_recv' program:

#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <mpi.h>

int main(int argc, char **argv)
{
   int node;
   int MAX = 1000;
   MPI_Status status;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   /* Rank 0 sends the counter to rank 1 once per second.
      Any other rank only increments i until the loop ends. */
   int i = 0;
   while (i <= MAX) {
      if (0 == node) {
         MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
         printf("Send %d from rank %d \n", i, node);
         sleep(1);
      }
      if (1 == node) {
         MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD,
                  &status);
         printf(" Receive %d at rank %d \n", i, node);
         sleep(1);
      }
      i++;
   }
   MPI_Finalize();
   return 0;
}
