Hi,

I have set up Open MPI "trunk_16171" on 3 computers connected over a LAN, set the environment variables, and configured passwordless ssh to each node. I am using Red Hat Enterprise Linux 5. The program I tried is 'send_recv'. It runs successfully across those 3 nodes, and checkpoint/restart works on 1 node. But I get an error when I try to checkpoint/restart the program on 3 nodes.

$ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
....
Send 32 from rank 0
Receive 32 at rank 1
Send 33 from rank 0
Receive 33 at rank 1
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
Receive 34 at rank 1
Send 34 from rank 0
.....

The PID of the above mpirun is 5693.

$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
Returned -1.
--------------------------------------------------------------------------

Does anybody know about this error?

Thanks.

This is my 'send_recv' program:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    int MAX = 1000;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    int i = 0;
    while (i <= MAX) {
        /* Rank 0 sends the counter to rank 1 once per second. */
        if (0 == node) {
            MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
            printf("Send %d from rank %d \n", i, node);
            sleep(1);
        }
        /* Rank 1 receives the counter from rank 0. */
        if (1 == node) {
            MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
            printf(" Receive %d at rank %d \n", i, node);
            sleep(1);
        }
        /* All other ranks only count up to MAX and then finalize. */
        i++;
    }

    MPI_Finalize();
    return 0;
}