This problem was caused by a couple of things.

First, there is a problem with the default MCA parameters. By default, the global and local snapshot directories are '/tmp', and the file transfer mode is 'in_place'. The 'in_place' transfer mode assumes that the global snapshot directory points to an NFS-mounted directory that all machines can access; typically '/tmp' is not such a directory. :(

I'll likely change the defaults (in the next day or so) so that the default global snapshot directory is $HOME or $CWD. Of course, all of this behavior can be changed by modifying the MCA parameters for the global and local snapshot directories and the transfer mechanism. The MCA parameters in question are described in the Checkpoint/Restart user's guide at the link below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
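
For example, to point the global snapshot directory at an NFS-shared directory while keeping the local snapshots in /tmp, the parameters can be set right on the mpirun command line. (The parameter names below, snapc_base_global_snapshot_dir and crs_base_snapshot_dir, are the ones the guide describes; double-check them against your build, since names on the trunk can shift.)

    $ mpirun -np 3 -am ft-enable-cr \
          -mca snapc_base_global_snapshot_dir $HOME/snapshots \
          -mca crs_base_snapshot_dir /tmp \
          send_recv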

Once we got around this problem, we discovered a problem with restarting on a local machine without the aid of a resource manager (e.g., SLURM, Torque, etc.). This bug was fixed in r16433.
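
For reference, once both fixes are in, a checkpoint/restart cycle without a resource manager looks roughly like this (the snapshot handle is a placeholder; ompi-checkpoint prints the real one when it succeeds):

    $ mpirun -np 3 -am ft-enable-cr send_recv &
    $ ompi-checkpoint <PID of mpirun>
    $ ompi-restart <snapshot handle printed by ompi-checkpoint>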

The combination of these two items fixed the problems that Hiep was experiencing.

-- Josh

On Oct 10, 2007, at 11:04 AM, Josh Hursey wrote:

For anyone following this thread: I am following up with Hiep offline. I'll reply back to the list once the issue is resolved.

-- Josh

On Oct 3, 2007, at 11:11 AM, Hiep Bui Hoang wrote:

Hi,
I found that the problem was the firewall on one of my computers. When I configured the firewall to allow TCP connections from the other computers on ports 1024 through 4999, the connection errors went away. But I still cannot checkpoint and restart my program.
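
In case it helps anyone else, an iptables rule along these lines opens such a range (just a sketch; restrict the source addresses to your own subnet as appropriate):

    # allow inbound TCP connections on ports 1024-4999
    iptables -A INPUT -p tcp --dport 1024:4999 -j ACCEPT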

The error is:
$ mpirun -np 3 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
$ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
       Returned -1.

--------------------------------------------------------------------------

Only one local snapshot is created, on the computer where I run the mpirun and ompi-checkpoint commands; after that local snapshot is created, the checkpoint terminates with the error above.
Could somebody help me solve this error?
Thanks.

On 10/2/07, Hiep Bui Hoang <bhoangh...@gmail.com> wrote:
Hi,
I have set up Open MPI trunk (r16171) on 3 computers connected over a LAN, set the environment variables, and configured passwordless ssh for each node. I use Red Hat Enterprise Linux 5. The program I am testing is 'send_recv'. It runs successfully on all 3 nodes, and checkpoint/restart works on 1 node, but I get an error when I try to checkpoint/restart the program across all 3 nodes.

    $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv

....
Send 32 from rank 0
 Receive 32 at rank 1
Send 33 from rank 0
 Receive 33 at rank 1
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed: Software caused connection abort (103)
[node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to 172.28.11.40:3680 failed, connecting over all interfaces failed!
 Receive 34 at rank 1
Send 34 from rank 0
.....

The PID of the above mpirun is 5693.
    $ ompi-checkpoint 5693
--------------------------------------------------------------------------
Error: The application (PID = 5693) failed to checkpoint properly.
       Returned -1.

--------------------------------------------------------------------------

Does anybody know about this error?
Thanks.

This is my 'send_recv' program:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int node;
   int MAX = 1000;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   /* Rank 0 sends the counter to rank 1, which receives it;
    * any other ranks just spin through the loop. */
   int i = 0;
   while (i <= MAX) {
       if (0 == node) {
           MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
           printf("Send %d from rank %d \n", i, node);
           sleep(1);
       }
       if (1 == node) {
           MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
           printf(" Receive %d at rank %d \n", i, node);
           sleep(1);
       }
       i++;
   }

   MPI_Finalize();
   return 0;
}
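
It compiles with the standard MPI wrapper compiler and runs with the command lines shown above:

    $ mpicc send_recv.c -o send_recv
    $ mpirun -np 3 -am ft-enable-cr send_recv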

