The problem is not with your setup, but with a limitation of the current checkpoint/restart implementation in Open MPI. At present, Open MPI requires that the MPI process be inside the MPI library in order to make progress on a checkpoint request. This is because all checkpoint coordination in Open MPI occurs in the same thread as the main MPI process. So if you make a call to an MPI function inside your loop, the checkpoint will be able to make progress.

This is different from the LAM/MPI checkpoint/restart implementation, in which coordination occurred in a concurrent thread, so the application did not need to be in the MPI library in order to make progress on a checkpoint request.

I am currently developing a solution for this, but I can't say when it will be completed. Hopefully before the v1.3 release.

Sorry I don't have a better solution at the moment.

Best,
Josh

On Aug 22, 2007, at 2:27 AM, Hiep Bui Hoang wrote:

Hi,
I compiled and installed Open MPI with C/R support the way Josh described. When it finished, Open MPI had C/R support and the C/R tools: ompi-checkpoint and ompi-restart. I then tried an example (hello_c.c from the examples folder, edited to add a for loop that prints "Hello..." 1,000,000 times).
But I get this error:
Error: The application (PID = 23573) failed to checkpoint properly.
       Returned -1.

The steps I followed:
         # mpicc hello_c.c -o hello
         # mpirun -np 4 -am ft-enable-cr hello

    I got the PID of this mpirun from another shell and ran:
         # ompi-checkpoint 23573
Error: The application (PID = 23573) failed to checkpoint properly.
                 Returned -1.

What is causing this error?
Could you give me an example of how to use C/R in Open MPI?
Hiep

hello_c.c
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[])
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < 1000000; i++) {
        printf("%d Hello, world, I am %d of %d\n", i, rank, size);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}



On 8/22/07, Josh Hursey <jjhur...@open-mpi.org> wrote:

Hello,

There are a few things you need to do to build Open MPI with
Checkpoint/Restart support. By default Open MPI is configured without
checkpoint/restart support.
1) Make sure you have BLCR successfully installed and loaded on your
system(s)
2) configure Open MPI with the "--with-ft=cr" option, which enables
checkpoint/restart fault tolerance
  Note: you may also have to specify the install directory of BLCR
with the "--with-blcr=/path/to/blcr"
3) make and make install

The resultant build will have support for checkpoint/restart and the
tools (e.g., ompi-checkpoint, ompi-restart) will become available.

Looking at the documentation, it doesn't seem to include these steps.
I'll fix that later today and post an updated file to the wiki.
Sorry about that. :(

Hope this helps,
Josh

On Aug 21, 2007, at 1:09 PM, Hiep Bui Hoang wrote:

> Hello,
> I'm Hiep, and I'm trying to use the checkpoint/restart feature in Open MPI.
> I have read about this feature in
> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR and
> Open-MPI-FT-CR-Draft-v1.pdf. I built Open MPI from the "trunk", which I
> got via Subversion.
> But I don't know how to enable checkpoint/restart fault tolerance
> in Open MPI.
> As a result, I get this error when I try the ompi-checkpoint command:
>        bash: ompi-checkpoint: command not found
> How do I build and use the checkpoint/restart feature
> in Open MPI?
> Please explain in detail; I'm a new user.
> Thanks!
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
