Hi,

I compiled and installed Open MPI with C/R support the way Josh described. The resulting build includes the C/R tools ompi-checkpoint and ompi-restart. I then tried an example (hello_c.c from the examples folder, edited to add a for loop that prints "Hello..." 1,000,000 times), but I get this error:

Error: The application (PID = 23573) failed to checkpoint properly. Returned -1.

The steps I followed:

# mpicc hello_c.c -o hello
# mpirun -np 4 -am ft-enable-cr hello

I got the PID of this mpirun from another shell and ran:

# ompi-checkpoint 23573
Error: The application (PID = 23573) failed to checkpoint properly. Returned -1.

What causes this error? Could you give me an example of using C/R in Open MPI?

Hiep

hello_c.c:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for(i=0; i<1000000; i++){
        printf("%d Hello, world, I am %d of %d\n",i,rank, size);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

On 8/22/07, Josh Hursey <jjhur...@open-mpi.org> wrote:
>
> Hello,
>
> There are a few things you need to do to build Open MPI with
> Checkpoint/Restart support. By default Open MPI is configured without
> checkpoint/restart support.
> 1) Make sure you have BLCR successfully installed and loaded on your
> system(s)
> 2) configure Open MPI with the "--with-ft=cr" option, which enables
> checkpoint/restart fault tolerance
> Note: you may also have to specify the install directory of BLCR
> with the "--with-blcr=/path/to/blcr"
> 3) make and make install
>
> The resultant build will have support for checkpoint/restart and the
> tools (e.g., ompi-checkpoint, ompi-restart) will become available.
>
> Looking at the documentation it doesn't seem to include these steps.
> I'll fix that later today, and post an updated file to the wiki.
> Sorry about that. :(
>
> Hope this helps,
> Josh
>
> On Aug 21, 2007, at 1:09 PM, Hiep Bui Hoang wrote:
>
> > Hello,
> > I'm Hiep, I'm trying to use the checkpoint/restart feature in Open MPI.
> > I have read the information about this feature in
> > https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR and
> > Open-MPI-FT-CR-Draft-v1.pdf. I built Open MPI from "trunk", obtained via Subversion.
> > But I don't know how to enable checkpoint/restart fault tolerance
> > in Open MPI.
> > So I get this error when I try the ompi-checkpoint command:
> > bash: ompi-checkpoint: command not found
> > I want to ask you how to build and use the checkpoint/restart feature
> > in Open MPI.
> > Please explain in detail, I'm a new user.
> > Thanks!
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
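
For anyone retracing the steps in this thread, the prerequisites Josh lists (BLCR installed and loaded, and a build that provides ompi-checkpoint and ompi-restart) can be checked with a short script before attempting a checkpoint. This is only a sketch under stated assumptions: the check_tool helper is a hypothetical name introduced here, and looking for "blcr" in /proc/modules is an assumption about how a loaded BLCR kernel module appears on Linux. It does not diagnose the "Returned -1" error itself; it only rules out a missing tool or an unloaded module.

```shell
#!/bin/sh
# Pre-flight checks before trying ompi-checkpoint (a sketch, not an official tool).

# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Tools that, according to the thread, a C/R-enabled build makes available.
for tool in mpirun ompi-checkpoint ompi-restart; do
    check_tool "$tool"
done

# Assumption: a loaded BLCR kernel module is listed in /proc/modules
# under a name containing "blcr".
if grep -q blcr /proc/modules 2>/dev/null; then
    echo "BLCR module: loaded"
else
    echo "BLCR module: not detected"
fi
```

If any tool is reported missing, the build was probably configured without "--with-ft=cr"; if the module is not detected, step 1 of Josh's list is the place to look first.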