On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:


Hello Josh,
Thank you again for your response. I tried checkpointing a simple C program using BLCR and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address

So I would look at how your NFS file system is set up, and work with your sysadmin (and maybe the BLCR list) to resolve this before experimenting too much with checkpointing with Open MPI.
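
As a first check before involving the sysadmin, a minimal sketch like the following compares an ordinary write in two directories and prints the mount that backs each one. `probe_dir` is a hypothetical helper name, not a BLCR or Open MPI tool. A successful user-space write does not guarantee that BLCR's in-kernel write (the `vfs_write` in the error) will succeed; it only helps narrow down whether the directory or its file system is the odd one out.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or Open MPI): report which file
# system backs a directory and whether a plain user-space write works there.
probe_dir() {
    dir=$1
    probe="$dir/.ckpt_write_probe.$$"

    # df -P prints one line per mount in the portable format; a
    # "host:/path" device in the first column usually indicates NFS.
    df -P "$dir" | tail -n 1

    # Write 1 MiB of zeros, roughly imitating a small checkpoint file.
    if dd if=/dev/zero of="$probe" bs=1024 count=1024 2>/dev/null; then
        rm -f "$probe"
        echo "write ok: $dir"
    else
        rm -f "$probe"
        echo "write FAILED: $dir"
        return 1
    fi
}
```

Running `probe_dir /tmp` and `probe_dir "$HOME"` side by side shows whether the two directories sit on different file systems, which is useful information to bring to the sysadmin or the BLCR list.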


This is how I installed and ran MPI programs for checkpointing:

1) Configure and install BLCR.
2) Configure and install Open MPI.
3) Compile and run the MPI program as follows:
4) To checkpoint the running program,
5) To restart your checkpoint, locate the checkpoint file and type the following from the command line:


This all looks ok to me.

I did another test with BLCR, however.

I tried checkpointing my C application from the /tmp directory instead of my $HOME directory, and it checkpointed fine.

So, it looks like the problem is with my $HOME directory.

I have "drwx" rights on my $HOME directory which seems fine for me.

Then I tried it with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory.

Is there a way to have the file saved in a different location? I saw that LAM/MPI has a command-line option for this:

$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?

By default Open MPI places the global snapshot in the $HOME directory. But you can also specify a different directory for the global snapshot using the following MCA option:
  -mca snapc_base_global_snapshot_dir /somewhere/else

For the best results you will likely want to set this in the MCA params file in your home directory:
 shell$ cat ~/.openmpi/mca-params.conf
 snapc_base_global_snapshot_dir=/somewhere/else
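
A minimal sketch of maintaining that setting, assuming a POSIX shell. `set_snapshot_dir` is a hypothetical helper, not an Open MPI command; it rewrites the per-user file so the parameter appears exactly once. Open MPI also reads MCA parameters from environment variables prefixed with `OMPI_MCA_`, shown here as a commented alternative.

```shell
#!/bin/sh
# Hypothetical helper: record the global snapshot directory in the
# per-user MCA parameter file without duplicating an existing entry.
set_snapshot_dir() {
    snapdir=$1
    conf="$HOME/.openmpi/mca-params.conf"

    # Assumption: creating the target directory up front is acceptable.
    mkdir -p "$snapdir" "$HOME/.openmpi"
    touch "$conf"

    # Drop any earlier setting of this parameter, then append the new one.
    grep -v '^[[:space:]]*snapc_base_global_snapshot_dir[[:space:]]*=' \
        "$conf" > "$conf.tmp"
    echo "snapc_base_global_snapshot_dir=$snapdir" >> "$conf.tmp"
    mv "$conf.tmp" "$conf"
}

# Alternative for a single shell session, using the OMPI_MCA_ prefix:
# export OMPI_MCA_snapc_base_global_snapshot_dir=/somewhere/else
```

For example, `set_snapshot_dir /somewhere/else` leaves a single `snapc_base_global_snapshot_dir=/somewhere/else` line in `~/.openmpi/mca-params.conf` and keeps all other settings in that file.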

You can also stage the file to local disk, then have Open MPI transfer the checkpoints back to a (logically) central storage device (both can be /tmp on a local disk if you like). For more details on this and the above option, read through the FT Users Guide attached to the wiki page at the link below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

-- Josh



Thanks a lot

regards,

Raj

--- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Wednesday, June 17, 2009, 1:42 AM

Did you try checkpointing a non-MPI application with BLCR on the cluster? If that does not work then I would suspect that BLCR is not working properly on the system.

However, if a non-MPI application can be checkpointed and restarted correctly on this machine, then it may be something odd with the Open MPI installation or runtime environment. To help debug here I would need to know how Open MPI was configured and how the application was run on the machine (command line arguments, environment variables, ...).

I should note that for the program that you sent it is important that you compile Open MPI with the Fault Tolerance Thread enabled to ensure a timely checkpoint. Otherwise the checkpoint will be delayed until the MPI program enters the MPI_Finalize function.

Let me know what you find out.

Josh

On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:


Hi Josh,

Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on my laptop running Ubuntu 8.04, and it works fine.

I have now tried the installation on the cluster at my university (on one machine for now). The administrator installed it, and I am not sure if he followed the steps I gave him.

I am checkpointing a simple MPI application which looks as follows:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>  /* for system() */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Print, then sleep, three times to leave a window for checkpointing. */
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");

    printf("bye \n");
    MPI_Finalize();
    return 0;
}

Do you think it's better to reinstall BLCR?


Thanks

Raj
--- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Tuesday, June 16, 2009, 6:42 PM

These are errors from BLCR. It may be a problem with your BLCR installation and/or your application. Are you able to checkpoint/restart a non-MPI application with BLCR on these machines?

What kind of MPI application are you trying to checkpoint? Some of the MPI interfaces are not fully supported at the moment (outlined in the FT User Document that I mentioned in a previous email).

-- Josh

On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:


Dear All,

I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine (Ubuntu). However, when I try checkpointing an MPI application, I get the following error:

- vfs_write returned -14
- file_header: write returned -14

Can someone help, please?

Regards,

Raj






_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
