From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Friday, June 19, 2009, 2:48 PM
On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
Hello Josh,
Thank you again for your response. I tried checkpointing a simple C
program using BLCR and got the same error, i.e.:
- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address
So I would look at how your NFS file system is set up, and work with
your sysadmin (and maybe the BLCR list) to resolve this before
experimenting too much with checkpointing with Open MPI.
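A note on reading the error: the -14 in the BLCR messages is a negated Linux errno value, and errno 14 is EFAULT, which is where the "Bad address" text comes from. The following is a minimal sketch of that decoding, assuming a Linux system; `describe_kernel_rc` and `print_kernel_rc` are hypothetical helper names and are not part of BLCR or Open MPI.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a BLCR or Open MPI function): map a
 * kernel-style return code such as -14 to its errno description. */
const char *describe_kernel_rc(int rc)
{
    /* Kernel code reports failure as -errno, so negate before decoding. */
    return strerror(rc < 0 ? -rc : rc);
}

/* Print one decoded line for a return code taken from a log message. */
void print_kernel_rc(int rc)
{
    printf("rc=%d: %s%s\n", rc, describe_kernel_rc(rc),
           (-rc == EFAULT) ? " (EFAULT)" : "");
}
```

Calling `print_kernel_rc(-14)` decodes the value reported by vfs_write. The decoding only names the error; it does not say why the write to the NFS-mounted directory failed.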
This is how I installed and ran MPI programs for checkpointing:
1) Configure and install BLCR
2) Configure and install Open MPI
3) Compile and run the MPI program as follows:
4) To checkpoint the running program,
5) To restart your checkpoint, locate the checkpoint file and type the
following from the command line:
This all looks ok to me.
I did another test with BLCR, however. I tried checkpointing my C
application from the /tmp directory instead of my $HOME directory and it
checkpointed fine.
So, it looks like the problem is with my $HOME directory.
I have "drwx" rights on my $HOME directory, which seems fine to me.
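Permission bits alone do not show that a write will actually succeed on a network file system. As a rough illustration, the sketch below creates, writes, syncs, and removes a small probe file in a given directory so that $HOME and /tmp can be compared directly. `can_write_in_dir` and the probe file name are hypothetical, and a successful user-space write does not guarantee that BLCR's in-kernel write will succeed; it only rules out basic permission, quota, or mount problems.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: try to create, write, sync, and remove a small
 * probe file in `dir`. Returns 0 on success, or the errno value of the
 * first step that failed. */
int can_write_in_dir(const char *dir)
{
    char path[4096];
    const char msg[] = "probe\n";
    const size_t len = sizeof(msg) - 1;
    ssize_t written;
    int n, fd, err = 0;

    /* Build a per-process probe file name inside the target directory. */
    n = snprintf(path, sizeof(path), "%s/.ckpt_probe_%ld", dir, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(path))
        return ENAMETOOLONG;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno;

    written = write(fd, msg, len);
    if (written < 0)
        err = errno;
    else if ((size_t)written != len)
        err = EIO;                      /* short write without an errno */

    /* On NFS, errors may only surface when data is flushed or closed. */
    if (fsync(fd) != 0 && err == 0)
        err = errno;
    if (close(fd) != 0 && err == 0)
        err = errno;

    unlink(path);                       /* best-effort cleanup */
    return err;
}
```

For example, comparing `can_write_in_dir(getenv("HOME"))` with `can_write_in_dir("/tmp")` shows whether plain writes behave differently in the two locations.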
Then I tried it with Open MPI. However, with Open MPI the checkpoint
file automatically gets saved in the $HOME directory.
Is there a way to have the file saved in a different location? I
checked that LAM/MPI has some command line options:
$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
Do we have a similar option for Open MPI?
By default Open MPI places the global snapshot in the $HOME directory.
But you can also specify a different directory for the global snapshot
using the following MCA option:
-mca snapc_base_global_snapshot_dir /somewhere/else
For the best results you will likely want to set this in the MCA params
file in your home directory:
shell$ cat ~/.openmpi/mca-params.conf
snapc_base_global_snapshot_dir=/somewhere/else
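A third route, not mentioned in this thread, is the environment: Open MPI also reads MCA parameters from variables named with an `OMPI_MCA_` prefix. The sketch below assumes a small launcher program that sets the variable before starting mpirun; `set_snapshot_dir_env` is a hypothetical helper name.

```c
#include <stdlib.h>

/* Hypothetical helper: export the global snapshot directory as an MCA
 * parameter through the environment, using Open MPI's OMPI_MCA_ prefix.
 * Returns 0 on success and -1 on failure, exactly as setenv() does. */
int set_snapshot_dir_env(const char *dir)
{
    /* Third argument 1 means: overwrite any value that is already set. */
    return setenv("OMPI_MCA_snapc_base_global_snapshot_dir", dir, 1);
}
```

A launcher would call `set_snapshot_dir_env("/somewhere/else")` and then exec mpirun, which inherits the variable. The params file shown above remains the simpler choice when the setting should apply to every run.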
You can also stage the file to local disk, then have Open MPI transfer
the checkpoints back to a {logically} central storage device (both can
be /tmp on a local disk if you like). For more details on this and the
above option you will want to read through the FT Users Guide attached
to the wiki page at the link below:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
-- Josh
Thanks a lot
regards,
Raj
--- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Wednesday, June 17, 2009, 1:42 AM
Did you try checkpointing a non-MPI application with BLCR on the
cluster? If that does not work then I would suspect that BLCR is not
working properly on the system.
However if a non-MPI application can be checkpointed and restarted
correctly on this machine then it may be something odd with the Open
MPI installation or runtime environment. To help debug here I would
need to know how Open MPI was configured and how the application was
run on the machine (command line arguments, environment variables, ...).
I should note that for the program that you sent it is important that
you compile Open MPI with the Fault Tolerance Thread enabled to ensure
a timely checkpoint. Otherwise the checkpoint will be delayed until
the MPI program enters the MPI_Finalize function.
Let me know what you find out.
Josh
On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
Hi Josh,
Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on
my laptop with Ubuntu 8.04 on it. It works fine.
I have now tried the installation on the cluster at my university (on
one machine for now). The administrator installed it, and I am not
sure if he followed the steps I gave him.
I am checkpointing a simple MPI application which looks as follows:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>   /* system() */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Print, then sleep, three times so there is time to checkpoint. */
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("bye \n");

    MPI_Finalize();
    return 0;
}
Do you think it's better to reinstall BLCR?
Thanks
Raj
--- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Tuesday, June 16, 2009, 6:42 PM
These are errors from BLCR. It may be a problem with your BLCR
installation and/or your application. Are you able to
checkpoint/restart a non-MPI application with BLCR on these machines?
What kind of MPI application are you trying to checkpoint? Some of the
MPI interfaces are not fully supported at the moment (outlined in the
FT User Document that I mentioned in a previous email).
-- Josh
On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
Dear All,
I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine
(Ubuntu). However, when I try checkpointing an MPI application, I get
the following error:
- vfs_write returned -14
- file_header: write returned -14
Can someone help, please?
Regards,
Raj
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users