On Jun 20, 2009, at 1:48 PM, Kritiraj Sajadah wrote:


Hi Josh,
Thank you for the email. I can now checkpoint the application on the cluster using Open MPI. But I am now facing another problem.

When I tried restarting the checkpoint, nothing happened. I copied the checkpoint file to the $HOME directory, tried restarting it there, and got the following error:

- open('/var/cache/nscd/passwd', 0x0) failed: -13
- mmap failed: /var/cache/nscd/passwd
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
Restart failed: Permission denied

On my laptop it works fine. So, I am assuming it's again something to do with my $HOME directory.

This issue is documented in the BLCR FAQ:
  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

I would follow the directions there to resolve this issue.


Is it possible to restart the checkpoint from the /tmp directory itself, without having to copy it back to the $HOME directory?

The '--preload' or '-p' option to ompi-restart will let you restart a parallel job without a shared file system. I believe that the FT User's Guide outlines this option as well (if it does not, let me know and I'll add some text for it).
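
As a rough illustration of the option described above, the sketch below restarts a job with '--preload'. The snapshot name is a hypothetical placeholder (use the global snapshot reference reported when the checkpoint was taken), and the guard only prints the command on machines where ompi-restart is not installed.

```shell
#!/bin/sh
# Hypothetical global snapshot reference; replace with the one from your checkpoint.
snapshot="ompi_global_snapshot_1234.ckpt"

if command -v ompi-restart >/dev/null 2>&1; then
    # --preload (or -p) is the option named in the reply above.
    ompi-restart --preload "$snapshot"
else
    echo "ompi-restart not found; would run: ompi-restart --preload $snapshot"
fi
```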



Is there another way to compile and build Open MPI so that everything happens in the /tmp directory instead of the $HOME directory?

There are no compile time options for this, just the runtime options that I previously mentioned.

Best,
Josh



Thank you

Raj

--- On Fri, 6/19/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Friday, June 19, 2009, 2:48 PM

On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:


Hello Josh,
Thank you again for your response. I tried checkpointing a simple C program using BLCR and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address

So I would look at how your NFS file system is set up, and work with your sysadmin (and maybe the BLCR list) to resolve this before experimenting too much with checkpointing with Open MPI.
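
One quick check before going to the sysadmin is to see which file system type backs $HOME compared with /tmp. This is a minimal sketch, assuming GNU coreutils stat is available; the type names it prints (for example nfs or tmpfs) depend on the system.

```shell
#!/bin/sh
# Print the file system type behind each directory.
# Assumes GNU coreutils: "stat -f" queries the file system, "%T" is its type name.
for dir in "$HOME" /tmp; do
    fstype=$(stat -f -c %T "$dir")
    echo "$dir is on: $fstype"
done
```

If $HOME reports an NFS type while /tmp reports a local one, that matches the behaviour described in this thread.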


This is how I installed and ran MPI programs for checkpointing:

1) Configure and install BLCR
2) Configure and install Open MPI
3) Compile and run the MPI program as follows:
4) To checkpoint the running program,
5) To restart your checkpoint, locate the checkpoint file and type the following from the command line:


This all looks ok to me.

I did another test with BLCR, however.

I tried checkpointing my C application from the /tmp directory instead of my $HOME directory and it checkpointed fine.

So, it looks like the problem is with my $HOME directory.

I have "drwx" rights on my $HOME directory, which seems fine to me.

Then I tried it with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory.

Is there a way to have the file saved in a different location? I checked that LAM/MPI has some command line options:

$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?

By default Open MPI places the global snapshot in the $HOME directory. But you can also specify a different directory for the global snapshot using the following MCA option:
   -mca snapc_base_global_snapshot_dir /somewhere/else

For the best results you will likely want to set this in the MCA params file in your home directory:
  shell$ cat ~/.openmpi/mca-params.conf
  snapc_base_global_snapshot_dir=/somewhere/else
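
A minimal sketch of setting this parameter from the shell follows. The directory /tmp/ompi_snapshots is a hypothetical choice, not something from this thread; the file path and parameter name are the ones given above. The line is only appended if the parameter is not already present.

```shell
#!/bin/sh
# Hypothetical snapshot location; pick any directory that checkpoints cleanly.
snapshot_dir="/tmp/ompi_snapshots"
conf_dir="$HOME/.openmpi"
conf_file="$conf_dir/mca-params.conf"

mkdir -p "$snapshot_dir" "$conf_dir"

# Append the setting only once, so rerunning the script is harmless.
if ! grep -q '^snapc_base_global_snapshot_dir' "$conf_file" 2>/dev/null; then
    echo "snapc_base_global_snapshot_dir=$snapshot_dir" >> "$conf_file"
fi

# Show the resulting setting.
grep '^snapc_base_global_snapshot_dir' "$conf_file"
```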

You can also stage the file to local disk, then have Open MPI transfer the checkpoints back to a {logically} central storage device (both can be /tmp on a local disk if you like). For more details on this and the above option you will want to read through the FT Users Guide attached to the wiki page at the link below:
   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

-- Josh



Thanks a lot

regards,

Raj

--- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Wednesday, June 17, 2009, 1:42 AM
Did you try checkpointing a non-MPI application with BLCR on the cluster? If that does not work then I would suspect that BLCR is not working properly on the system.

However, if a non-MPI application can be checkpointed and restarted correctly on this machine then it may be something odd with the Open MPI installation or runtime environment. To help debug here I would need to know how Open MPI was configured and how the application was run on the machine (command line arguments, environment variables, ...).

I should note that for the program that you sent it is important that you compile Open MPI with the Fault Tolerance Thread enabled to ensure a timely checkpoint. Otherwise the checkpoint will be delayed until the MPI program enters the MPI_Finalize function.
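
For reference, a hedged sketch of what such a build might look like. The configure flags below are given as recalled from the Open MPI 1.3 fault tolerance documentation and are assumptions here, as is the BLCR prefix; confirm them against ./configure --help for your release. The script only prints the command unless it is run from an Open MPI source tree.

```shell
#!/bin/sh
# Hypothetical BLCR install prefix; adjust to where BLCR lives on your system.
blcr_prefix="/usr/local"

# Assumed flags (verify with ./configure --help):
#   --with-ft=cr          build checkpoint/restart support
#   --enable-ft-thread    enable the Fault Tolerance Thread mentioned above
#   --enable-mpi-threads  thread support assumed to be needed by the FT thread
#   --with-blcr=DIR       location of the BLCR installation
configure_args="--with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=$blcr_prefix"

if [ -x ./configure ]; then
    # Intentionally unquoted so each flag is passed as a separate argument.
    ./configure $configure_args
else
    echo "no ./configure here; would run: ./configure $configure_args"
fi
```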

Let me know what you find out.

Josh

On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:


Hi Josh,

Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on my laptop with Ubuntu 8.04 on it. It works fine.

I have now tried the installation on the cluster (on one machine for now) at my university. The administrator installed it, and I am not sure if he followed the steps I gave him.

I am checkpointing a simple MPI application which looks as follows:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>   /* system() */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Print, then pause 30 seconds, three times, leaving time to take a checkpoint. */
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("bye \n");

    MPI_Finalize();
    return 0;
}

Do you think it's better to reinstall BLCR?


Thanks

Raj
--- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Tuesday, June 16, 2009, 6:42 PM

These are errors from BLCR. It may be a problem with your BLCR installation and/or your application. Are you able to checkpoint/restart a non-MPI application with BLCR on these machines?

What kind of MPI application are you trying to checkpoint? Some of the MPI interfaces are not fully supported at the moment (outlined in the FT User Document that I mentioned in a previous email).

-- Josh

On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:


Dear All,

I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine (Ubuntu). However, when I try checkpointing an MPI application, I get the following error:

- vfs_write returned -14
- file_header: write returned -14

Can someone help, please?

Regards,

Raj






_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

