Did you try checkpointing a non-MPI application with BLCR on the cluster? If that does not work then I would suspect that BLCR is not working properly on the system.

However, if a non-MPI application can be checkpointed and restarted correctly on this machine, then the problem may lie with the Open MPI installation or runtime environment. To help debug, I would need to know how Open MPI was configured and how the application was run on the machine (command line arguments, environment variables, ...).

I should note that for the program that you sent it is important that you compile Open MPI with the Fault Tolerance Thread enabled to ensure a timely checkpoint. Otherwise the checkpoint will be delayed until the MPI program enters the MPI_Finalize function.

Let me know what you find out.

Josh

On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:


Hi Josh,

Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on my laptop, which runs Ubuntu 8.04. It works fine.

I have now tried the installation on the cluster at my university (on one machine for now). The administrator installed it, and I am not sure whether he followed the steps I gave him.

I am checkpointing a simple MPI application, which looks as follows:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>   /* needed for system() */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("bye \n");
    MPI_Finalize();
    return 0;
}

Do you think it's better to reinstall BLCR?


Thanks

Raj
--- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] vfs_write returned -14
To: "Open MPI Users" <us...@open-mpi.org>
Date: Tuesday, June 16, 2009, 6:42 PM

These are errors from BLCR. It may be a problem with your
BLCR installation and/or your application. Are you able to
checkpoint/restart a non-MPI application with BLCR on these
machines?

What kind of MPI application are you trying to checkpoint?
Some of the MPI interfaces are not fully supported at the
moment (outlined in the FT User Document that I mentioned in
a previous email).

-- Josh

On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:


Dear All,

I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine (Ubuntu).
However, when I try checkpointing an MPI application, I get
the following error:

- vfs_write returned -14
- file_header: write returned -14

Can someone help, please?

Regards,

Raj





_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
