Hi everyone,

Happy New Year 2010. A few weeks ago I posted a query (please see the
email below) about checkpointing applications running on multiple hosts. I
am still struggling to find a solution. I would really appreciate it if
someone could help me.

Thank you.

Raj




--- On Sat, 12/12/09, Kritiraj Sajadah <ksaja...@yahoo.com> wrote:

> From: Kritiraj Sajadah <ksaja...@yahoo.com>
> Subject: Problem with checkpointing multihosts, multiprocesses MPI application
> To: us...@open-mpi.org
> Date: Saturday, December 12, 2009, 3:03 PM
> Dear All,
>
> I am trying to checkpoint an MPI application which has two
> processes, each running on a separate host.
> 
> I run the application as follows:
> 
> raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.
> 
> and I trigger the checkpoint as follows:
> 
> raj@sun32:~$ ompi-checkpoint -v 30010
> 
> 
> The following happens, displaying two errors while
> checkpointing the application:
> 
> 
> ##############################################
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processorrrrrrrr no 0 of a total of 2 procs on host
> sun32 
> I am processorrrrrrrr no 1 of a total of 2 procs on host
> sun06 
> 
> [sun32:30010] Error: expected_component: PID information
> unavailable!
> [sun32:30010] Error: expected_component: Component Name
> information unavailable!
> 
> I am processssssssssssor no 1 of a total of 2 procs on host
> sun06
> I am processssssssssssor no 0 of a total of 2 procs on host
> sun32
> bye 
> bye 
> ############################################
> 
> 
> 
> 
> When I try to restart the application from the checkpoint
> file, I get the following:
> 
> raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid
> because either you have not provided a filename
>        or provided an invalid filename.
>        Please see --help for usage.
> 
> --------------------------------------------------------------------------
> I am processssssssssssor no 0 of a total of 2 procs on host
> sun32
> bye 
> 
> 
> I would very much appreciate it if you could give me some
> ideas on how to checkpoint and restart an MPI application
> running on multiple hosts.
> 
> Thank you
> 
> Regards,
> 
> Raj
> 
> 
>       
> 



