Hi everyone,

Happy New Year 2010. A few weeks ago I posted a query (please see the email below) about checkpointing applications running on multiple hosts. I am still struggling to find a solution and would really appreciate it if someone could help me.

Thank you.

Raj

--- On Sat, 12/12/09, Kritiraj Sajadah <ksaja...@yahoo.com> wrote:

> From: Kritiraj Sajadah <ksaja...@yahoo.com>
> Subject: Problem with checkpointing multihosts, multiprocesses MPI application
> To: us...@open-mpi.org
> Date: Saturday, December 12, 2009, 3:03 PM
>
> Dear All,
>
> I am trying to checkpoint an MPI application which has two processes, each running on a separate host.
>
> I run the application as follows:
>
> raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.
>
> and I trigger the checkpoint as follows:
>
> raj@sun32:~$ ompi-checkpoint -v 30010
>
> The following happens, displaying two errors while checkpointing the application:
>
> ##############################################
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
> I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
>
> [sun32:30010] Error: expected_component: PID information unavailable!
> [sun32:30010] Error: expected_component: Component Name information unavailable!
>
> I am processssssssssssor no 1 of a total of 2 procs on host sun06
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye
> bye
> ############################################
>
> When I try to restart the application from the checkpointed file, I get the following:
>
> raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
>        or provided an invalid filename.
>        Please see --help for usage.
>
> --------------------------------------------------------------------------
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye
>
> I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.
>
> Thank you
>
> Regards,
>
> Raj