Thanks for the reply!

Concerning the mca options for checkpointing:
- Are verbosity options (e.g. crs_base_verbose) limited to the values 0 and 1?
- In priority options (e.g. crs_blcr_priority), do lower numbers indicate higher priority?

By searching the mailing list archives I found two useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php (for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php (for restarting)

Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and then centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-----------------------------
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--------------
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--------------------------------------
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044]     PID 17036
[ic85:17044]     Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]                 Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]                   Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]                   Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]             File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt



OUTPUT of MPIRUN
----------------
----------------------------
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054



Does anyone have an idea of what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the past, so searching around will likely turn up some examples.

-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:

Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By default, it seems that checkpoints are saved in $HOME. However, I would prefer them
to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves checkpoints?


Best regards,

--
Constantinos
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


