Thanks for the reply!
Concerning the mca options for checkpointing:
- are verbosity options (e.g. crs_base_verbose) limited to the values 0 and 1?
- in priority options (e.g. crs_blcr_priority), do lower numbers
indicate higher priority?
By searching in the archives of the mailing list I found two
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
(for restarting)
Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and then centralize the resulting
checkpoints in /tmp or $HOME:
Excerpt from mca-params.conf:
-----------------------------
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp
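
For reference, here is a minimal sketch of how the excerpt above could be generated for one of the two variants I tried (global directory set to $HOME). The three parameter names are the ones from the excerpt; the shell variables GLOBAL_DIR, LOCAL_DIR and PARAM_FILE are only illustrative names, not Open MPI settings. The file is written to the current directory so that nothing existing is overwritten; as far as I know it would have to be copied to $HOME/.openmpi/mca-params.conf to be picked up.

```shell
#!/bin/sh
# Sketch only: write the checkpoint-related MCA parameters to a scratch file.
# GLOBAL_DIR, LOCAL_DIR and PARAM_FILE are hypothetical helper variables.
GLOBAL_DIR="${GLOBAL_DIR:-$HOME}"        # where the global snapshot is gathered
LOCAL_DIR=/tmp                           # node-local directory used by each process
PARAM_FILE=./mca-params.conf.sketch      # scratch copy, not the live config file

# One value per parameter: "/tmp or $HOME" in the excerpt means two separate runs.
cat > "$PARAM_FILE" <<EOF
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=$GLOBAL_DIR
crs_base_snapshot_dir=$LOCAL_DIR
EOF

# Show what was written so it can be compared with the excerpt.
cat "$PARAM_FILE"
```

Running it with GLOBAL_DIR=/tmp produces the other variant, in which the global and the local snapshot directories are the same.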
COMMANDS used:
--------------
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
OUTPUT of ompi-checkpoint -v 16753
--------------------------------------
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt
OUTPUT of MPIRUN
----------------
----------------------------
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.
Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85
Will continue attempting to launch the process.
--------------------------------------------------------------------------
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054
Does anyone have an idea about what is wrong?
Best regards,
--
Constantinos
Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage below:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
Additionally this has been addressed on the users mailing list in the
past, so searching around will likely turn up some examples.
-- Josh
On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
Dear all,
I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account.
By default, it seems that checkpoints are saved in $HOME. However, I
would prefer them to be saved on a local disk (e.g. /tmp).
Does anyone know how I can change the location where Open MPI saves
checkpoints?
Best regards,
--
Constantinos
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users