(Sorry for the excessive delay in replying)
On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:
Thanks for the reply!
Concerning the mca options for checkpointing:
- Are verbosity options (e.g. crs_base_verbose) limited to the
values 0 and 1?
- In priority options (e.g. crs_blcr_priority), do lower numbers
indicate higher priority?
By searching in the archives of the mailing list I found two
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
(for restarting)
Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:
Excerpt from mca-params.conf:
-----------------------------
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp
COMMANDS used:
--------------
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
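
For a one-off test, the same three parameters can also be passed on the mpirun command line with -mca instead of editing mca-params.conf. The sketch below only prints the resulting command line so it can be reviewed first. The variable names GLOBAL_DIR and LOCAL_DIR and the function name mpirun_cr_cmd are illustrative assumptions, not Open MPI names; the parameter names and values are the ones from the excerpt above.

```shell
#!/bin/sh
# Assumption: GLOBAL_DIR and LOCAL_DIR are illustrative names, not Open MPI settings.
GLOBAL_DIR="${GLOBAL_DIR:-$HOME}"   # where the gathered global snapshot should end up
LOCAL_DIR="${LOCAL_DIR:-/tmp}"      # node-local directory for each process's checkpoint

# Print (not run) the mpirun command with the checkpoint parameters set via -mca.
mpirun_cr_cmd() {
    echo mpirun -n 2 -machinefile machines -am ft-enable-cr \
        -mca snapc_base_store_in_place 0 \
        -mca snapc_base_global_snapshot_dir "$GLOBAL_DIR" \
        -mca crs_base_snapshot_dir "$LOCAL_DIR" \
        a.out
}

mpirun_cr_cmd   # remove the leading "echo" inside the function to actually launch
```

Values given on the command line take precedence over those in mca-params.conf, which makes it easy to switch the global directory between /tmp and $HOME from one run to the next.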
OUTPUT of ompi-checkpoint -v 16753
--------------------------------------
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt
OUTPUT of MPIRUN
----------------
----------------------------
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.
Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85
Will continue attempting to launch the process.
--------------------------------------------------------------------------
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054
This is a warning about creating the global snapshot directory
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It
seems to indicate that the directory existed when the file gather
started.
A couple things to check:
- Did you clean out the /tmp on all of the nodes with any files
starting with "opal" or "ompi"?
- Does the error go away when you set
(snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this
feature has been broken on the trunk)
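
The first check above can be sketched as a small shell helper that lists anything directly under a directory whose name starts with "opal" or "ompi". The function names list_stale_snapshots and check_all_nodes are made up for illustration. The per-node loop assumes password-less ssh and a machinefile with one hostname at the start of each line; adjust it if your setup differs.

```shell
#!/bin/sh
# List entries directly under a directory (default /tmp) whose names start
# with "opal" or "ompi". -mindepth/-maxdepth are GNU/BSD find extensions.
list_stale_snapshots() {
    dir="${1:-/tmp}"
    find "$dir" -mindepth 1 -maxdepth 1 \( -name 'opal*' -o -name 'ompi*' \) -print
}

# Run the same search in /tmp on every host named in a machinefile.
# Assumption: the hostname is the first field of each non-comment line.
check_all_nodes() {
    machinefile="$1"
    while read -r host _; do
        case "$host" in ''|'#'*) continue ;; esac
        echo "== $host"
        # -n keeps ssh from swallowing the rest of the machinefile on stdin
        ssh -n "$host" "find /tmp -mindepth 1 -maxdepth 1 \( -name 'opal*' -o -name 'ompi*' \) -print"
    done < "$machinefile"
}
```

For example, `check_all_nodes machines` prints the leftovers on each node; review the list before deleting anything, since a running job may still own some of those files.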
Let me know what you find. In the next couple days, I'll try to test
the trunk again with this feature to make sure that it is still
working on my test machines.
-- Josh
Does anyone have an idea about what is wrong?
Best regards,
--
Constantinos
Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage
below:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
Additionally this has been addressed on the users mailing list in
the past, so searching around will likely turn up some examples.
-- Josh
On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
Dear all,
I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS
account. By default, it seems that checkpoints are saved in $HOME.
However, I would prefer them to be saved on a local disk
(e.g.: /tmp).
Does anyone know how I can change the location where Open MPI
saves checkpoints?
Best regards,
--
Constantinos
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users