After changing some parameters (blcr_checkpoint_cmd), the output becomes:
[aogrd01:08552] crs:blcr: checkpoint(8552, ---)
[aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
[aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
[aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
:(/softs/blcr-0.6.5/bin/cr_checkpoint,
/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
[aogrd01:08552] crs:blcr: thread_callback()
[aogrd01:08552] crs:blcr: thread_callback: Continue.
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
status 2
Checkpoint failed: Bad file descriptor
chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552':
No such file or directory
[aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command
<chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :[256].
crs:blcr chmod: Resource temporarily unavailable
[aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
checkpoint file (ompi_blcr_context.8552 in the directory
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
crs:blcr: checkpoint: Invalid argument
[aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
BLCR does not generate the context file
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid
8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it returns
the same error: Checkpoint failed: Bad file descriptor
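Before re-running the manual command, it may be worth ruling out the trivial causes: that the target PID no longer exists, or that the snapshot directory is missing or not writable. The following is only a sketch of such a preflight check, assuming a POSIX shell; the function name check_ckpt_target is hypothetical and not part of BLCR or Open MPI. It does not by itself explain the "Bad file descriptor" error.

```shell
# Hypothetical preflight before running cr_checkpoint by hand.
# $1 = PID to checkpoint, $2 = context file path given to --file.
check_ckpt_target() {
    pid=$1
    file=$2
    dir=$(dirname "$file")
    status=0
    # kill -0 sends no signal; it only tests that the PID exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid: no such process or no permission"
        status=1
    fi
    # The context file is created inside this directory, so the
    # directory must exist and be writable.
    if [ ! -d "$dir" ]; then
        echo "$dir: snapshot directory does not exist"
        status=1
    elif [ ! -w "$dir" ]; then
        echo "$dir: snapshot directory is not writable"
        status=1
    fi
    return $status
}
```

With the values from the log above, this would be called as: check_ckpt_target 8552 /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552. A silent return with status 0 means only that these two preconditions hold.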
Thanks,
Leonardo Fialho
Leonardo Fialho wrote:
Hi All,
Has anybody experienced this error?
[aogrdini:09070] Global) Receive a command message from [[13242,0],0].
...
[aogrd02:07642] Local) Receive a command message.
...
[aogrd01:07938] Local) Receive a command message.
...
[aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
...
[aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
...
[aogrd01:07941] crs:blcr: checkpoint(7941, ---)
[aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
[aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
[aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
cr_checkpoint --pid 7941 --file
/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
execute :(-1):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
cr_checkpoint --pid 7645 --file
/tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
execute :(-1):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
The application stops here and does not continue execution. It is
using libcr version 0.6.5:
$ lsof -p 7518
/softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
After the orte-checkpoint command, the application process is duplicated on
the nodes, as a child of the original process.
When I run an application with this version and take a checkpoint
manually, I have no problem...
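One thing that could be checked for the "execvp returned -1" lines above is whether the bare name cr_checkpoint resolves on the PATH seen by the application processes, since execvp() looks a bare command name up on PATH. A minimal sketch, assuming a POSIX shell; the helper name resolve_cmd is hypothetical:

```shell
# Hypothetical helper: print what the shell resolves a command name to,
# or report on stderr that it cannot be found and return non-zero.
resolve_cmd() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        printf '%s\n' "$resolved"
    else
        printf '%s: not found on PATH\n' "$1" >&2
        return 1
    fi
}
```

Running resolve_cmd cr_checkpoint in the same environment the application processes inherit on each node (which may differ from an interactive login shell) would show whether the command can be found without a full path.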
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478