After changing some parameters (blcr_checkpoint_cmd), I now get:

[aogrd01:08552] crs:blcr: checkpoint(8552, ---)
[aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
[aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
[aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec :(/softs/blcr-0.6.5/bin/cr_checkpoint, /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
[aogrd01:08552] crs:blcr: thread_callback()
[aogrd01:08552] crs:blcr: thread_callback: Continue.
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with status 2
Checkpoint failed: Bad file descriptor
chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552': No such file or directory
[aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command <chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :[256].
crs:blcr chmod: Resource temporarily unavailable
[aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the checkpoint file (ompi_blcr_context.8552 in the directory (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
crs:blcr: checkpoint: Invalid argument
[aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256

BLCR doesn't generate the context file (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I run the checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it returns the same error: Checkpoint failed: Bad file descriptor
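
Before re-running the manual command, it may help to rule out the obvious preconditions. The following is only a minimal sketch, not part of BLCR or Open MPI: the helper name `preflight` is hypothetical, and it assumes a Linux host. It checks that the snapshot directory exists and is writable and that the target PID is still alive. It does not explain the "Bad file descriptor" error by itself.

```python
import os

def preflight(pid, context_path):
    """Return a list of obvious problems; an empty list means none were found.

    Hypothetical helper for illustration only, not a BLCR or Open MPI API.
    """
    problems = []
    directory = os.path.dirname(context_path) or "."
    if not os.path.isdir(directory):
        problems.append(f"snapshot directory missing: {directory}")
    elif not os.access(directory, os.W_OK | os.X_OK):
        problems.append(f"snapshot directory not writable: {directory}")
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID
    except ProcessLookupError:
        problems.append(f"no such process: {pid}")
    except PermissionError:
        problems.append(f"process {pid} exists but belongs to another user")
    return problems

if __name__ == "__main__":
    # Values taken from the log above; adjust to the run being debugged.
    for line in preflight(8552, "/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552"):
        print(line)
```

If this prints nothing, the directory and the process look fine and the failure is more likely inside the checkpoint itself.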

Thanks,
Leonardo Fialho

Leonardo Fialho wrote:
Hi All,

Has anybody else run into this error?

[aogrdini:09070] Global) Receive a command message from [[13242,0],0].
...
[aogrd02:07642] Local) Receive a command message.
...
[aogrd01:07938] Local) Receive a command message.
...
[aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
...
[aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
...
[aogrd01:07941] crs:blcr: checkpoint(7941, ---)
[aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
[aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
[aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7941 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7645 --file /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07642] Local)   Location:        [/tmp/opal_snapshot_1.ckpt]

The application stops here and doesn't continue executing. It's using libcr version 0.6.5:
$ lsof -p 7518
/softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1

After the orte-checkpoint command, the application process is duplicated on the nodes, as a child of the original process. When I run an application with this version and take a checkpoint manually, I have no problem...
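
The "execvp returned -1" lines in the log suggest that the bare name cr_checkpoint could not be executed from the environment of the application processes, most commonly because it is not on their PATH. A minimal sketch for checking this on each node follows. The function names `resolve_command` and `describe` are hypothetical, and the check assumes it runs with the same PATH the MPI processes see, which may differ from the login shell.

```python
import shutil

def resolve_command(name, path=None):
    """Return the absolute path a PATH search would pick for name, or None.

    path defaults to the PATH of the current environment.
    """
    return shutil.which(name, path=path)

def describe(name, path=None):
    """Summarize in one line whether name can be resolved on the search path."""
    found = resolve_command(name, path)
    if found is None:
        return f"{name}: not found on the search path, execvp would likely fail"
    return f"{name}: resolves to {found}"

if __name__ == "__main__":
    print(describe("cr_checkpoint"))
```

If the command is not found, adding the BLCR bin directory to PATH on every node, or pointing the checkpoint command at the absolute path, should at least get past the exec step.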

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
