Hi All,

Has anybody experienced this error?

[aogrdini:09070] Global) Receive a command message from [[13242,0],0].
...
[aogrd02:07642] Local) Receive a command message.
...
[aogrd01:07938] Local) Receive a command message.
...
[aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
...
[aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
...
[aogrd01:07941] crs:blcr: checkpoint(7941, ---)
[aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
[aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
[aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7941 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941): [aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7645 --file /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645): [aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07642] Local)   Location:        [/tmp/opal_snapshot_1.ckpt]

The application stops here and does not continue execution. It is using libcr version 0.6.5:
$ lsof -p 7518
/softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1

After the orte-checkpoint command, the application process is duplicated on the nodes, as a child of the original process. When I run an application with this version and take a checkpoint manually, I have no problem...
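
Since execvp searches PATH for the program name, one thing that can be checked on each node is whether cr_checkpoint is resolvable in the environment the MPI processes actually get. Below is a minimal sketch of such a check. It assumes the failure is a PATH lookup problem, which the log does not confirm because no errno is printed. The helper name resolve_like_execvp is hypothetical and is not part of Open MPI or BLCR.

```python
import os
import shutil


def resolve_like_execvp(program, env=None):
    """Return the full path a PATH search would find for `program`, or None.

    `env` is the environment to inspect (defaults to the current one).
    This only approximates the lookup that execvp() performs.
    """
    env = os.environ if env is None else env
    # Fall back to the system default search path when PATH is unset.
    search_path = env.get("PATH", os.defpath)
    return shutil.which(program, path=search_path)


if __name__ == "__main__":
    # Print the PATH seen by this process and where cr_checkpoint resolves.
    print("PATH =", os.environ.get("PATH"))
    print("cr_checkpoint ->", resolve_like_execvp("cr_checkpoint"))
```

Running this script through mpirun on the compute nodes and again from an interactive shell would show whether the two environments differ. If the lookup returns None only under mpirun, the BLCR bin directory is probably missing from the PATH of the launched processes.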

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos
Phone: +34-93-581-2888
Fax: +34-93-581-2478
