Josh,
Last night I made some changes, checked out a new SVN version, and
completely revised the BLCR installation. It's working fine now. I
suspect two different things:
1) cached or old files (configured with the older BLCR version path) in
autom4te, configure, or the dependencies;
2) some misconfiguration in the BLCR header files.
When I checkpoint/restart a non-MPI application, the application
probably uses the correct libraries, but the BLCR module was probably
compiled with older headers (cache?).
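As an illustration of the first suspicion, here is a minimal sketch of how one might look for leftover autotools state that still points at an older BLCR installation. It is not part of Open MPI or BLCR; the function name, the list of candidate file names, and the `old_prefix` argument are assumptions made for the example, and the real old path is not given in this thread.

```python
import os

# Generated files that autotools builds commonly leave behind and that can
# embed absolute paths. This list is an assumption for the sketch.
CANDIDATE_FILES = {"config.log", "config.status", "config.cache", "Makefile"}


def find_stale_references(build_root, old_prefix):
    """Return (cache_dirs, files) under build_root that may hold stale state.

    cache_dirs: autom4te.cache directories found in the tree.
    files: candidate files whose text still mentions old_prefix.
    """
    cache_dirs, files = [], []
    for dirpath, dirnames, filenames in os.walk(build_root):
        if "autom4te.cache" in dirnames:
            cache_dirs.append(os.path.join(dirpath, "autom4te.cache"))
            dirnames.remove("autom4te.cache")  # do not descend into the cache
        for name in filenames:
            if name not in CANDIDATE_FILES:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as fh:
                    if old_prefix in fh.read():
                        files.append(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    return sorted(cache_dirs), sorted(files)
```

Any hit would only be a hint that the tree should be cleaned and reconfigured; it does not prove that the module was built against the older headers.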
I'm trying to reproduce the error, but before these changes (when it
was not working) BLCR returned the "bad file descriptor" (EBADF) error,
and the blcr module did not catch this error; it only returned (-1)
"child failed".
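To make the point about the lost error concrete, the following is a small sketch of a wrapper that keeps "the command could not be started" separate from "the command ran and failed", and preserves the child's own error text. It is not the Open MPI crs:blcr code; `run_checkpoint_cmd` and the returned dictionary keys are hypothetical names chosen for the example.

```python
import subprocess


def run_checkpoint_cmd(argv):
    """Run argv and report how it ended, keeping the child's error text."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        # The program could not be started at all (not found, not executable).
        return {"outcome": "exec_failed", "detail": str(exc)}
    if proc.returncode != 0:
        # The program started but reported a failure; keep its stderr.
        return {
            "outcome": "command_failed",
            "exit_code": proc.returncode,
            "detail": proc.stderr.strip(),
        }
    return {"outcome": "ok", "exit_code": 0, "detail": ""}
```

Called with the argument list from the log, for example `["/softs/blcr-0.6.5/bin/cr_checkpoint", "--pid", "8552", "--file", "/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552"]`, a wrapper like this would be expected to surface the "Bad file descriptor" message in `detail` instead of a bare -1.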
Thanks,
Leonardo Fialho
Josh Hursey wrote:
I don't think I have ever seen this one before. :(
So you are trying to checkpoint the MPI process by hand or a non-MPI
process? Can you confirm that you can successfully checkpoint/restart
a non-MPI process on these machines? What version of the Open MPI
trunk are you using? Have you made any changes to the trunk to produce
this build?
Can you send me the info described here (off-list is ok):
http://www.open-mpi.org/community/help/
-- Josh
On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:
Changing some parameters (blcr_checkpoint_cmd):
[aogrd01:08552] crs:blcr: checkpoint(8552, ---)
[aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
[aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
[aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec
:(/softs/blcr-0.6.5/bin/cr_checkpoint,
/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file
/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
[aogrd01:08552] crs:blcr: thread_callback()
[aogrd01:08552] crs:blcr: thread_callback: Continue.
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with
status 2
Checkpoint failed: Bad file descriptor
chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552': No such file or directory
[aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command
<chmod u+rwX /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552> :
[256].
crs:blcr chmod: Resource temporarily unavailable
[aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the
checkpoint file (ompi_blcr_context.8552 in the directory
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
crs:blcr: checkpoint: Invalid argument
[aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256
BLCR doesn't generate the context file
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid
8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it
returns the same error: Checkpoint failed: Bad file descriptor
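A side note on reading the log: the value [256] reported for the chmod command looks like a raw wait status of the kind returned by system(), rather than an exit code. That is an assumption, since the thread does not show how the module runs the command. Under that assumption, a short sketch like the following (the helper name is hypothetical) decodes it on Linux:

```python
import os


def describe_wait_status(status):
    """Turn a raw wait status (as returned by system()/wait()) into text."""
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "unrecognised status %d" % status


print(describe_wait_status(256))
```

Read this way, 256 corresponds to exit code 1, which would match chmod failing because the context file was never created.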
Thanks,
Leonardo Fialho
Leonardo Fialho wrote:
Hi All,
Has anybody experienced this error?
[aogrdini:09070] Global) Receive a command message from [[13242,0],0].
...
[aogrd02:07642] Local) Receive a command message.
...
[aogrd01:07938] Local) Receive a command message.
...
[aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
...
[aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
...
[aogrd01:07941] crs:blcr: checkpoint(7941, ---)
[aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
[aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
[aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
cr_checkpoint --pid 7941 --file
/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to
execute :(-1):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint,
cr_checkpoint --pid 7645 --file
/tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to
execute :(-1):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07642] Local) Location: [/tmp/opal_snapshot_1.ckpt]
The application stops here and doesn't continue the execution. It's
using libcr version 0.6.5:
$ lsof -p 7518
/softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1
After the orte-checkpoint command, the application process is
duplicated on the nodes, as a child of the original process.
When I run an application with this version and take a checkpoint
manually, I have no problem...
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users