So while working on the error message, I noticed that the global
coordinator was using the wrong path to investigate the checkpoint
metadata. This particular section of code is not often used (which is
probably why I could not reproduce it). I just committed a fix to the
Open MPI development trunk:
https://svn.open-mpi.org/trac/ompi/changeset/22479
Additionally, I am asking for this to be brought over to the v1.4 and
v1.5 release branches:
https://svn.open-mpi.org/trac/ompi/ticket/2195
https://svn.open-mpi.org/trac/ompi/ticket/2196
It seems to solve the problem as far as I could reproduce it. Can you
try the trunk (either an SVN checkout or tonight's nightly tarball) and
check whether this solves your problem?
Cheers,
Josh
On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:
I am not able to reproduce this problem with the 1.4 branch using a
hostfile and node configuration like the ones you mentioned.
I suspect that the error is caused by a failed local checkpoint. The
error message is triggered when the global coordinator (located in
'mpirun') tries to read the metadata written by the application in
the local snapshot. If the global coordinator cannot properly read
the metadata, then it will print a variety of error messages
depending on what is going wrong.
If these are the only two errors produced, then this typically means
that the local metadata file has been found, but is empty/corrupted.
Can you send me the contents of the local checkpoint metadata file:
shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
It should look something like:
---------------------------------
#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
---------------------------------
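As a rough way to check such a file without reading it by eye, here is a minimal Python sketch. It assumes the file consists of "# Key: value" lines like the sample above, and that PID and Component are the fields whose absence matches the two error messages. The function names parse_snapshot_meta and missing_fields are made up for this illustration and are not part of Open MPI.

```python
def parse_snapshot_meta(text):
    """Collect '# Key: value' lines into a dict; all other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue
        key, sep, value = line[1:].partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields, required=("PID", "Component")):
    """Return the required keys that are absent or have an empty value.

    The 'required' tuple is an assumption based on the two error messages,
    not something taken from the Open MPI sources.
    """
    return [name for name in required if not fields.get(name)]


sample = """#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
"""
print(missing_fields(parse_snapshot_meta(sample)))  # nothing missing
print(missing_fields(parse_snapshot_meta("")))      # both fields missing
```

An empty or truncated file shows up as a non-empty list of missing fields, which is the situation described above.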
It may also help to see the following metadata file as well:
shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/global_snapshot_meta.data
If there are other errors printed by the process, that would
potentially indicate a different problem. So if there are, let me
know.
This error message should be a bit more specific about which process
checkpoint is causing the problem, and what this usually indicates.
I filed a bug to clean up the error:
https://svn.open-mpi.org/trac/ompi/ticket/2190
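Until the message names the offending process, something like the following Python sketch may help narrow down which local snapshot is at fault. It assumes that every process has its own opal_snapshot_N.ckpt directory next to opal_snapshot_0.ckpt, each holding a snapshot_meta.data file. Only the rank 0 path appears in this thread, so the layout for other ranks is a guess, and classify_local_snapshots is a hypothetical helper, not an Open MPI tool.

```python
from pathlib import Path


def classify_local_snapshots(snapshot_subdir):
    """Map each opal_snapshot_*.ckpt directory to the state of its metadata.

    snapshot_subdir is the directory that holds the local snapshots, e.g.
    GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0
    """
    statuses = {}
    for local_dir in sorted(Path(snapshot_subdir).glob("opal_snapshot_*.ckpt")):
        meta = local_dir / "snapshot_meta.data"
        if not meta.is_file():
            statuses[local_dir.name] = "missing"   # no metadata file at all
        elif meta.stat().st_size == 0:
            statuses[local_dir.name] = "empty"     # file exists but has no content
        else:
            statuses[local_dir.name] = "present"   # content still needs a manual look
    return statuses
```

Calling it on the directory that contains the opal_snapshot_*.ckpt entries returns a dict keyed by directory name; any entry marked "missing" or "empty" points at a local checkpoint worth inspecting first.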
-- Josh
On Jan 21, 2010, at 8:27 AM, Jean Potsam wrote:
Hi Josh/all,
I have upgraded Open MPI to v1.4 but still get the same error
when I try executing the application on multiple nodes:
*******************
Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!
*******************
I am running my application from the node 'portal11' as follows:
mpirun -am ft-enable-cr -np 2 --hostfile hosts myapp.
The file 'hosts' contains two host names: portal10, portal11.
I am triggering the checkpoint using ompi-checkpoint -v 'PID' from
portal11.
I configured Open MPI as follows:
#####################
./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug \
    --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace \
    --enable-binaries --enable-trace --enable-static=yes --enable-debug \
    --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr \
    --enable-ft-thread --with-blcr=/usr/local/blcr/ \
    --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
#########################
Question:
What do you think could be wrong? Please advise me on how to
resolve this problem.
Thank you
Jean
--- On Mon, 11/1/10, Josh Hursey <jjhur...@open-mpi.org> wrote:
From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] checkpointing multi node and multi process applications
To: "Open MPI Users" <us...@open-mpi.org>
Date: Monday, 11 January, 2010, 21:42
On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:
> Hi Everyone,
> I am trying to checkpoint an MPI application running on multiple
> nodes. However, I get some error messages when I trigger the
> checkpointing process.
>
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
>
> I am using open mpi 1.3 and blcr 0.8.1
Can you try the v1.4 release and see if the problem persists?
>
> I execute my application as follows:
>
> mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.
>
> My question:
>
> Does Open MPI with BLCR support checkpointing a multi-node
> execution of an MPI application? If so, can you provide me with some
> information on how to achieve this?
Open MPI is able to checkpoint a multi-node application (that's
what it was designed to do). There are some examples at the link
below:
http://www.osl.iu.edu/research/ft/ompi-cr/examples.php
-- Josh
>
> Cheers,
>
> Jean.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users