Gregor,

Thanks for the bug report. I saw a problem similar to this a few months ago (documented in the ticket below).
  https://svn.open-mpi.org/trac/ompi/ticket/1527
Though we fixed the accounting information, the patch I had for orte-restart to switch it away from using --hostfile and instead using --default-hostfile was never applied to the trunk (my fault here). The patch is attached if you want to apply it to make sure it fixes the problem for you.

I have committed the patch to the development trunk (r20305) and asked that it be brought over to the v1.3 branch so that it will be included in the v1.3.1 release. If you want to track its progress, you can do so using the ticket below.
  https://svn.open-mpi.org/trac/ompi/ticket/1761

Thanks again,
Josh

Attachment: orte-restart-hostfile.patch

On Jan 20, 2009, at 5:07 AM, Gregor Dschung wrote:

Hey,

I'm trying the newly released Open MPI 1.3 in conjunction with BLCR to
provide the checkpoint/restart feature.

Configured with ./configure --prefix=/usr/local --with-ft=cr
--enable-ft-thread --enable-mpi-threads --with-blcr=/

An MPI job on a single machine (several threads) is checkpointed and
restarted without problems.

The checkpoint of an MPI job across two hosts (Ethernet, TCP) also
completes without warnings or errors (the home directory and the
directory containing the MPI application are shared via NFS). The
restart works too, but all threads are started only on the host where I
enter the ompi-restart command. Even if I add the -hostfile argument to
ompi-restart, only that one host is used.

Does anybody have a hint?

Thanks,
Gregor
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
