Gregor,

Thanks for the bug report. I saw a similar problem a few months ago (documented in the ticket below).
https://svn.open-mpi.org/trac/ompi/ticket/1527

Though we fixed the accounting information, the patch I had for orte-restart, which switches it from using --hostfile to using --default-hostfile, was never applied to the trunk (my fault here). The patch is attached if you want to apply it to make sure it fixes the problem for you.
I have committed the patch to the development trunk (r20305) and asked that it be brought over to the v1.3 branch so that it will be included in the v1.3.1 release. If you want to track its progress, you can do so using the ticket below.
https://svn.open-mpi.org/trac/ompi/ticket/1761

Thanks again,
Josh
Attachment: orte-restart-hostfile.patch (binary data)
On Jan 20, 2009, at 5:07 AM, Gregor Dschung wrote:
Hey,

I'm trying the newly released Open MPI 1.3 in conjunction with BLCR to provide the checkpoint/restart feature. Configured with

  ./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/

An MPI job on a single machine (several threads) is checkpointed and restarted without problems. The checkpoint of an MPI job across two hosts (Ethernet, TCP) also completes without warnings or errors (the home directory and the directory containing the MPI application are shared over NFS). The restart works too, but all threads are started only on the host where I enter the ompi-restart command. Even if I add the -hostfile argument to ompi-restart, only that one host is used.

Does anybody have a hint?

Thanks,
Gregor

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users