(Sorry for the delay in replying, more below)
On Apr 8, 2010, at 1:34 PM, Fernando Lemos wrote:
Hello,
I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?
Suppose you have a cluster without a shared file system. When one node
fails, you transfer its checkpoint to a spare node and invoke
ompi-restart. In 1.5, ompi-restart automagically handles this
situation (if you supply a hostfile) and is able to restart the
process, but I'm afraid it might not always be able to find the
checkpoints this way. If you could specify to ompi-restart where the
ranks are (and thus where the checkpoints are), then maybe restart
would always work (as long as you've specified the location of
the checkpoints correctly), or maybe ompi-restart would be faster.
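
For illustration only, here is a small Python sketch of how a rank-to-host
mapping like the one described above could be written out in Open MPI's
rankfile syntax ("rank N=host slot=S"). The host names, the slot numbers and
the build_rankfile helper are hypothetical; nothing here is part of
ompi-restart, which does not accept --rankfile at this point.

```python
def build_rankfile(placement):
    """Render a rank -> (host, slot) mapping as rankfile text.

    `placement` is a hypothetical dict such as {0: ("node01", 0)}.
    Each output line uses the "rank N=host slot=S" form.
    """
    lines = []
    for rank in sorted(placement):
        host, slot = placement[rank]
        lines.append(f"rank {rank}={host} slot={slot}")
    return "\n".join(lines) + "\n"


# Hypothetical layout: rank 2 was moved to a spare node after a failure,
# so its checkpoint is assumed to have been copied to spare01.
placement = {0: ("node01", 0), 1: ("node02", 0), 2: ("spare01", 0)}
rankfile_text = build_rankfile(placement)
print(rankfile_text, end="")
```

The resulting text could be saved to a file and passed to mpirun with
--rankfile; whether ompi-restart would use it to locate checkpoints is
exactly the open question.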
We can easily add the --rankfile option to ompi-restart. I filed a
ticket to add this option and to assess whether there are other options
that we should pass along (e.g., --npernode, --byhost). I should be able
to fix this in the next week or so, but the ticket is linked below so you
can follow the progress.
https://svn.open-mpi.org/trac/ompi/ticket/2413
Most of the ompi-restart parameters are passed directly to the mpirun
command. ompi-restart is mostly a wrapper around mpirun that is able
to parse the metadata and create the appcontext file. I wonder if a
more general parameter like '--mpirun-args ...' might make sense so
users don't have to wait on me to expose the interface they need.
Dunno. What do you think? Should I create a '--mpirun-args' option,
duplicate all of the mpirun command line parameters, or do some
combination of the two?
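
To make the '--mpirun-args' idea a bit more concrete, here is a rough Python
sketch of the pass-through approach. It is not how ompi-restart is actually
implemented: the build_mpirun_command helper, the file names and the exact
way the appfile is handed to mpirun (via --app) are assumptions for
illustration, and '--mpirun-args' itself is only a proposal.

```python
import shlex


def build_mpirun_command(appfile, hostfile=None, mpirun_args=""):
    """Assemble an mpirun command line for a restart (illustrative only).

    appfile     -- path to the generated appcontext file (assumed name)
    hostfile    -- option the wrapper already exposes, forwarded as-is
    mpirun_args -- free-form string from the proposed '--mpirun-args'
                   option, split shell-style and forwarded untouched
    """
    cmd = ["mpirun"]
    if hostfile is not None:
        cmd += ["--hostfile", hostfile]
    # Anything the wrapper does not know about is passed straight through.
    cmd += shlex.split(mpirun_args)
    cmd += ["--app", appfile]
    return cmd


cmd = build_mpirun_command(
    "restart.appfile",
    hostfile="hosts.txt",
    mpirun_args="--rankfile ranks.txt",
)
print(" ".join(cmd))
```

The trade-off is the usual one: a pass-through string needs no changes when
mpirun grows new options, but the wrapper cannot validate what it forwards.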
-- Josh
Regards,