On 01/20/2011 05:50 PM, Olivier SANNIER wrote:
What is the behavior in case a node dies or becomes unreachable?
Your run will be aborted. However, there is checkpoint/restart support for Linux:
http://www.open-mpi.org/faq/?category=ft
As this is a Win32 program, I'll have to take into account that there is only the
abort behavior.
AFAIK yes
So there is no dynamic discovery of nodes available on the network, unless, of
course, I were to write a tool that does it before the actual run is
started.
This is done by a batch system like PBS (TORQUE) or SGE.
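If you do end up writing such a pre-run tool yourself, a minimal sketch could look like the following. It probes each candidate node and writes only the responsive ones into an Open MPI hostfile ("name slots=N" per line). The function names, the node names, and the choice of TCP port 22 as the probe are all assumptions for illustration; use whatever service your nodes actually expose.

```python
import socket


def is_reachable(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 is only a placeholder; pick a port your nodes really listen on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_hostfile(candidates, probe=is_reachable):
    """Return hostfile text listing only the candidates that answer the probe.

    candidates maps a host name to the number of slots to advertise.
    """
    lines = [f"{host} slots={slots}"
             for host, slots in candidates.items() if probe(host)]
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Hypothetical node names; replace with your own machines.
    text = build_hostfile({"node01": 4, "node02": 4})
    with open("hosts.txt", "w") as f:
        f.write(text)
```

The resulting file can then be passed to mpirun with --hostfile. Note that this only checks reachability at launch time; it does nothing if a node dies during the run.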
Is there a monitoring tool that would give me indications of the status and
health of the nodes?
This has nothing to do with MPI. Nagios or Ganglia can do that.
I was more thinking of a tool that would tell me a node is already performing a
task, so that I can avoid having it oversubscribed.
This is also done by a batch system.
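To illustrate the idea only: the bookkeeping that prevents oversubscription amounts to tracking free slots per node and refusing (or queueing) requests that do not fit. The toy class below is a sketch under that assumption, with hypothetical names; it is not how PBS or SGE are implemented.

```python
class SlotTracker:
    """Toy model of per-node slot accounting."""

    def __init__(self, slots_per_node):
        # Map of node name -> number of currently free slots.
        self.free = dict(slots_per_node)

    def allocate(self, needed):
        """Reserve `needed` slots across nodes, or return None if too few are free."""
        if needed > sum(self.free.values()):
            return None  # a real batch system would queue the job instead
        grant = {}
        for node, avail in self.free.items():
            if needed == 0:
                break
            take = min(avail, needed)
            if take:
                grant[node] = take
                needed -= take
        # Commit the reservation only after the whole request is satisfied.
        for node, take in grant.items():
            self.free[node] -= take
        return grant

    def release(self, grant):
        """Give back the slots of a finished job."""
        for node, take in grant.items():
            self.free[node] += take
```

A job that asks for more slots than are free gets nothing rather than being squeezed onto busy nodes, which is the behavior you want from the scheduler.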
I've started looking at Beowulf clusters, and that led me to PBS. Am I right
in assuming that PBS (PBSPro or TORQUE) could be used to do the monitoring and
the load balancing I thought of?
Yes, however the terms "monitoring" and "load balancing" are usually
used in other contexts.