Dear users,

Our cluster has a number of nodes that are likely to crash, so
calculations often stop because a single node goes down.
Do you know whether it is possible to exclude crashed nodes at run time
when running with OpenMPI? I am asking whether such behavior can be
programmed in principle. Does OpenMPI allow this kind of dynamic checking?
The scheme I have in mind is the following:

1. A code starts its tasks via mpirun on several nodes
2. At some point one node goes down
3. The code detects that the node is down (its results are lost) and
excludes it from the list of nodes to run its tasks on
4. Later, the user restarts the crashed node
5. The code notices that the node is up again and puts it back on the list
of active nodes
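
To make the scheme concrete, here is a minimal sketch in plain Python of the
bookkeeping I have in mind for steps 3 and 5. It does not use OpenMPI at all:
NodePool and is_alive are hypothetical names of my own, and how to detect a
dead node from inside a running MPI job is exactly what I am asking about.

```python
class NodePool:
    """Tracks usable nodes. Hypothetical helper, not an OpenMPI API."""

    def __init__(self, nodes):
        self.active = list(nodes)   # nodes tasks may be scheduled on
        self.excluded = []          # nodes currently believed to be down

    def refresh(self, is_alive):
        """Re-check every node with the caller-supplied is_alive(node) probe."""
        # Decide both moves before changing either list, so each node
        # is probed once per refresh.
        went_down = [n for n in self.active if not is_alive(n)]
        came_back = [n for n in self.excluded if is_alive(n)]

        # Step 3: exclude nodes that no longer respond.
        for node in went_down:
            self.active.remove(node)
            self.excluded.append(node)

        # Step 5: put restarted nodes back on the list of active nodes.
        for node in came_back:
            self.excluded.remove(node)
            self.active.append(node)
```

The code would call refresh() before handing out each batch of tasks and
would rerun any task whose node was excluded, since its results are lost.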


Regards,
Andrei
