Dear users,

Our cluster has a number of nodes that are likely to crash, so calculations often stop because a single node goes down. Do you know whether it is possible to exclude crashed nodes at run time when running with OpenMPI? I am asking whether such behavior can be programmed in principle. Does OpenMPI allow this kind of dynamic checking? The scheme I have in mind is the following:
1. A code starts its tasks via mpirun on several nodes.
2. At some point one node goes down.
3. The code detects that the node is down (its results are lost) and removes it from the list of nodes on which it runs tasks.
4. Later, the user restarts the crashed node.
5. The code notices that the node is up again and puts it back on the list of active nodes.

Regards,
Andrei
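
P.S. To make the scheme above concrete, here is a minimal sketch in Python of the bookkeeping I have in mind. It is not OpenMPI code and says nothing about what OpenMPI supports inside a running job: the `NodePool` class and the `is_alive` probe are hypothetical names made up for illustration, and the node failures are simulated with a plain set.

```python
from typing import Callable, Iterable, List, Set


class NodePool:
    """Hypothetical helper: tracks which nodes are currently usable."""

    def __init__(self, nodes: Iterable[str], is_alive: Callable[[str], bool]):
        self.all_nodes: List[str] = list(nodes)
        # is_alive is an assumed health probe (e.g. a ping or ssh check);
        # how to implement it reliably is exactly the open question.
        self.is_alive = is_alive
        self.active: Set[str] = set(self.all_nodes)

    def refresh(self) -> None:
        """Re-probe every node: drop dead ones, re-admit recovered ones."""
        self.active = {n for n in self.all_nodes if self.is_alive(n)}

    def hosts(self) -> List[str]:
        """Active nodes in their original order."""
        return [n for n in self.all_nodes if n in self.active]


if __name__ == "__main__":
    down = {"node2"}  # simulated crash of one node (step 2)
    pool = NodePool(["node1", "node2", "node3"], lambda n: n not in down)

    pool.refresh()       # step 3: the crashed node is excluded
    print(pool.hosts())

    down.clear()         # step 4: the user restarts the node
    pool.refresh()       # step 5: the node is put back on the list
    print(pool.hosts())
```

Outside of a running job, the result of `hosts()` could be joined with commas and passed to mpirun's `--host` option on the next launch. What I do not know is whether something equivalent can happen while the job keeps running.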