Hi all,

Does Open MPI have any way to detect that a node has crashed, rebooted, or otherwise disappeared? I am trying to run my MPI application on Amazon EC2 spot instances. Since a spot instance can be terminated at any time, I would like the application to detect the node failure, perhaps remove the node from the machine file, and restart automatically.

Right now, when one of the worker nodes is rebooted or terminated, the master that is waiting on that node's results simply hangs, waiting for results that will never come.
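
In case it helps to see what I mean by removing the node from the machine file, here is a minimal sketch of that one step. It assumes a machine file with one host per line, optionally followed by something like `slots=N`. The function name `remove_host` and the host names are hypothetical, made up for illustration. Detecting the failure in the first place is the part I do not know how to do.

```python
def remove_host(hostfile_text, dead_host):
    """Return the machine file contents without the entry for dead_host.

    Assumes one host per line, with the host name as the first
    whitespace-separated field (for example "worker1 slots=4").
    """
    kept = []
    for line in hostfile_text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comment lines untouched.
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        # Drop the line only if its host name matches exactly.
        if stripped.split()[0] != dead_host:
            kept.append(line)
    return "".join(entry + "\n" for entry in kept)


# Hypothetical machine file with one master and two workers.
hosts = "master slots=1\nworker1 slots=4\nworker2 slots=4\n"
print(remove_host(hosts, "worker1"), end="")
```

A wrapper script could then write the result back to disk and launch the job again with the shorter machine file, once it knows which host went away.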
Thanks,
Claire