Hi all,

I was wondering whether Open MPI has any way to detect that a node has crashed,
rebooted, etc. I am currently trying to run my MPI application on Amazon EC2
spot instances, and since spot instances can be terminated at any time, I would
like my application to detect the node failure, perhaps remove the node from
the machine file, and restart automatically. Right now, when one of the worker
nodes is rebooted or terminated, the master that is waiting on that node's
results just hangs, waiting for results that will never come.
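
To make the machine-file step concrete, here is a rough sketch of what I have in mind, written as a small Python helper that would run outside the MPI job. It assumes a machine file with one host per line, the hostname as the first token (for example `node1 slots=4`), and `#` comments. The name `remove_host` is just a placeholder of mine, not anything provided by Open MPI, and detecting the failure in the first place is the part I do not know how to do.

```python
def remove_host(hostfile_text: str, failed_host: str) -> str:
    """Return the machine file contents without entries for failed_host.

    Assumes one host per line with the hostname as the first token;
    comment-only and blank lines are kept unchanged.
    """
    kept = []
    for line in hostfile_text.splitlines():
        # Ignore anything after '#', then split the rest into fields.
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == failed_host:
            continue  # drop the entry for the failed node
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

A wrapper script could then write the result back to the machine file and launch mpirun again with the remaining nodes.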

Thanks,

Claire  
