I've recently had the chance to see how Open MPI (as well as other MPI
implementations) behaves when the network fails.
I looked at what happens when a node's network cable is unplugged in the
middle of a job, with Ethernet, Myrinet (GM), and InfiniBand (OpenIB).
With Ethernet and Myrinet, the job more or less pauses until the cable is
reconnected. (I imagine timeouts still apply, but I wasn't patient
enough to wait for them.)
With InfiniBand, the job pauses and Open MPI throws a few error messages.
After the cable is plugged back in (and the subnet manager catches up), the
job stays stuck where it was when it paused. I'd guess that part of this is
that the timeout is much shorter with IB than with Myrinet or Ethernet, so
when I unplug the IB cable, the connection times out fairly quickly (and
then Open MPI throws its error messages).
At any rate, a thought occurs (and it may just be my ignorance of MPI):
After a network connection times out (as was apparently the case with IB),
is the job salvageable? If the job is not salvageable, why didn't Open MPI
abort it (and clean up the running processes on the nodes)?
--
Troy Telford
- [OMPI users] Fault Tolerance & Behavior Troy Telford