I've recently had the chance to see how Open MPI (as well as other MPIs) behaves in the case of a network failure.

I've looked at what happens when a node's network cable is disconnected in the middle of a job, with Ethernet, Myrinet (GM), and InfiniBand (OpenIB).

With Ethernet and Myrinet, the job more or less pauses until the cable is reconnected. (I imagine timeouts still apply, but I wasn't patient enough to wait for them.)

With InfiniBand, the job pauses and Open MPI throws a few error messages. After the cable is plugged back in (and the subnet manager catches up), the job remains where it was when it paused. My guess is that the timeout is much shorter with IB than with Myrinet or Ethernet, so when I unplug the IB cable it times out fairly quickly (and Open MPI then throws its error messages).

At any rate, a question occurs to me (and it may just be my ignorance of MPI): after a network connection times out (as was apparently the case with IB), is the job salvageable? If it is not, why didn't Open MPI abort the job (and clean up the running processes on the nodes)?
--
Troy Telford
