Gleb Natapov wrote:

On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <gl...@voltaire.com> wrote:

If you use the OB1 PML (the default one), it will never recover from a
link down error, no matter how many other transports you have. The
reason is that OB1 never tracks what happens to buffers submitted to a
BTL. So if a BTL can't, for any reason, transmit a packet passed to it
by OB1, the job will get stuck, because OB1 doesn't have enough
information to try to resend the packet via another transport. The DR
PML exists for this kind of resource tracking. In the case of the IB
BTL, a link down event generates an error for each packet submitted for
sending to the device. The IB BTL simply discards all these packets and
relies on the PML to resend them, so even after a link up event a job
will not recover if the OB1 PML is used with the IB BTL. This may be
different with other transports.
This makes sense; one thing I'm wondering now is whether the OB1 PML is able (and/or has enough information) to figure out that it can't continue at all, and will then abort the job.
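
The difference between a PML that forgets a buffer once it is handed to a BTL and one that keeps tracking it can be illustrated with a toy model. This is only a minimal sketch and not Open MPI code: the `Transport` class and the `send_untracked` / `send_tracked` functions are hypothetical names invented for the illustration, and real PML/BTL interaction is far more involved.

```python
class Transport:
    """Hypothetical stand-in for a BTL: a named link that may be down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.delivered = []

    def send(self, packet):
        # A down link reports failure and discards the packet.
        if not self.up:
            return False
        self.delivered.append(packet)
        return True


def send_untracked(transports, packet):
    """OB1-like behaviour as described above: hand off and forget."""
    transports[0].send(packet)  # result ignored, nothing kept for a resend


def send_tracked(transports, packet):
    """DR-like behaviour (assumed): keep the packet until a transport accepts it."""
    for transport in transports:
        if transport.send(packet):
            return transport.name
    return None  # nothing could deliver; the caller could abort here


ib = Transport("ib", up=False)   # link is down
tcp = Transport("tcp")           # alternative transport is healthy
send_untracked([ib, tcp], "p1")  # p1 is silently lost
chosen = send_tracked([ib, tcp], "p2")
print(tcp.delivered, chosen)
```

In this model the untracked packet simply disappears when the first link is down, while the tracked one falls through to the second transport, and a `None` result gives the sender a point at which it could decide to abort.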

In the case of the openib BTL I don't see how a job could recover from a
link down event, so I think aborting the job is the right thing to do.
For other transports, if the transport can continue after a link up
event as if nothing happened, it is worth waiting for link up. This
capability could be added to the openib BTL too; it's just that nobody
cares enough.
Ethernet doesn't fail in this case because the TCP stack handles this gracefully. The same behavior can be observed when disconnecting an ethernet cable while an ssh session exists: plug it back in and you are probably good to go after a bit of time (due to exponential backoff on retransmission). For GM/MX over Myrinet the timeout on connection down is quite high, and the software stack handles this gracefully.

For IB the link state transitions from LinkActive to LinkActDefer until LinkDownTimeout expires, and the link then transitions to the LinkDown state. From the spec: LinkDownTimeout occurs when the port state machine has continuously been in the LinkActDefer state for 10ms +3%/-51%. I have no idea what that formula means; perhaps my pdf of the spec is messed up.
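
The "bit of time" in the TCP case comes from the retransmission timer doubling after each failed attempt. A rough sketch of how the retry times spread out is below. The function name `retransmit_schedule` and the values used (1 second initial timeout, 60 second cap) are assumptions picked for illustration; actual values depend on the operating system and the measured round-trip time.

```python
def retransmit_schedule(initial_rto, max_rto, attempts):
    """Return the elapsed time (seconds) at which each retransmission fires.

    Hypothetical helper: the timeout doubles after every failed attempt
    and is capped at max_rto.
    """
    elapsed = 0.0
    rto = initial_rto
    times = []
    for _ in range(attempts):
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)  # exponential backoff with a ceiling
    return times


# Assumed values: 1 s initial timeout, 60 s cap, 5 retransmissions.
print(retransmit_schedule(1.0, 60.0, 5))
```

With these assumed numbers, a cable that is unplugged for 10 seconds is not noticed as working again until the next retry fires at 15 seconds, which matches the delay seen in an ssh session after replugging.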

So the transition to the LinkDown state is dictated by the IB spec. It would seem that we would want to defer the transition based on a user-configurable parameter. Since this is the link layer, it would probably be necessary to set this when loading the IB driver. Am I interpreting this correctly?

- Galen


--
                        Gleb.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
