On 09/14/2012 07:44 AM, Doug Hughes wrote:
On 9/14/2012 1:17 PM, Paul Graydon wrote:
On 09/14/2012 05:37 AM, Brent Chapman wrote:
Aggregated network links involving multiple parallel circuits
generally use some sort of hash on the 5-tuple (src ip, dest ip,
protocol, src port, dest port) so that packets for a given TCP or
UDP session all get sent down the same parallel circuit; this is an
easy way to help ensure that the packets don't get re-ordered, which
many protocols are sensitive to. However, if the particular "same
parallel circuit" that they get sent down is broken, as appears to
have been the case here, you can wind up with behavior like what you
saw: certain sessions (that happen to get hashed down those broken
circuits) break horribly, while others (that get hashed down
non-broken circuits) are just fine.
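As a rough illustration (real switches use vendor-specific hardware hash
functions; the link names and the SHA-256 choice below are made up for
the example), the selection logic amounts to something like this Python
sketch:

    # Rough sketch of hash-based member selection; not any vendor's
    # actual algorithm. The 5-tuple hashes to an index, so every
    # packet of a given flow lands on the same member link.
    import hashlib

    LINKS = ["eth0", "eth1", "eth2", "eth3"]  # hypothetical 4-way LAG

    def pick_link(src_ip, dst_ip, proto, src_port, dst_port):
        key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
        return LINKS[index % len(LINKS)]

    # Two flows between the same hosts can take different members:
    print(pick_link("10.0.0.1", "10.0.0.2", "tcp", 51234, 443))
    print(pick_link("10.0.0.1", "10.0.0.2", "tcp", 51235, 443))

Every packet of a given flow hashes to the same member link, which is
exactly why one silently broken member hurts some sessions and not
others.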
I've never really delved into the networking aspects of aggregation;
it's never been something I've had any need to utilise, so forgive me
if these are stupid questions.
In circumstances where a port goes down entirely, would the link
aggregation generally be fine? The system would presumably be smart
enough to identify that the port isn't working and stop routing traffic
that way? I'm assuming the failure in this case is that the packet
loss was too slight to trip the aggregation's failure detection, but
disruptive enough to mess things up?
Yes, link aggregation provides redundancy/failover and increased
bandwidth all in one. It's active-active. Worst case, it provides
redundancy against physical link issues (a cable cut or other failure);
best case, with chassis-based or stacked switches, you can have
complete redundancy against even switch or linecard failures. It's a
very mature technology.
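To make the clean-failure case concrete (a hypothetical sketch, not
any vendor's implementation): a member link detected as down gets
removed from the hash pool and flows simply rehash onto the survivors,
whereas a link that still reports "up" while dropping packets keeps
its share of flows:

    # Hypothetical failover sketch: a member detected as down is
    # dropped from the pool and flows rehash onto the survivors.
    LINKS = ["eth0", "eth1", "eth2", "eth3"]

    def pick_link(flow, links, down=frozenset()):
        alive = [l for l in links if l not in down]
        return alive[hash(flow) % len(alive)]  # stand-in for the hardware hash

    flow = ("10.0.0.1", "10.0.0.2", "tcp", 51234, 443)
    print(pick_link(flow, LINKS))                 # all members healthy
    print(pick_link(flow, LINKS, down={"eth2"}))  # detected failure: rehashed
    # A member that is merely lossy but still reports "up" never leaves
    # the pool, so the flows hashed onto it just keep breaking.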
Why would this disrupt TCP's delivery guarantees (admittedly I'm
assuming the application traffic was TCP and not UDP)? Presumably
the packet would fail to reach the other side, so the sender would
resend after failing to get an ACK?
When you lose a TCP packet, the congestion window (effectively
bandwidth, but subtly different) automatically drops by half (backoff).
But the increase back up to full speed is very gradual, only one
MSS-sized step per round trip. Therefore small amounts of loss can have
drastic effects on throughput.
See: http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm
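A toy model makes the asymmetry obvious (this is a deliberately
simplified AIMD simulation, not a faithful TCP implementation; the
window cap and round-trip count are arbitrary):

    # Toy AIMD model: the window grows by one segment per round trip
    # and halves on any loss; average window tracks relative throughput.
    import random

    def avg_window(loss_rate, rtts=100_000, cap=100):
        cwnd, total = 1.0, 0.0
        for _ in range(rtts):
            # chance of at least one loss among cwnd packets in flight
            if random.random() < 1 - (1 - loss_rate) ** cwnd:
                cwnd = max(1.0, cwnd / 2)   # multiplicative decrease
            else:
                cwnd = min(cap, cwnd + 1)   # additive increase
            total += cwnd
        return total / rtts

    for p in (0.0001, 0.001, 0.01):
        print(f"loss {p:.2%}: average window ~ {avg_window(p):.1f} segments")

Even a 0.1% loss rate keeps the average window far below the cap,
because each halving takes many round trips to climb back from.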
Here's an old ACM paper on TCP loss behavior and what happens as
loss approaches 0.1% (much terribleness -- see Figure 4):
http://ccr.sigcomm.org/archive/1997/jul97/ccr-9707-mathis.pdf
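That's the Mathis et al. paper, and its headline result is the
throughput bound BW <= (MSS/RTT) * C/sqrt(p), with C ~ sqrt(3/2).
Plugging in some assumed numbers (1460-byte MSS, 50 ms RTT; neither is
from this thread) shows why:

    # Mathis et al. bound: throughput <= (MSS / RTT) * (C / sqrt(p)).
    # MSS and RTT below are assumed values, not from the original thread.
    from math import sqrt

    MSS_BITS = 1460 * 8   # typical Ethernet-sized segment
    RTT = 0.050           # assume a 50 ms round trip
    C = sqrt(3 / 2)       # ~1.22, the model's constant

    def ceiling_mbps(p):
        return (MSS_BITS / RTT) * (C / sqrt(p)) / 1e6

    for p in (0.0001, 0.001, 0.01):
        print(f"loss {p:.2%}: single-flow ceiling ~ {ceiling_mbps(p):.1f} Mbit/s")

At 0.1% loss and a 50 ms RTT, a single TCP flow tops out around
9 Mbit/s no matter how fat the pipe is.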
Hmm... sure, I'd expected there to be some performance loss due to that,
but I guess I didn't really expect that the result would be complete
failure, that "a little while after opening a socket and sending data,
the sending side would stop sending data." I suppose that could be as
much an application-level reaction as a network one, though.
It wasn't that unusual to end up dealing with packet loss on WANs when I
was part of an ISP's NOC team, but I don't think I ever dealt with a
case where it caused an application to fail completely. Understandably,
the customers who would spot packet loss problems soonest were those
who used their connections for VoIP, where the disruption would
naturally cause audible glitches.
Paul