Andrew> excuse my tardiness; i was on a conference bridge yesterday
Andrew> for 11 hours helping resolve a sev 1 issue caused by said TCP
Andrew> 0 window messages.

Ouch, not fun at all!  

Andrew> there is way too much context for me to type here but let me
Andrew> relate the highlights as a "lessons learned" missive.

Thank you for sharing this with the rest of us.  It's always good to
get after-the-fact reports so that others can learn from them.

Andrew> our application gets a modest amount of data streamed at it
Andrew> (50MB/s) from a set of servers on the other side of a
Andrew> firewall. the visible symptom is that just a little while
Andrew> after opening a socket and sending data, the sending side
Andrew> would stop sending data.

So as I understand it, you have:

   app-servers <-> channel <-> firewall <-> channel <-> src-servers

The app-servers are Linux; I don't recall if you mentioned anything
about the src-servers.

Andrew> around 4am thursday morning, the above symptom kicked in for
Andrew> all 80 sockets connecting 5 source servers to 8 destination
Andrew> servers roughly simultaneously (within 2-4 minutes). we had
Andrew> changed nothing on the server's software.

Is this when the channel lost some of its links?  Has anyone figured
out when those went down?

Andrew> throughout the day, as the sev 1 callout got escalated and
Andrew> different shifts of people got called, i had to re-explain our
Andrew> theory that although the networking folks could not find
Andrew> anything at all, save one link that occasionally had some
Andrew> dropped packets (around 0.1-1%), it had to be an external
Andrew> thing (because all 8 servers got it at the same time) and was
Andrew> most likely the network. we ended up rebooting all the servers
Andrew> a few times and reloading all the app software, to no avail.

So that tends to make me think that it's not Linux's networking
causing the problem, but something in the switches, routers or
firewall and how they handled the channel failures.  

Andrew> because there was literally nothing else to do, the networking
Andrew> folks worked on the link that was dropping packets and found
Andrew> it was a 4 port channel with a couple of ports down and
Andrew> utilisation was high. so given they actually had a guy on
Andrew> site, they had him go and check it physically. long story
Andrew> short, he was able to replace a couple broken GBICs and fix a
Andrew> miscabling and voila! the link was now running at 4x the
Andrew> original speed.  and no packet loss!!

This implies that the channel was not fully working even before the
incident: at least one link was already down due to miscabling, and
then another link failed, which dropped the channel to 50% of its
links.  Correct?

Andrew> and wouldn't you know it, approximately 10s later, no more TCP
Andrew> 0 window messages and our data streaming started working. and
Andrew> has worked flawlessly since. so despite the fact that "this
Andrew> minorly defective link couldn't possibly" cause the problem,
Andrew> it apparently did. (although no-one could explain the
Andrew> mechanism.)

This, to me, points to the channel being the problem, or more
accurately, the end-points of that channel being the root cause of the
problem, where the packets going over that channel were getting
corrupted in some manner.  And it seems like it would be a simple
thing to re-create in the lab to test out.

In any case, the root cause is *still* not known, though I'm sure that
the Linux networking guys would be interested in knowing about this,
if you can share some tcpdump captures with them and more details on
your exact setup.  And of course since I bet you're using RHEL kernels,
the first thing they will ask you to do is upgrade to a recent upstream
kernel.  Which of course you might not want to do in a production
setup.  So that leaves going back to RH for support.  And the
switch/router/firewall vendors.  Yay, lots of finger pointing!
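
For anyone sifting through captures for the zero window messages, here
is a rough Python sketch of what to look for.  It is not taken from
Andrew's setup: the function names are made up for illustration, and it
assumes you have already extracted the raw TCP header bytes of each
segment from the capture.  It only reads the 16-bit window field of the
fixed 20-byte TCP header and skips RST segments, which commonly carry a
window of 0 anyway.

```python
import struct

RST = 0x004  # RST bit within the TCP flags field


def tcp_window_and_flags(tcp_header: bytes):
    """Return (window, flags) from the fixed part of a raw TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Fixed header layout: src port, dst port, seq, ack,
    # data offset + flags, window, checksum, urgent pointer.
    (_src, _dst, _seq, _ack,
     off_flags, window, _csum, _urg) = struct.unpack("!HHIIHHHH", tcp_header[:20])
    flags = off_flags & 0x01FF  # low 9 bits hold the flag bits
    return window, flags


def is_zero_window(tcp_header: bytes) -> bool:
    """True if the segment advertises a receive window of 0 (RSTs ignored)."""
    window, flags = tcp_window_and_flags(tcp_header)
    return window == 0 and not (flags & RST)
```

A raw window of 0 means zero regardless of any window scaling
negotiated at connection setup, so the check does not need the scale
factor.  Which side sends these segments, and when they start relative
to the packet loss on the channel, is the detail worth passing along.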

Good luck, and I'm certainly interested to hear what you come up with
as a root cause.

John
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
