Andrew> excuse my tardiness; i was on a conference bridge yesterday
Andrew> for 11 hours helping resolve a sev 1 issue caused by said TCP
Andrew> 0 window messages.

Ouch, not fun at all!

Andrew> there is way too much context for me to type here but let me
Andrew> relate the highlights as a "lessons learned" missive.

Thank you for sharing this with the rest of us. It's always good to
get the after-the-fact reports so that others can hopefully learn.

Andrew> our application gets a modest amount of data streamed at it
Andrew> (50MB/s) from a set of servers on the other side of a
Andrew> firewall. the visible symptom is that just a little while
Andrew> after opening a socket and sending data, the sending side
Andrew> would stop sending data.

So as I understand it, you have:

    app-servers <-> channel <-> firewall <-> channel <-> src-servers

The app-servers are Linux; I don't recall if you mentioned anything
about the src-servers.

Andrew> around 4am thursday morning, the above symptom kicked in for
Andrew> all 80 sockets connecting 5 source servers to 8 destination
Andrew> servers roughly simultaneously (within 2-4 minutes). we had
Andrew> changed nothing on the server's software.

Is this when the channel lost some of its links? Has anyone figured
out when those went down?

Andrew> throughout the day, as the sev 1 callout got escalated and
Andrew> different shifts of people got called, i had to re-explain our
Andrew> theory that although the networking folks could not find
Andrew> anything at all, save one link that occasionally had some
Andrew> dropped packets (around 0.1-1%), it had to be an external
Andrew> thing (because all 8 servers got it at the same time) and was
Andrew> most likely the network. we ended up rebooting all the servers
Andrew> a few times and reloading all the app software, to no avail.

So that tends to make me think that it's not Linux's networking
causing the problem, but something in the switches, routers or
firewall and how they handled the channel failures.

Andrew> because there was literally nothing else to do, the networking
Andrew> folks worked on the link that was dropping packets and found
Andrew> it was a 4 port channel with a couple of ports down and
Andrew> utilisation was high. so given they actually had a guy on
Andrew> site, they had him go and check it physically. long story
Andrew> short, he was able to replace a couple broken GBICs and fix a
Andrew> miscabling and voila! the link was now running at 4x the
Andrew> original speed. and no packet loss!!

This implies that the channel was not working properly even before
the incident: it already had at least one link down due to
mis-cabling, and then a failure in another link dropped it to 50% of
its links, correct?

Andrew> and wouldn't you know it, approximately 10s later, no more TCP
Andrew> 0 window messages and our data streaming started working. and
Andrew> has worked flawlessly since. so despite the fact that "this
Andrew> minorly defective link couldn't possibly" cause the problem,
Andrew> it apparently did. (although no-one could explain the
Andrew> mechanism.)

This, to me, points to the channel being the problem, or more
accurately, to the end-points of that channel being the root cause,
with the packets going over that channel getting corrupted in some
manner. And it seems like it would be a simple thing to re-create in
the lab to test out.

In any case, the root cause is *still* not known, though I'm sure
that the Linux networking guys would be interested in knowing about
this, if you can share some tcp dumps with them and more details on
your exact setup.

And of course, since I bet you're using RHEL kernels, the first thing
they will ask you to do is upgrade to a recent upstream kernel. Which
of course you might not want to do in a production setup. So that
leaves going back to RH for support. And the switch/router/firewall
vendors. Yay, lots of finger pointing!

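If it helps when pulling those tcp dumps together, here is a minimal
sketch (not anything from your setup) that counts zero-window
advertisements per direction in tcpdump's default one-line text
output. The function name, the regular expression and the sample
lines are all made up for illustration, and it assumes IPv4 lines of
the usual "IP src > dst: Flags [...], ..., win N, ..." shape.

```python
import re
from collections import Counter

# Assumed input: tcpdump's default text output for TCP over IPv4, e.g.
# "04:01:12.000150 IP 10.0.1.8.9000 > 10.0.0.5.40000: Flags [.], ack 1449, win 0, length 0"
LINE_RE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]*)\].*?\bwin (?P<win>\d+)"
)


def zero_window_counts(lines):
    """Count 'win 0' segments per (sender, receiver) pair.

    The sender of a zero-window segment is the side telling its peer
    to stop sending, so the key's first element is the stalled receiver.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a TCP line in the assumed format
        if "R" in m.group("flags"):
            continue  # resets commonly carry win 0; not a flow-control signal
        if int(m.group("win")) == 0:
            counts[(m.group("src"), m.group("dst"))] += 1
    return counts


# Hypothetical sample lines, invented purely to exercise the function.
SAMPLE = [
    "04:01:12.000001 IP 10.0.0.5.40000 > 10.0.1.8.9000: Flags [P.], seq 1:1449, ack 1, win 229, length 1448",
    "04:01:12.000150 IP 10.0.1.8.9000 > 10.0.0.5.40000: Flags [.], ack 1449, win 0, length 0",
    "04:01:12.200150 IP 10.0.1.8.9000 > 10.0.0.5.40000: Flags [.], ack 1449, win 0, length 0",
    "04:01:13.000000 IP 10.0.1.8.9000 > 10.0.0.5.40000: Flags [R.], seq 1, ack 1449, win 0, length 0",
]

if __name__ == "__main__":
    for (src, dst), n in zero_window_counts(SAMPLE).items():
        print(f"{src} -> {dst}: {n} zero-window segments")
```

Running something like this over captures taken on both sides of the
firewall would at least show which end was advertising the zero
windows and when they started relative to the channel losing links.
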
Good luck, and I'm certainly interested to hear what you come up with
as a root cause.

John