On Sun, Jun 3, 2012 at 5:22 PM, Lawrence Stewart <lstew...@freebsd.org> wrote:
> On 06/03/12 15:18, Kevin Oberman wrote:
>>
>> On Fri, Jun 1, 2012 at 2:48 AM, Lawrence Stewart <lstew...@freebsd.org>
>> wrote:
>>>
>>> On 05/31/12 13:33, Kevin Oberman wrote:
>>> [snip]
>>>>
>>>> I used SIFTR at the suggestion of Lawrence Stewart, who headed the
>>>> project to bring pluggable congestion algorithms to FreeBSD, and found
>>>> really odd congestion behavior. First, I do see a triple ACK, but the
>>>> congestion window suddenly drops from 73K to 8K. If I understand
>>>> CUBIC, it should halve the congestion window, not what is happening.
>>>> It then increases slowly (in slow start) to 82K. While the slow-start
>>>> bytes are INCREASING, the congestion window again goes to 8K while the
>>>> SS size moves from 36K up to 52K. It just continues to bounce wildly
>>>> between 8K (always the low point) and between 64K and 82K. The swings
>>>> start at 83K and, over the first few seconds, the peaks drop to about
>>>> 64K.
>>>
>>> Oh, and a comment about this behaviour. Dropping back to 8k (1MSS) is
>>> only nasty if the TF_{CONG|FAST}RECOVERY flags are *not* set, i.e. if
>>> you see cwnd grow, drop to 8k with those flags set, and then when the
>>> flags are unset, cwnd starts at the value of ssthresh, then that is
>>> perfectly normal recovery behaviour. What *is* nasty is if an RTO
>>> fires, which will reset cwnd to 8k, ssthresh to 2*MSS and make the
>>> connection effectively start from scratch again.
>>>
>>> There is evidence of RTOs in your siftr output, which is bad news, e.g.
>>> here's one example of 2 side-by-side log lines from your trace:
>>>
>>> # Direction,time,ssthresh,cwnd,flags
>>> i,1338319593.574706,27044,27044,1630544864
>>> o,1338319593.831482,16384,8192,1092625377
>>>
>>> Note the 300ms gap, and how cwnd resets to 1MSS and flags go from
>>> 1630544864
>>> (TF_WASCRECOVERY|TF_CONGRECOVERY|TF_WASFRECOVERY|TF_FASTRECOVERY) to
>>> 1092625377 (TF_WASCRECOVERY|TF_WASFRECOVERY).
>>
>> What can I say but that you are right. When I looked at the interface
>> stats I found that the link overflow drops were through the roof! This
>> confuses me a bit since the traffic is outbound and I would assume
>> from the description on the Myricom web page that these are input
>> drops. A problem with that card? On systems that are working
>> "normally", I still see a sharp drop with the ToS bits set, but
>> nothing nearly as drastic. Now it is a drop from 4.5G to 728M on a
>> cross-country (US) circuit.
>>
>> I am now looking for issues on the route that might explain the
>> performance, but the question of why the drop-off only shows up in
>> FreeBSD 8 means something odd is still going on. It is even possible
>> that the problem is with 7 and the losses are due to the policy for
>> ToS 32 on the path. ToS 32 is less than best effort in our network.
>> Maybe the marking was getting lost on 7. Not likely, but possible.
>
> The receiver is FreeBSD 7? If so, have you tuned your reassembly queue on
> that machine? If not, that could explain the RTOs you're seeing. Send
> through the output of "sysctl net.inet.tcp.reass" and "netstat -sp tcp"
> obtained from the receiver immediately before and after running a short
> ToS=32 test.

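For anyone following along, the flag values in the two quoted log lines can be checked mechanically. The following is only a rough sketch: the bit values are assumed to match the TF_* definitions in sys/netinet/tcp_var.h of the FreeBSD 8.x/9.x era, the five-column layout is the abbreviated form quoted in this thread (real siftr lines carry many more fields), and parse_line, recovery_flags and looks_like_rto are hypothetical helper names, not part of siftr. The 8192-byte MSS is taken from the "8k (1MSS)" remark above.

```python
# Bit values assumed from FreeBSD's sys/netinet/tcp_var.h (8.x/9.x era);
# verify them against your own source tree before relying on this.
RECOVERY_FLAGS = {
    "TF_FASTRECOVERY": 0x00100000,
    "TF_WASFRECOVERY": 0x00200000,
    "TF_CONGRECOVERY": 0x20000000,
    "TF_WASCRECOVERY": 0x40000000,
}

# Flags that mean the connection is currently in a recovery episode.
IN_RECOVERY = (RECOVERY_FLAGS["TF_FASTRECOVERY"]
               | RECOVERY_FLAGS["TF_CONGRECOVERY"])


def parse_line(line):
    """Parse the abbreviated 'direction,time,ssthresh,cwnd,flags' excerpt.

    This is NOT the full siftr log format, only the five columns
    quoted in the thread.
    """
    direction, time, ssthresh, cwnd, flags = line.strip().split(",")
    return {
        "direction": direction,
        "time": float(time),
        "ssthresh": int(ssthresh),
        "cwnd": int(cwnd),
        "flags": int(flags),
    }


def recovery_flags(flags):
    """Return the names of the recovery-related flags set in 'flags'."""
    return [name for name, bit in RECOVERY_FLAGS.items() if flags & bit]


def looks_like_rto(entry, mss):
    """Heuristic based on the description above: cwnd is back at one MSS
    while neither in-recovery flag is set."""
    return entry["cwnd"] == mss and (entry["flags"] & IN_RECOVERY) == 0


prev = parse_line("i,1338319593.574706,27044,27044,1630544864")
cur = parse_line("o,1338319593.831482,16384,8192,1092625377")

print("before:", recovery_flags(prev["flags"]))
print("after: ", recovery_flags(cur["flags"]))
print("gap (s):", round(cur["time"] - prev["time"], 6))
print("RTO suspected:", looks_like_rto(cur, mss=8192))
```

The heuristic only flags candidates. A cwnd of one MSS with the in-recovery flags still set is the normal recovery case described above and is deliberately not reported.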
I just wanted to let those kind enough to help with this know that, after
a lot of testing, I have analyzed the problem and pretty much understand
what is happening.

First, the problem is clearly tied to FreeBSD 8, but it is not anything
wrong with FreeBSD. Instead, it is a real fluke problem with the handling
of the DSCP and ToS bits by a single Juniper router when TSO is used. V7
did not support ToS, so v7 does not show the problem.

I have done packet captures on both ends and something really strange
happens with TSO. I see a couple of large segments move normally. Then
things start getting weird. As soon as slow start allows things to speed
up just a bit, the second segment of a transfer is discarded and TCP
tries to recover, but with the long pipe (RTT is around 50 ms at 10G),
there is a lot of data in the pipe when the problem is detected and the
duplicate ACK (effectively a NAK) is sent. Actually, 7 or 8 are sent
before the transmitting system receives one and can start to recover.
Then it just happens again and again.

The root problem is a router that seems to be re-marking the ToS bits
from 0x20 to 0x24, which adds the "loss priority" bit. Even though the
circuit is not busy, TSO results in all segments being sent back-to-back
and, with the change in the ToS byte, the second packet gets dropped if
ANY other traffic is present.

We have a ticket open with the router vendor and I hope that we can get
this resolved quickly, but I would not bet on it. In any case, it is not
a FreeBSD issue, though some of what I see makes me suspect that our
stack may not be responding well to this situation of massive loss in
large segments. But the losses are so severe that I am far from certain
and really can't expect anything but terrible results.

Again, thanks to Lawrence, Bjorn, and Andrew for their efforts to look at
this, and to the Wireshark folks, without whom I would probably still be
trying to understand what is going on.
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6...@gmail.com
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"