Re: The tale of a TCP bug
On Thursday, March 24, 2011 7:02:35 pm Stefan `Sec` Zehl wrote: > Hi, > > I just subscribed to this list, so sorry if I missed some previous > discussion on this. > > On Thu, Mar 24, 2011 at 16:15 -0400, John Baldwin wrote: > [...] > > Otherwise, something like this may apply instead: > > > > Index: tcp_input.c > > === > > --- tcp_input.c (revision 219911) > > +++ tcp_input.c (working copy) > > @@ -1694,7 +1694,10 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th, > > win = sbspace(&so->so_rcv); > > if (win < 0) > > win = 0; > > - tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt)); > > + if (SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt)) > > + tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt)); > > + else > > + tp->rcv_wnd = win; > > > > /* Reset receive buffer auto scaling when not in bulk receive mode. */ > > tp->rfbuf_ts = 0; > > > > I think that will fix tp->rcv_wnd to be correct in this case thus fixing > > further uses of it. > > I just quickly tested it on my bug scenario, and it still generates > adv=-1 in tcp_output > > That is because win=65536, which is bigger than the actually advertised > window (65535, the max that can be advertised without window scaling). Ah, ok. Can you try this patch first (without the other)? If it doesn't work then we can refine the patch above further. Index: tcp_output.c === --- tcp_output.c(revision 215582) +++ tcp_output.c(working copy) @@ -928,7 +928,8 @@ if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && recwin < (long)tp->t_maxseg) recwin = 0; - if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) && + recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt); if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) recwin = (long)TCP_MAXWIN << tp->rcv_scale; -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: The tale of a TCP bug
Hi, On Fri, Mar 25, 2011 at 08:25 -0400, John Baldwin wrote: > Ah, ok. Can you try this patch first (without the other)? If it doesn't > work then we can refine the patch above further. I tried it completely unpatched and with your new patch. In both cases that if() statement is not taken. Instrumenting this part of the code with printf()s shows that recwin is 65536 right after your patched if, but reduced to 65535 by the next statment. | if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) | recwin = (long)TCP_MAXWIN << tp->rcv_scale; That's the same effect as in the the affected adv calculation: % long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - % (tp->rcv_adv - tp->rcv_nxt); recwin is 65535, but the min limits it to 65535. CU, Sec -- But anyway, once I did that, it ran fine! Accelerated, full-screen. Then I remembered I don't like Quake :-) ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: The tale of a TCP bug
On Friday, March 25, 2011 3:41:09 pm Stefan `Sec` Zehl wrote: > Hi, > > On Fri, Mar 25, 2011 at 08:25 -0400, John Baldwin wrote: > > Ah, ok. Can you try this patch first (without the other)? If it doesn't > > work then we can refine the patch above further. > > I tried it completely unpatched and with your new patch. In both cases > that if() statement is not taken. > > Instrumenting this part of the code with printf()s shows that recwin is > 65536 right after your patched if, but reduced to 65535 by the next > statment. > > | if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) > | recwin = (long)TCP_MAXWIN << tp->rcv_scale; > > That's the same effect as in the the affected adv calculation: > > % long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - > % (tp->rcv_adv - tp->rcv_nxt); > > recwin is 65535, but the min limits it to 65535. Reading some more. I'm trying to understand the breakage in your case. You are saying that FreeBSD is the sender, who has data to send, yet is not sending any window probes because it never starts the persist timer when the initial window is zero? Is that correct? And the problem is that the code that uses 'adv' to determine if it sound send a window update to the remote end is falsely succeeding due to the overflow causing tcp_output() to 'goto send' but that it then fails to send any data because it thinks the remote window is full? So one thing I don't quite follow is how you are having rcv_nxt > rcv_adv. I saw this when the other side would send a window probe, and then the receiving side would take the -1 remaining window and explode it into the maximum window size when it ACKd. Are you seeing the other end of the connection send a window probe, but FreeBSD is not setting the persist timer so that it will send its own window probes? -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: why use INP_WLOCK instead of INP_RLOCK
On Tuesday, February 01, 2011 12:54:33 am Jim wrote: > Hi All, > > I am not sure if anybody has asked it before. I could not find answer by > doing rough search on Internet, if it is duplicate question, sorry in > advance. > > My question is that, for getting socket options in tcp_ctloutput() in > tcp_usrreq.c, why do we need to do lock with INP_WLOCK(inp) as setting > socket options does. Why do we just use INP_RLOCK(inp), as it looks not > changing anything in tcp control block? > > Thank you for your kindly answer. I think mostly it is just because no one has bothered to change it. Realistically it probably won't make any noticable difference unless your workload consists of doing lots of calls to getsockopt() but not sending any actual traffic on the associated sockets. :) (Almost all of the other operations on a TCP connection require a write lock on the pcb.) -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: why use INP_WLOCK instead of INP_RLOCK
On 25 Mar 2011, at 21:01, John Baldwin wrote: > On Tuesday, February 01, 2011 12:54:33 am Jim wrote: >> I am not sure if anybody has asked it before. I could not find answer by >> doing rough search on Internet, if it is duplicate question, sorry in >> advance. >> >> My question is that, for getting socket options in tcp_ctloutput() in >> tcp_usrreq.c, why do we need to do lock with INP_WLOCK(inp) as setting >> socket options does. Why do we just use INP_RLOCK(inp), as it looks not >> changing anything in tcp control block? > > I think mostly it is just because no one has bothered to change it. > Realistically it probably won't make any noticable difference unless your > workload consists of doing lots of calls to getsockopt() but not sending any > actual traffic on the associated sockets. :) (Almost all of the other > operations on a TCP connection require a write lock on the pcb.) Just to reiterate John's point here: the critical performance paths for TCP both require the inpcb lock to be held exclusively (input and output), and socket options are typically called from the same user thread doing I/O, meaning that acquiring read locks instead of write locks is unlikely to make any measurable difference. However, in principle I believe most if not all getsockopt()'s in TCP should be fine with just a read lock, and for socket options used with UDP, there might well be some benefit to using a read lock, since most UDP operations use read locks and note write locks on the inpcb. I should further note that Jeff Roberson has some exciting in-progress work to reduce transmit-input contention on the inpcb that appears to make quite a noticeable difference in improving TCP performance. We don't have much global lock contention currently when in the steady state, but the per-connection locks do get heavily contended. His work is similar to some work done in the Linux stack a year or two ago to defer input processing to the user thread rather than contending on the inpcb lock, if it's already held. Hopefully we'll see the results of that work in 9.0, and possibly backported to 8.x. I also have a large pending patchset adding connection group support, and aligning software lookup tables with hardware work distribution via RSS, which is due to go in before 9.0. Robert___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
PR kern/155772 (&& maybe kern/155555)...
[ Please keep me CC'ed ] Hello. It looks like I'm running in to kern/155772. The ticket was sparse on details so here's how to reproduce the problem: I have two routers using openospfd && bgpd: br1 <-- igb0 -> br2 | | | igb1 | igb1 | | +--+ 10.1.1.0/24 br1: ifconfig igb1 inet 10.1.1.2 netmask 255.255.255.0 br2: ifconfig igb1 inet 10.1.1.3 netmask 255.255.255.0 When I restart ospfd on br2 (not reload), *br1* looses its connectivity to all directly connected routes (both igb0 and igb1). If I perform a `route -n get 10.1.1.10` on br1, route(8) returns the right information for igb1, however when I ping 10.1.1.10 from br1, I get a network not reachable message. `ifconfig igb1` returns the right information as well. If I re-add the IP address via `ifconfig igb1 inet 10.1.1.2 netmask 255.255.255.255` I am able to reach the 10.1.1.0/24 network. It's like the FIB information is being removed for directly connected routes when ospfd updates br1's FIB. I've actually observed the same behavior when retracting statically connected routes via openbgpd as well (while attempting to trace the root of this problem, I configured an iBGP session to use a "next-hop self" rule and observed an identical behavior, same with quagga too). I was thinking this was related to kern/15 but I don't think that's the direct problem. http://www.freebsd.org/cgi/query-pr.cgi?pr=15 If anyone has patches or ideas, let me know. I have plenty of details if anyone's interested, but this seems so easy to reproduce or like a glaring regression. It feels very much like a refcount for routes isn't being incremented along the way or that some piece of logic is removing all matching routes, not just the routes learned from a routing protocol. On a similar, but unrelated note of carp(4), I ran in to the problem described in the following post during the initial setup of these two routers: http://lists.freebsd.org/pipermail/freebsd-net/2011-March/028349.html This happened because br2's default route was pointing to br1, br1 was receiving its own carp packets while br2 was forwarding them back. Recompiling with MROUTING and enabling igmpproxy(8) seemed to fix the problem. Thanks in advance. -sc PS This one also bit me during turn up: http://www.freebsd.org/cgi/query-pr.cgi?pr=155030 PPS And, if you had a downstream hosts that have pf(4) enabled and had their default route set to 10.1.1.2, once you changed the default route to 10.1.1.1 or 10.1.1.3, all of the hosts that had pf(4) on them had to have their firewall rules reloaded (e.g. /etc/rc.d/pf reload) to reflect this changed default route. In previous 7.X, pf(4) picked up this change without needing to reload the rules. -- Sean Chittenden s...@chittenden.org ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"