date:20110325

Re: The tale of a TCP bug

2011-03-25 Thread John Baldwin

On Thursday, March 24, 2011 7:02:35 pm Stefan `Sec` Zehl wrote:
> Hi,
> 
> I just subscribed to this list, so sorry if I missed some previous
> discussion on this.
> 
> On Thu, Mar 24, 2011 at 16:15 -0400, John Baldwin wrote:
> [...]
> > Otherwise, something like this may apply instead:
> > 
> > Index: tcp_input.c
> > ===
> > --- tcp_input.c (revision 219911)
> > +++ tcp_input.c (working copy)
> > @@ -1694,7 +1694,10 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th,
> > win = sbspace(&so->so_rcv);
> > if (win < 0)
> > win = 0;
> > -   tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
> > +   if (SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt))
> > +   tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
> > +   else
> > +   tp->rcv_wnd = win;
> >  
> > /* Reset receive buffer auto scaling when not in bulk receive mode. */
> > tp->rfbuf_ts = 0;
> > 
> > I think that will fix tp->rcv_wnd to be correct in this case thus fixing
> > further uses of it.
> 
> I just quickly tested it on my bug scenario,  and it still generates
> adv=-1 in tcp_output
> 
> That is because win=65536, which is bigger than the actually advertised
> window (65535, the max that can be advertised without window scaling).

Ah, ok.  Can you try this patch first (without the other)?  If it doesn't
work then we can refine the patch above further.

Index: tcp_output.c
===
--- tcp_output.c(revision 215582)
+++ tcp_output.c(working copy)
@@ -928,7 +928,8 @@
if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
recwin < (long)tp->t_maxseg)
recwin = 0;
-   if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
+   if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
+   recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
recwin = (long)TCP_MAXWIN << tp->rcv_scale;


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: The tale of a TCP bug

2011-03-25 Thread Stefan `Sec` Zehl

Hi,

On Fri, Mar 25, 2011 at 08:25 -0400, John Baldwin wrote:
> Ah, ok.  Can you try this patch first (without the other)?  If it doesn't
> work then we can refine the patch above further.

I tried it completely unpatched and with your new patch. In both cases
that if() statement is not taken. 

Instrumenting this part of the code with printf()s shows that recwin is
65536 right after your patched if, but reduced to 65535 by the next
statment.

|   if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
|   recwin = (long)TCP_MAXWIN << tp->rcv_scale;

That's the same effect as in the the affected adv calculation:

% long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
%   (tp->rcv_adv - tp->rcv_nxt);

recwin is 65535, but the min limits it to 65535.

CU,
Sec
-- 
But anyway, once I did that, it ran fine!  Accelerated, full-screen.
Then I remembered I don't like Quake :-)
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: The tale of a TCP bug

2011-03-25 Thread John Baldwin

On Friday, March 25, 2011 3:41:09 pm Stefan `Sec` Zehl wrote:
> Hi,
> 
> On Fri, Mar 25, 2011 at 08:25 -0400, John Baldwin wrote:
> > Ah, ok.  Can you try this patch first (without the other)?  If it doesn't
> > work then we can refine the patch above further.
> 
> I tried it completely unpatched and with your new patch. In both cases
> that if() statement is not taken. 
> 
> Instrumenting this part of the code with printf()s shows that recwin is
> 65536 right after your patched if, but reduced to 65535 by the next
> statment.
> 
> | if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
> | recwin = (long)TCP_MAXWIN << tp->rcv_scale;
> 
> That's the same effect as in the the affected adv calculation:
> 
> % long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
> %   (tp->rcv_adv - tp->rcv_nxt);
> 
> recwin is 65535, but the min limits it to 65535.

Reading some more.  I'm trying to understand the breakage in your case.

You are saying that FreeBSD is the sender, who has data to send, yet is not 
sending any window probes because it never starts the persist timer when the 
initial window is zero?  Is that correct?

And the problem is that the code that uses 'adv' to determine if it sound send 
a window update to the remote end is falsely succeeding due to the overflow 
causing tcp_output() to 'goto send' but that it then fails to send any data 
because it thinks the remote window is full?

So one thing I don't quite follow is how you are having rcv_nxt > rcv_adv.  I 
saw this when the other side would send a window probe, and then the receiving 
side would take the -1 remaining window and explode it into the maximum window 
size when it ACKd.

Are you seeing the other end of the connection send a window probe, but 
FreeBSD is not setting the persist timer so that it will send its own window 
probes?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: why use INP_WLOCK instead of INP_RLOCK

2011-03-25 Thread John Baldwin

On Tuesday, February 01, 2011 12:54:33 am Jim wrote:
> Hi All,
> 
> I am not sure if anybody has asked it before. I could not find answer by
> doing rough search on Internet, if it is duplicate question, sorry in
> advance.
> 
> My question is that, for getting socket options in tcp_ctloutput() in
> tcp_usrreq.c, why do we need to do lock with INP_WLOCK(inp) as setting
> socket options does. Why do we just use INP_RLOCK(inp), as it looks not
> changing anything in tcp control block?
> 
> Thank you for your kindly answer.

I think mostly it is just because no one has bothered to change it.  
Realistically it probably won't make any noticable difference unless your 
workload consists of doing lots of calls to getsockopt() but not sending any 
actual traffic on the associated sockets. :)  (Almost all of the other 
operations on a TCP connection require a write lock on the pcb.)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: why use INP_WLOCK instead of INP_RLOCK

2011-03-25 Thread Robert N. M. Watson

On 25 Mar 2011, at 21:01, John Baldwin wrote:

> On Tuesday, February 01, 2011 12:54:33 am Jim wrote:
>> I am not sure if anybody has asked it before. I could not find answer by
>> doing rough search on Internet, if it is duplicate question, sorry in
>> advance.
>> 
>> My question is that, for getting socket options in tcp_ctloutput() in
>> tcp_usrreq.c, why do we need to do lock with INP_WLOCK(inp) as setting
>> socket options does. Why do we just use INP_RLOCK(inp), as it looks not
>> changing anything in tcp control block?
> 
> I think mostly it is just because no one has bothered to change it.  
> Realistically it probably won't make any noticable difference unless your 
> workload consists of doing lots of calls to getsockopt() but not sending any 
> actual traffic on the associated sockets. :)  (Almost all of the other 
> operations on a TCP connection require a write lock on the pcb.)

Just to reiterate John's point here: the critical performance paths for TCP 
both require the inpcb lock to be held exclusively (input and output), and 
socket options are typically called from the same user thread doing I/O, 
meaning that acquiring read locks instead of write locks is unlikely to make 
any measurable difference. However, in principle I believe most if not all 
getsockopt()'s in TCP should be fine with just a read lock, and for socket 
options used with UDP, there might well be some benefit to using a read lock, 
since most UDP operations use read locks and note write locks on the inpcb.

I should further note that Jeff Roberson has some exciting in-progress work to 
reduce transmit-input contention on the inpcb that appears to make quite a 
noticeable difference in improving TCP performance. We don't have much global 
lock contention currently when in the steady state, but the per-connection 
locks do get heavily contended. His work is similar to some work done in the 
Linux stack a year or two ago to defer input processing to the user thread 
rather than contending on the inpcb lock, if it's already held. Hopefully we'll 
see the results of that work in 9.0, and possibly backported to 8.x.

I also have a large pending patchset adding connection group support, and 
aligning software lookup tables with hardware work distribution via RSS, which 
is due to go in before 9.0.

Robert___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

PR kern/155772 (&& maybe kern/155555)...

2011-03-25 Thread Sean Chittenden

[ Please keep me CC'ed ]

Hello. It looks like I'm running in to kern/155772. The ticket was sparse on 
details so here's how to reproduce the problem:

I have two routers using openospfd && bgpd:

   br1 <-- igb0 -> br2
|  |
| igb1 | igb1
|  |
+--+
  10.1.1.0/24


br1: ifconfig igb1 inet 10.1.1.2 netmask 255.255.255.0
br2: ifconfig igb1 inet 10.1.1.3 netmask 255.255.255.0


When I restart ospfd on br2 (not reload), *br1* looses its connectivity to all 
directly connected routes (both igb0 and igb1). If I perform a `route -n get 
10.1.1.10` on br1, route(8) returns the right information for igb1, however 
when I ping 10.1.1.10 from br1, I get a network not reachable message. 
`ifconfig igb1` returns the right information as well. If I re-add the IP 
address via `ifconfig igb1 inet 10.1.1.2 netmask 255.255.255.255` I am able to 
reach the 10.1.1.0/24 network. It's like the FIB information is being removed 
for directly connected routes when ospfd updates br1's FIB. I've actually 
observed the same behavior when retracting statically connected routes via 
openbgpd as well (while attempting to trace the root of this problem, I 
configured an iBGP session to use a "next-hop self" rule and observed an 
identical behavior, same with quagga too).

I was thinking this was related to kern/15 but I don't think that's the 
direct problem.  http://www.freebsd.org/cgi/query-pr.cgi?pr=15

If anyone has patches or ideas, let me know. I have plenty of details if 
anyone's interested, but this seems so easy to reproduce or like a glaring 
regression. It feels very much like a refcount for routes isn't being 
incremented along the way or that some piece of logic is removing all matching 
routes, not just the routes learned from a routing protocol.


On a similar, but unrelated note of carp(4), I ran in to the problem described 
in the following post during the initial setup of these two routers:

http://lists.freebsd.org/pipermail/freebsd-net/2011-March/028349.html

This happened because br2's default route was pointing to br1, br1 was 
receiving its own carp packets while br2 was forwarding them back. Recompiling 
with MROUTING and enabling igmpproxy(8) seemed to fix the problem.

Thanks in advance. -sc


PS This one also bit me during turn up: 
http://www.freebsd.org/cgi/query-pr.cgi?pr=155030

PPS And, if you had a downstream hosts that have pf(4) enabled and had their 
default route set to 10.1.1.2, once you changed the default route to 10.1.1.1 
or 10.1.1.3, all of the hosts that had pf(4) on them had to have their firewall 
rules reloaded (e.g. /etc/rc.d/pf reload) to reflect this changed default 
route. In previous 7.X, pf(4) picked up this change without needing to reload 
the rules.


--
Sean Chittenden
s...@chittenden.org

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: The tale of a TCP bug

Re: The tale of a TCP bug

Re: The tale of a TCP bug

Re: why use INP_WLOCK instead of INP_RLOCK

Re: why use INP_WLOCK instead of INP_RLOCK

PR kern/155772 (&& maybe kern/155555)...

6 matches

Site Navigation

Mail list logo

Footer information