This is a very bizarre edge case, so bear with me. I was debugging an edge case at work recently that occurred when the socket buffer was filled up exactly (i.e. sbspace(&so->so_rcv) == 0). In TCP terms, this would be when rcv_nxt == rcv_adv. To simulate the real workload I had a very fast writer blasting over lo0 to a slow reader but had used a small buffer and turned off window scaling (as I had to fill the entire socket buffer, I was chasing an off-by-1 bug). However, I ended up with some bizarre behavior.
I think it is less confusing to describe the sequence of events that I now know happened than how I figured this out, so here goes. - Assume we have advertised a window size of N which corresponds exactly to sbspace(&so->so_rcv). - The remote peer sends a packet of length N filling our window. We respond with a zero-window ACK. This advances rcv_nxt to == rcv_adv, but it does not grow rcv_adv because sbspace() is currently 0. - The userland app very slowly drains data from the socket buffer. However, the calls to tcp_usr_recvd() do not trigger a window update because in this case the link is over lo0 which has a relatively large t_maxseg (about 14k) and this condition in tcp_output() is not met: if (adv >= (long) (2 * tp->t_maxseg)) goto send; if (2 * adv >= (long) so->so_rcv.sb_hiwat) goto send; - A timer at the remote peer expires and it sends a window probe with one byte of data. Since userland has read some data (just not 2 * MSS), we accept this packet. However, receiving this packet moves rcv_nxt += 1, so rcv_nxt is now > rcv_adv. - We call tcp_output() to ACK the window probe and as part of this calculate the receive window to advertise here: if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && recwin < (long)tp->t_maxseg) recwin = 0; if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt); if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) recwin = (long)TCP_MAXWIN << tp->rcv_scale; The "surprise" kicks in on the second conditional. The problem is that rcv_adv - rcv_nxt is now equal to (uint32_t)-1. On a 32-bit machine the cast to (long) effectively just makes this value signed and thus -1. On a 64-bit machine you actually end up with a ginormous value of 2^32 - 1, or a 4GB window (minus a byte). The third conditional truncates that to the maximum window we can advertise, but this value may be larger than the actual space in the socket buffer. The remote peer now has a huge window to throw data into. At work this proved disastrous. I'm not sure if there are any practical concerns. This is the patch I'm using as a fix: Index: tcp_output.c =================================================================== --- tcp_output.c (revision 215582) +++ tcp_output.c (working copy) @@ -928,7 +928,8 @@ if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && recwin < (long)tp->t_maxseg) recwin = 0; - if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) && + recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt); if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) recwin = (long)TCP_MAXWIN << tp->rcv_scale; -- John Baldwin _______________________________________________ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"