On 01/16/13 06:27, John Baldwin wrote:
> On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
>> Hi John,
>>
>> On 01/15/13 08:04, John Baldwin wrote:
>>> I was looking at TCP congestion control at work recently and noticed
>>> a few
>>
>> Poor you ;)
>>
>>> "odd" things in the current code. First, there is this chunk of code
>>> in cc_ack_received() in tcp_input.c:
>>>
>>> static void inline
>>> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
>>> {
>>> 	INP_WLOCK_ASSERT(tp->t_inpcb);
>>>
>>> 	tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>>> 	if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>>> 		tp->ccv->flags |= CCF_CWND_LIMITED;
>>> 	else
>>> 		tp->ccv->flags &= ~CCF_CWND_LIMITED;
>>>
>>> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not
>>> integers, so the call to min() results in truncation on 64-bit hosts.
>>
>> Good catch, but I don't think it matters in practice as neither
>> snd_cwnd nor snd_wnd will grow past the 32-bit boundary.
>
> I have a pathological case using cc_cubic where it seems to grow without
> bound, though that is a bug in and of itself (and this change did not
> fix that issue). I ended up not using cc_cubic (more below) and haven't
> been able to track down the root cause of the delay. I can probably
> provide a test case to reproduce this if you are interested.
hmm, I'd certainly be interested in hearing more about this issue with
cubic. If you think a test case is easy to come up with, please shoot it
through to me when you have the chance.

>>> It should probably be ulmin() instead. However, this line seems to be
>>> a really obfuscated way to just write:
>>>
>>> 	if (tp->snd_cwnd <= tp->snd_wnd)
>>
>> You are correct, though I'd argue the meaning of the existing code as
>> written is clearer compared to your suggested change.
>>
>>> If that is correct, I would vote for changing this to use the much
>>> simpler logic.
>>
>> Agreed. While I find the existing code slightly clearer in meaning,
>> it's not significant enough to warrant keeping it as is when your
>> suggested change is simpler, fixes a bug and achieves the same thing.
>> Happy for you to change it, or I can do it if you prefer.
>
> I'll leave that to you, thanks.

Committed as r245783.

>>> Secondly, in the particular case I was investigating at work (restart
>>> of an idle connection), the newreno congestion control code in 8.x
>>> and later uses a different algorithm than in 7. Specifically, in 7
>>> TCP would reuse the same logic used for an initial cwnd (honoring
>>> ss_fltsz). In 8 this no longer happens (instead, 2 is hardcoded). A
>>> guess at a possible fix might look something like this:
>>>
>>> Index: cc_newreno.c
>>> ===================================================================
>>> --- cc_newreno.c	(revision 243660)
>>> +++ cc_newreno.c	(working copy)
>>> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>>>  	if (V_tcp_do_rfc3390)
>>>  		rw = min(4 * CCV(ccv, t_maxseg),
>>>  		    max(2 * CCV(ccv, t_maxseg), 4380));
>>> +#if 1
>>>  	else
>>>  		rw = CCV(ccv, t_maxseg) * 2;
>>> +#else
>>> +	/* XXX: This is missing a lot of stuff that used to be in 7. */
>>> +#ifdef INET6
>>> +	else if ((isipv6 ?
>>> +	    in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
>>> +	    in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
>>> +#else
>>> +	else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
>>> +#endif
>>> +		rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
>>> +	else
>>> +		rw = V_ss_fltsz * CCV(ccv, t_maxseg);
>>> +#endif
>>>
>>>  	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>>> }
>>>
>>> (But using the #else clause instead of the current #if 1 code.) Was
>>> this change in 8 intentional?
>>
>> It was. Unlike connection initialisation, which still honours ss_fltsz
>> in cc_conn_init(), restarting an idle connection based on ss_fltsz
>> seemed particularly dubious and as such was omitted from the
>> refactored code.
>>
>> The ultimate goal was to remove the ss_fltsz hack completely and
>> implement a smarter mechanism, but that hasn't quite happened yet. The
>> removal of ss_fltsz from 10.x without providing a replacement
>> mechanism is not ideal and should probably be addressed.
>>
>> I'm guessing you're not using rfc3390 because you want to override the
>> initial window based on specific local knowledge of the path between
>> sender and receiver?
>
> Correct, in 7.x we had cranked ss_fltsz up to a really high number to
> prevent the congestion window from collapsing when the connection was
> idle. We have a bit of a unique workload in that we are using TCP to
> reliably forward a latency-sensitive datagram stream across a WAN
> connection with high bandwidth and high RTT. Most congestion control
> seems tuned to bulk transfers rather than this sort of use case. The
> solution we have settled on here is to add a new TCP socket option to
> disable idle handling, so that when an idle connection restarts it
> keeps its prior congestion window.

I'll respond to this in detail in the other thread.
> One other thing I noticed during this, which may or may not be odd, is
> that if you have a connection with TCP_NODELAY enabled and you fill
> your cwnd, and then you get an ACK back for an earlier small segment
> (less than MSS), TCP will not send out a "short" segment for the amount
> of window space released. Instead, it will wait until a full MSS of
> space is available before sending a packet. I'm not sure if that is the
> correct behaviour with TCP_NODELAY or if we should send "short"
> segments in that case.

We try fairly hard not to send runt segments irrespective of NODELAY,
but I would be happy to see that change. I'm not aware of any "correct
behaviour" we have to adhere to - I think it would be perfectly
reasonable to have a sysctl set the lowest number of bytes we'd be
willing to send a runt segment for, and then key off TCP_NODELAY as to
whether we try hard to send an MSS worth, or send as soon as we have the
minimum number of bytes worth of window available.

Cheers,
Lawrence
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"