On 01/16/13 06:27, John Baldwin wrote:
> On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
>> Hi John,
>>
>> On 01/15/13 08:04, John Baldwin wrote:
>>> I was looking at TCP congestion control at work recently and noticed a few 
>>
>> Poor you ;)
>>
>>> "odd" things in the current code.  First, there is this chunk of code in 
>>> cc_ack_received() in tcp_input.c:
>>>
>>> static void inline
>>> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
>>> {
>>>     INP_WLOCK_ASSERT(tp->t_inpcb);
>>>
>>>     tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>>>     if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>>>             tp->ccv->flags |= CCF_CWND_LIMITED;
>>>     else
>>>             tp->ccv->flags &= ~CCF_CWND_LIMITED;
>>>
>>>
>>> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not 
>>> integers, so the call to min() results in truncation on 64-bit hosts.
>>
>> Good catch, but I don't think it matters in practice, as neither snd_cwnd
>> nor snd_wnd will grow past the 32-bit boundary.
> 
> I have a psychotic case using cc_cubic where it seems to grow without bound,
> though that is a bug in and of itself (and this change did not fix that
> issue).  I ended up not using cc_cubic (more below) and haven't been able
> to track down the root cause of the delay.  I can probably provide a test case
> to reproduce this if you are interested.

Hmm, I'd certainly be interested in hearing more about this issue with
cubic. If you think a test case is easy to come up with, please shoot it
through to me when you have the chance.

>>> It should probably be ulmin() instead.  However, this line seems to be a
>>> really obfuscated way to just write:
>>>
>>>     if (tp->snd_cwnd <= tp->snd_wnd)
>>
>> You are correct, though I'd argue the meaning of the existing code as
>> written is clearer than your suggested change.
>>
>>> If that is correct, I would vote for changing this to use the much simpler 
>>> logic.
>>
>> Agreed. While I find the existing code slightly clearer in meaning, it's
>> not significant enough to warrant keeping it as is when your suggested
>> change is simpler, fixes a bug and achieves the same thing. Happy for
>> you to change it or I can do it if you prefer.
> 
> I'll leave that to you, thanks.

Committed as r245783.
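
For reference, the committed change boils down to replacing the min()
comparison with the direct test you suggested (the rest of
cc_ack_received() is as quoted above):

	/*
	 * Comparing the two u_long values directly avoids the int
	 * truncation min() caused on 64-bit hosts, and is simpler.
	 */
	if (tp->snd_cwnd <= tp->snd_wnd)
		tp->ccv->flags |= CCF_CWND_LIMITED;
	else
		tp->ccv->flags &= ~CCF_CWND_LIMITED;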

>>> Secondly, in the particular case I was investigating at work (restart
>>> of an idle connection), the newreno congestion control code in 8.x and
>>> later uses a different algorithm than in 7.  Specifically, in 7 TCP
>>> would reuse the same logic used for an initial cwnd (honoring
>>> ss_fltsz).  In 8 this no longer happens (instead, 2 is hardcoded).  A
>>> guess at a possible fix might look something like this:
>>>
>>> Index: cc_newreno.c
>>> ===================================================================
>>> --- cc_newreno.c    (revision 243660)
>>> +++ cc_newreno.c    (working copy)
>>> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>>>     if (V_tcp_do_rfc3390)
>>>             rw = min(4 * CCV(ccv, t_maxseg),
>>>                 max(2 * CCV(ccv, t_maxseg), 4380));
>>> +#if 1
>>>     else
>>>             rw = CCV(ccv, t_maxseg) * 2;
>>> +#else
>>> +   /* XXX: This is missing a lot of stuff that used to be in 7. */
>>> +#ifdef INET6
>>> +   else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
>>> +           in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
>>> +#else
>>> +   else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
>>> +#endif
>>> +           rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
>>> +   else
>>> +           rw = V_ss_fltsz * CCV(ccv, t_maxseg);
>>> +#endif
>>>  
>>>     CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>>>  }
>>>
>>> (But using the #else clause instead of the current #if 1 code).  Was this 
>>> change in 8 intentional?
>>
>> It was. Unlike connection initialisation, which still honours ss_fltsz
>> in cc_conn_init(), restarting an idle connection based on ss_fltsz
>> seemed particularly dubious and as such was omitted from the refactored
>> code.
>>
>> The ultimate goal was to remove the ss_fltsz hack completely and
>> implement a smarter mechanism, but that hasn't quite happened yet. The
>> removal of ss_fltsz from 10.x without providing a replacement mechanism
>> is not ideal and should probably be addressed.
>>
>> I'm guessing you're not using rfc3390 because you want to override the
>> initial window based on specific local knowledge of the path between
>> sender and receiver?
> 
> Correct, in 7.x we had cranked ss_fltsz up to a really high number to prevent
> the congestion window from collapsing when the connection was idle.  We have
> a bit of a unique workload in that we are using TCP to reliably forward a
> latency-sensitive datagram stream across a WAN connection with high bandwidth
> and high RTT.  Most of congestion control seems tuned to bulk transfers rather
> than this sort of use case.  The solution we have settled on here is to add a
> new TCP socket option to disable idle handling so that when an idle connection
> restarts it keeps its prior congestion window.

I'll respond to this in detail in the other thread.
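
(As an aside for anyone following along, the socket option John
describes might look roughly like the sketch below. TCP_NOIDLE and
TF_NOIDLE are hypothetical names, not committed interfaces; the point
is just to skip the after-idle cwnd reset when the application opts
out.)

	/* In tcp_ctloutput(), alongside the TCP_NODELAY handling: */
	case TCP_NOIDLE:			/* hypothetical option */
		if (optval)
			tp->t_flags |= TF_NOIDLE;	/* hypothetical flag */
		else
			tp->t_flags &= ~TF_NOIDLE;
		break;

	/*
	 * In tcp_output(), where restart-after-idle is detected; with
	 * TF_NOIDLE set, the prior snd_cwnd survives the idle period.
	 */
	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur &&
	    !(tp->t_flags & TF_NOIDLE))
		cc_after_idle(tp);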

> One other thing I noticed during this, which may or may not be odd, is that
> if you have a connection with TCP_NODELAY enabled and you fill your cwnd and
> then you get an ACK back for an earlier small segment (less than MSS), TCP
> will not send out a "short" segment for the amount of window space released.
> Instead, it will wait until a full MSS of space is available before sending
> a packet.  I'm not sure if that is the correct behavior with TCP_NODELAY or
> if we should send "short" segments in that case.

We try fairly hard not to send runt segments irrespective of NODELAY,
but I would be happy to see that change. I'm not aware of any "correct
behaviour" we have to adhere to. I think it would be perfectly
reasonable to have a sysctl that sets the lowest number of bytes we'd
be willing to send a runt segment for, and then key off TCP_NODELAY to
decide whether we try hard to send a full MSS worth or send as soon as
we have the minimum number of bytes of window available.
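
Something like this rough, untested sketch is what I have in mind (the
sysctl name and variable are invented for illustration):

	/* Hypothetical knob: smallest runt segment we'd be willing to emit. */
	VNET_DEFINE(int, tcp_minrunt) = 0;
	#define	V_tcp_minrunt	VNET(tcp_minrunt)
	SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, minrunt, CTLFLAG_RW,
	    &VNET_NAME(tcp_minrunt), 0,
	    "Minimum window space (bytes) required to send a runt segment");

	/* In tcp_output()'s send decision, roughly: */
	if (len >= tp->t_maxseg)
		goto send;		/* a full segment is always fine */
	if ((tp->t_flags & TF_NODELAY) && V_tcp_minrunt > 0 &&
	    len >= V_tcp_minrunt)
		goto send;		/* NODELAY: accept a runt this size */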

Cheers,
Lawrence