On Mon, Jun 13, 2016 at 1:45 PM, Daniel Metz <dm...@mytum.de> wrote: > This patch adjusts Linux RTO calculation to be RFC6298 Standard > compliant. MinRTO is no longer added to the computed RTO, RTO damping > and overestimation are decreased. > > In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the > calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is > less. The Linux implementation as a discrepancy to the Standard > basically adds the defined MinRTO to the calculated RTO. When > comparing both approaches, the Linux calculation seems to perform > worse for sender limited TCP flows like Telnet, SSH or constant bit > rate encoded transmissions, especially for Round Trip Times (RTT) of > 50ms to 800ms. > > Compared to the Linux implementation the RFC 6298 proposed RTO > calculation performs better and more precise in adapting to current > network characteristics. Extensive measurements for bulk data did not > show a negative impact of the adjusted calculation. > > Exemplary performance comparison for sender-limited-flows: > > - Rate: 10Mbit/s > - Delay: 200ms, Delay Variation: 10ms > - Time between each scheduled segment: 1s > - Amount Data Segments: 300 > - Mean of 11 runs > > Mean Response Waiting Time [milliseconds] > > PER [%] | 0.5 1 1.5 2 3 5 7 10 > --------+------------------------------------------------------- > old | 206.4 208.6 218.0 218.6 227.2 249.3 274.7 308.2 > new | 203.9 206.0 207.0 209.9 217.3 225.6 238.7 259.1 > > > Detailed Analysis: > > https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/ Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
Since this is not a small change, and your patch is only tested on emulation-based testbed AFAICT, I'd like to try your patch on Google servers to get more data. But this would take a few days to setup & collect. Note that this paper https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has detailed rationale of current design (section 4). IMO having a "tight" RTO is less necessary now after TLP. I am also testing a new set of patches to install a quick reordering timer. But it's worth mentioning the paper in the commit message. > > > Signed-off-by: Hagen Paul Pfeifer <ha...@jauu.net> > Signed-off-by: Daniel Metz <dm...@mytum.de> > Cc: Eric Dumazet <eduma...@google.com> > Cc: Yuchung Cheng <ych...@google.com> > Cc: Neal Cardwell <ncardw...@google.com> > --- > > We removed outdated comments in the code, though more cleanup may required. > > > net/ipv4/tcp_input.c | 74 > ++++++++++++---------------------------------------- > 1 file changed, 17 insertions(+), 57 deletions(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index d6c8f4cd0..a0f66f8 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct > sk_buff *skb) > /* Called to compute a smoothed rtt estimate. The data fed to this > * routine either comes from timestamps, or from segments that were > * known _not_ to have been retransmitted [see Karn/Partridge > - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88 > - * piece by Van Jacobson. > + * Proceedings SIGCOMM 87]. > * NOTE: the next three routines used to be one big routine. > * To save cycles in the RFC 1323 implementation it was better to break > * it up into three procedures. -- erics > @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long > mrtt_us) > long m = mrtt_us; /* RTT */ > u32 srtt = tp->srtt_us; > > - /* The following amusing code comes from Jacobson's > - * article in SIGCOMM '88. Note that rtt and mdev > - * are scaled versions of rtt and mean deviation. > - * This is designed to be as fast as possible > - * m stands for "measurement". > - * > - * On a 1990 paper the rto value is changed to: > - * RTO = rtt + 4 * mdev > - * > - * Funny. This algorithm seems to be very broken. > - * These formulae increase RTO, when it should be decreased, increase > - * too slowly, when it should be increased quickly, decrease too > quickly > - * etc. I guess in BSD RTO takes ONE value, so that it is absolutely > - * does not matter how to _calculate_ it. Seems, it was trap > - * that VJ failed to avoid. 8) > - */ > if (srtt != 0) { > - m -= (srtt >> 3); /* m is now error in rtt est */ > - srtt += m; /* rtt = 7/8 rtt + 1/8 new */ > - if (m < 0) { > - m = -m; /* m is now abs(error) */ > - m -= (tp->mdev_us >> 2); /* similar update on mdev > */ > - /* This is similar to one of Eifel findings. > - * Eifel blocks mdev updates when rtt decreases. > - * This solution is a bit different: we use finer gain > - * for mdev in this case (alpha*beta). > - * Like Eifel it also prevents growth of rto, > - * but also it limits too fast rto decreases, > - * happening in pure Eifel. > - */ > - if (m > 0) > - m >>= 3; > - } else { > - m -= (tp->mdev_us >> 2); /* similar update on mdev > */ > - } > - tp->mdev_us += m; /* mdev = 3/4 mdev + 1/4 new > */ > - if (tp->mdev_us > tp->mdev_max_us) { > - tp->mdev_max_us = tp->mdev_us; > - if (tp->mdev_max_us > tp->rttvar_us) > - tp->rttvar_us = tp->mdev_max_us; > - } > - if (after(tp->snd_una, tp->rtt_seq)) { > - if (tp->mdev_max_us < tp->rttvar_us) > - tp->rttvar_us -= (tp->rttvar_us - > tp->mdev_max_us) >> 2; > + m -= (srtt >> 3); /* m' = m - srtt / 8 = (R' - SRTT) */ > + srtt += m; /* srtt = srtt + m’ = srtt + m - > srtt / 8 */ > + if (m < 0) > + m = -m; > + m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */ > + tp->mdev_us += m; > + tp->rttvar_us = tp->mdev_us; > + if (after(tp->snd_una, tp->rtt_seq)) > tp->rtt_seq = tp->snd_nxt; > - tp->mdev_max_us = tcp_rto_min_us(sk); > - } > } else { > /* no previous measure. */ > - srtt = m << 3; /* take the measured time to be rtt */ > - tp->mdev_us = m << 1; /* make sure rto = 3*rtt */ > - tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk)); > - tp->mdev_max_us = tp->rttvar_us; > + srtt = m << 3; > + tp->mdev_us = m << 1; > + tp->rttvar_us = tp->mdev_us; > tp->rtt_seq = tp->snd_nxt; > } > tp->srtt_us = max(1U, srtt); > @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk) > */ > static void tcp_set_rto(struct sock *sk) > { > + const u32 min_rto = tcp_rto_min_us(sk); > const struct tcp_sock *tp = tcp_sk(sk); > /* Old crap is replaced with new one. 8) > * > @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk) > * is invisible. Actually, Linux-2.4 also generates erratic > * ACKs in some circumstances. > */ > - inet_csk(sk)->icsk_rto = __tcp_set_rto(tp); > - > + if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto) > + inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto); > + else > + inet_csk(sk)->icsk_rto = __tcp_set_rto(tp); > /* 2. Fixups made earlier cannot be right. > * If we do not estimate RTO correctly without them, > * all the algo is pure shit and should be replaced > * with correct one. It is exactly, which we pretend to do. > */ > - > - /* NOTE: clamping at TCP_RTO_MIN is not required, current algo > - * guarantees that rto is higher. > - */ > tcp_bound_rto(sk); > } > > -- > 2.8.3 >