On Wed, Apr 3, 2019 at 8:13 PM brakmo <bra...@fb.com> wrote:
>
> When a packet is dropped when calling queue_xmit in __tcp_transmit_skb
> and packets_out is 0, it is beneficial to set a small probe timer.
> Otherwise, the throughput for the flow can suffer because it may need to
> depend on the probe timer to start sending again. The default value for
> the probe timer is at least 200ms; this patch sets it to 20ms when a
> packet is dropped and there are no other packets in flight.
>
> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
> used to specify the duration of the probe timer for the case described
> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
> disables setting the probe timer with a small value.
>
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/net/netns/ipv4.h   |  1 +
>  net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c        |  1 +
>  net/ipv4/tcp_output.c      | 18 +++++++++++++++---
>  4 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 7698460a3dd1..26c52d294dea 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -166,6 +166,7 @@ struct netns_ipv4 {
>  	int sysctl_tcp_wmem[3];
>  	int sysctl_tcp_rmem[3];
>  	int sysctl_tcp_comp_sack_nr;
> +	int sysctl_tcp_probe_on_drop_ms;
>  	unsigned long sysctl_tcp_comp_sack_delay_ns;
>  	struct inet_timewait_death_row tcp_death_row;
>  	int sysctl_max_syn_backlog;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2316c08e9591..b1e4420b6d6c 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static int comp_sack_nr_max = 255;
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> +static int probe_on_drop_max = TCP_RTO_MIN;
>
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &comp_sack_nr_max,
>  	},
> +	{
> +		.procname	= "tcp_probe_on_drop_ms",
> +		.data		= &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
The tcp_reset_xmit_timer() function expects a time in jiffies, so this would
probably need to be proc_dointvec_ms_jiffies?

> +		.extra1		= &zero,
> +		.extra2		= &probe_on_drop_max,
> +	},
>  	{
>  		.procname	= "udp_rmem_min",
>  		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 3979939804b7..3df6735f0f07 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
>  	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
>  	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
>  	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
> +	net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;

The tcp_reset_xmit_timer() function expects a time in jiffies, so this would
probably need to be msecs_to_jiffies(20)?

>
>  	/* Reno is always built in */
>  	if (!net_eq(net, &init_net) &&
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e265d1aeeb66..f101e4d7c1ae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>
>  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>
> -	if (unlikely(err > 0)) {
> -		tcp_enter_cwr(sk);
> -		err = net_xmit_eval(err);
> +	if (unlikely(err)) {
> +		if (unlikely(err > 0)) {
> +			tcp_enter_cwr(sk);
> +			err = net_xmit_eval(err);
> +		}
> +		/* Packet was dropped. If there are no packets out,
> +		 * we may need to depend on probe timer to start sending
> +		 * again. Hence, use a smaller value.
> +		 */
> +		if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
> +		    sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
> +			tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
> +					     sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
> +					     TCP_RTO_MAX, NULL);
> +		}

My reading of the code is that any timer that was set here would typically be
quickly overridden by the tcp_check_probe_timer() that usually happens after
most tcp_write_xmit() calls. Am I reading that correctly?

There seems to be a philosophical difference between this patch and the
existing logic in tcp_check_probe_timer()/tcp_probe0_base():

/* Something is really bad, we could not queue an additional packet,
 * because qdisc is full or receiver sent a 0 window, or we are paced.
 * We do not want to add fuel to the fire, or abort too early,
 * so make sure the timer we arm now is at least 200ms in the future,
 * regardless of current icsk_rto value (as it could be ~2ms)
 */
static inline unsigned long tcp_probe0_base(const struct sock *sk)
{
	return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
}

...

static inline void tcp_check_probe_timer(struct sock *sk)
{
	if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
		tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
				     tcp_probe0_base(sk), TCP_RTO_MAX, NULL);
}

If we apply some version of this patch then we should probably at least update
this comment above tcp_probe0_base() to clarify the new strategy.

cheers,
neal