On Wed, Aug 17, 2016 at 5:17 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Over the years, TCP BDP has increased a lot, and is typically
> on the order of ~10 Mbytes with the help of clever Congestion Control
> modules.
>
> In the presence of packet losses, TCP stores incoming packets into an out
> of order queue, and the number of skbs sitting there waiting for the
> missing packets to be received can match the BDP (~10 Mbytes).
>
> In some cases, TCP needs to make room for incoming skbs, and the current
> strategy can simply remove all skbs in the out of order queue as a last
> resort, incurring a huge penalty, both for receiver and sender.
>
> Unfortunately these 'last resort events' are quite frequent, forcing the
> sender to send all packets again, stalling the flow and wasting a lot of
> resources.
>
> This patch cleans only a part of the out of order queue in order
> to meet the memory constraints.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: C. Stephen Gun <c...@google.com>
> Cc: Van Jacobson <v...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_input.c | 47 ++++++++++++++++++++++++-----------------
>  1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 3ebf45b38bc309f448dbc4f27fe8722cefabaf19..8cd02c0b056cbc22e2e4a4fe8530b74f7bd25419 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4392,12 +4392,9 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
>                 if (tcp_prune_queue(sk) < 0)
>                         return -1;
>
> -               if (!sk_rmem_schedule(sk, skb, size)) {
> +               while (!sk_rmem_schedule(sk, skb, size)) {
>                         if (!tcp_prune_ofo_queue(sk))
>                                 return -1;
> -
> -                       if (!sk_rmem_schedule(sk, skb, size))
> -                               return -1;
>                 }
>         }
>         return 0;
> @@ -4874,29 +4871,41 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
>  }
>
>  /*
> - * Purge the out-of-order queue.
> - * Return true if queue was pruned.
> + * Clean the out-of-order queue to make room.
> + * We drop high sequences packets to :
> + * 1) Let a chance for holes to be filled.
> + * 2) not add too big latencies if thousands of packets sit there.
> + *    (But if application shrinks SO_RCVBUF, we could still end up
> + *     freeing whole queue here)
> + *
> + * Return true if queue has shrunk.
>   */
>  static bool tcp_prune_ofo_queue(struct sock *sk)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
> -       bool res = false;
> +       struct sk_buff *skb;
>
> -       if (!skb_queue_empty(&tp->out_of_order_queue)) {
> -               NET_INC_STATS(sock_net(sk), LINUX_MIB_OFOPRUNED);
> -               __skb_queue_purge(&tp->out_of_order_queue);
> +       if (skb_queue_empty(&tp->out_of_order_queue))
> +               return false;
>
> -               /* Reset SACK state. A conforming SACK implementation will
> -                * do the same at a timeout based retransmit. When a connection
> -                * is in a sad state like this, we care only about integrity
> -                * of the connection not performance.
> -                */
> -               if (tp->rx_opt.sack_ok)
> -                       tcp_sack_reset(&tp->rx_opt);
> +       NET_INC_STATS(sock_net(sk), LINUX_MIB_OFOPRUNED);
> +
> +       while ((skb = __skb_dequeue_tail(&tp->out_of_order_queue)) != NULL) {
> +               tcp_drop(sk, skb);
>                 sk_mem_reclaim(sk);
> -               res = true;
> +               if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
> +                   !tcp_under_memory_pressure(sk))
> +                       break;
>         }
> -       return res;
> +
> +       /* Reset SACK state. A conforming SACK implementation will
> +        * do the same at a timeout based retransmit. When a connection
> +        * is in a sad state like this, we care only about integrity
> +        * of the connection not performance.
> +        */
> +       if (tp->rx_opt.sack_ok)
> +               tcp_sack_reset(&tp->rx_opt);
> +       return true;
>  }
>
>  /* Reduce allocated memory if we can, trying to get

Very nice patch, Eric! Thanks.
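
For readers following along outside the kernel tree, here is a minimal user-space sketch of the tail-pruning idea the patch implements. All names here (struct pkt, prune_ofo_tail, the counters) are illustrative stand-ins rather than kernel structures, and the global memory-pressure check is left out; only the "drop from the tail until we are back under the receive-buffer budget" loop is modeled.

/*
 * Illustrative sketch (not kernel code): drop packets from the tail of
 * an out-of-order queue, i.e. the highest sequence numbers first, and
 * stop as soon as memory usage is back under budget, instead of purging
 * the whole queue.
 */
#include <stdio.h>
#include <stdlib.h>

struct pkt {
        size_t truesize;        /* memory charged for this packet */
        struct pkt *prev;       /* toward the head (lower sequence numbers) */
};

/* Drop from the tail until rmem_alloc fits within rcvbuf; return nonzero
 * if anything was freed.  The queue may well be left non-empty: that is
 * the point of the partial prune.
 */
static int prune_ofo_tail(struct pkt **tail, size_t *rmem_alloc, size_t rcvbuf)
{
        int freed = 0;

        while (*tail) {
                struct pkt *victim = *tail;

                *tail = victim->prev;
                *rmem_alloc -= victim->truesize;
                free(victim);
                freed = 1;

                if (*rmem_alloc <= rcvbuf)      /* enough room recovered */
                        break;
        }
        return freed;
}

int main(void)
{
        size_t rcvbuf = 4000, rmem_alloc = 0;
        struct pkt *tail = NULL;

        /* Build a small queue: 6 packets of 1000 bytes each. */
        for (int i = 0; i < 6; i++) {
                struct pkt *p = malloc(sizeof(*p));

                if (!p)
                        return 1;
                p->truesize = 1000;
                p->prev = tail;
                tail = p;
                rmem_alloc += p->truesize;
        }

        prune_ofo_tail(&tail, &rmem_alloc, rcvbuf);
        printf("rmem_alloc after prune: %zu (budget %zu)\n",
               rmem_alloc, rcvbuf);

        /* Free what is left of the queue. */
        while (tail) {
                struct pkt *p = tail;

                tail = p->prev;
                free(p);
        }
        return 0;
}

Dropping from the tail discards the highest sequence numbers first, so the data closest to the hole at rcv_nxt is kept and the hole retains its best chance of being filled by a retransmit, as the new comment in tcp_prune_ofo_queue() explains.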