Yes, I am convinced :-)

Thanks to Eric, Neal and Yuchung for their help.

-- Enke

On Wed, Jan 13, 2021 at 01:20:55PM -0800, Yuchung Cheng wrote:
> On Wed, Jan 13, 2021 at 12:49 PM Eric Dumazet <eduma...@google.com> wrote:
> >
> > On Wed, Jan 13, 2021 at 9:12 PM Enke Chen <enkechen2...@gmail.com> wrote:
> > >
> > > From: Enke Chen <enc...@paloaltonetworks.com>
> > >
> > > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > > remain untransmitted due to zero window.
> > >
> > > The number of unanswered zero-window probes (tcp_probes_out) is
> > > reset to zero with incoming acks irrespective of the window size,
> > > as described in tcp_probe_timer():
> > >
> > >     RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > >     as long as the receiver continues to respond probes. We support
> > >     this by default and reset icsk_probes_out with incoming ACKs.
> > >
> > > This counter, however, is the wrong one to be used in calculating the
> > > duration that the window remains closed and data remain untransmitted.
> > > Thanks to Jonathan Maxwell <jmaxwel...@gmail.com> for diagnosing the
> > > actual issue.
> > >
> > > In this patch a separate counter is introduced to track the number of
> > > zero-window probes that are not answered with any non-zero window ack.
> > > This new counter is used in determining when to abort the session with
> > > TCP_USER_TIMEOUT.
> > >
> >
> > I think one possible issue would be that local congestion (full qdisc)
> > would abort early, because tcp_model_timeout() assumes linear backoff.
>
> Yes exactly. if ZWPs are dropped due to local congestion, the
> model_timeout computes incorrectly. Therefore having a starting
> timestamp is the surest way b/c it does not assume any specific
> backoff behavior.
>
> >
> > Neal or Yuchung can further comment on that, it is late for me in France.
> >
> > packetdrill test would be :
> >
> >    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >   +0 bind(3, ..., ...) = 0
> >   +0 listen(3, 1) = 0
> >
> >
> >   +0 < S 0:0(0) win 0 <mss 1460>
> >   +0 > S. 0:0(0) ack 1 <mss 1460>
> >
> >  +.1 < . 1:1(0) ack 1 win 65530
> >   +0 accept(3, ..., ...) = 4
> >
> >   +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> >   +0 write(4, ..., 24) = 24
> >   +0 > P. 1:25(24) ack 1
> >  +.1 < . 1:1(0) ack 25 win 65530
> >   +0 %{ assert tcpi_probes == 0, tcpi_probes; \
> >         assert tcpi_backoff == 0, tcpi_backoff }%
> >
> > // install a qdisc dropping all packets
> >   +0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev tun0 root pfifo limit 0`
> >   +0 write(4, ..., 24) = 24
> > // When qdisc is congested we retry every 500ms therefore in theory
> > // we'd retry 6 times before hitting 3s timeout. However, since we
> > // estimate the elapsed time based on exp backoff of actual RTO (300ms),
> > // we'd bail earlier with only 3 probes.
> > +2.1 write(4, ..., 24) = -1
> >   +0 %{ assert tcpi_probes == 3, tcpi_probes; \
> >         assert tcpi_backoff == 0, tcpi_backoff }%
> >   +0 close(4) = 0
> >
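
For reference, the arithmetic behind the "6 probes in theory, 3 in practice" comment in the test can be sketched in a few lines of Python. This is only an illustration, not kernel code: model_timeout_ms() is a simplified reading of what tcp_model_timeout() computes, the 120 s RTO ceiling is an assumption, and both probes_at_abort_*() helpers are hypothetical names. The 300 ms base RTO, the 500 ms retry interval and the 3000 ms user timeout come from the test above.

```python
TCP_RTO_MAX_MS = 120_000  # assumption: 120 s cap on the retransmission timer


def model_timeout_ms(boundary, rto_base_ms, rto_max_ms=TCP_RTO_MAX_MS):
    """Time that `boundary` timer firings are assumed to span if the
    interval doubles each time (simplified reading of tcp_model_timeout())."""
    # Floor of log2(rto_max / rto_base): past this point the interval is capped.
    thresh = (rto_max_ms // rto_base_ms).bit_length() - 1
    if boundary <= thresh:
        return ((2 << boundary) - 1) * rto_base_ms
    return ((2 << thresh) - 1) * rto_base_ms + (boundary - thresh) * rto_max_ms


def probes_at_abort_modeled(user_timeout_ms, rto_base_ms):
    """Hypothetical helper: smallest probe count whose modeled elapsed
    time reaches the user timeout."""
    probes = 0
    while model_timeout_ms(probes, rto_base_ms) < user_timeout_ms:
        probes += 1
    return probes


def probes_at_abort_timestamp(user_timeout_ms, retry_interval_ms):
    """Hypothetical helper: probe count when real elapsed time
    (now - start timestamp) reaches the user timeout."""
    return -(-user_timeout_ms // retry_interval_ms)  # ceiling division


if __name__ == "__main__":
    print(model_timeout_ms(2, 300))              # 2100, still below 3000
    print(model_timeout_ms(3, 300))              # 4500, already past 3000
    print(probes_at_abort_modeled(3000, 300))    # 3
    print(probes_at_abort_timestamp(3000, 500))  # 6
```

With a full qdisc the probes actually go out every 500 ms, so three probes cover far less than the 4500 ms the model credits them with. A start timestamp sidesteps this because it measures elapsed time directly instead of inferring it from the probe count.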