From: Simon Baatz <[email protected]>

On connection initiation, window_clamp is limited to the maximum value
representable for the connection's window scale factor. However,
window_clamp may be changed later when:

- it needs to be adjusted due to scaling_ratio changes
- the receive buffer grows due to autotuning
- the TCP_WINDOW_CLAMP socket option is set

In all cases, window_clamp must not end up higher than the maximum
representable advertised window. Thus, if the TCP connection state
indicates that we can rely on rx_opt.rcv_wscale, clamp the new
window_clamp to the maximum window for that scaling factor (including
the "no window scaling" case where rcv_wscale is zero).

This has visible consequences for calculations based on rcv_wnd. For
example, the logic in __tcp_ack_snd_check() uses the advance of the
right edge of the receive window to determine when to send an immediate
ACK. If rcv_wnd does not properly reflect the "on the wire" advertised
window (i.e. it is much higher than the maximum representable value),
this calculation will be wrong and ACKs may be delayed when they should
be sent immediately.

One concrete example is when the TCP receive buffer is much larger than
64KB, but no window scaling is used. If window_clamp (and thus rcv_wnd)
is not limited to 65535, the "internal" window based on rcv_wnd can
extend far beyond the 16-bit window actually advertised on the wire.
After receiving a data segment, the right edge of the "on the wire"
window can be moved (as there is plenty of space in rcv_wnd) and an
immediate ACK should be sent. However, no immediate ACK is sent if the
calculation based on rcv_wnd does not happen to move the right edge of
the "internal" window.
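
For illustration only (not part of the patch): a minimal user-space
sketch of the clamp described above. The helper names
max_advertisable_window() and clamp_window_clamp() are hypothetical and
do not exist in the kernel; the sketch assumes the 16-bit window field
and a window scale shift of at most 14.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers only; these names are not kernel functions. */

/*
 * Largest window that can be advertised on the wire: the 16-bit window
 * field shifted left by the receive window scale (0 when window scaling
 * is not in use).
 */
static uint32_t max_advertisable_window(unsigned int rcv_wscale)
{
	assert(rcv_wscale <= 14);	/* maximum shift for TCP window scaling */
	return (uint32_t)UINT16_MAX << rcv_wscale;
}

/* Limit a proposed window_clamp to what the scale factor can represent. */
static uint32_t clamp_window_clamp(uint32_t window_clamp, unsigned int rcv_wscale)
{
	uint32_t max_win = max_advertisable_window(rcv_wscale);

	return window_clamp < max_win ? window_clamp : max_win;
}
```

With no window scaling (rcv_wscale == 0), a proposed clamp of 1 MB is
reduced to 65535; with rcv_wscale == 7, the same value is left
unchanged because it is below 65535 << 7.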

Signed-off-by: Simon Baatz <[email protected]>
---
 net/ipv4/tcp.c       | 4 ++++
 net/ipv4/tcp_input.c | 6 ++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e57eaffc007a0..bd03c99f793ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3802,6 +3802,10 @@ int tcp_set_window_clamp(struct sock *sk, int val)
 	old_window_clamp = tp->window_clamp;
 	new_window_clamp = max_t(int, SOCK_MIN_RCVBUF / 2, val);
 
+	if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT |
+				   TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2))
+		new_window_clamp = min_t(u32, U16_MAX << tp->rx_opt.rcv_wscale, new_window_clamp);
+
 	if (new_window_clamp == old_window_clamp)
 		return 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 505884dcb7a2b..6e9123c98152f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -914,6 +914,7 @@ void tcp_rcvbuf_grow(struct sock *sk, u32 newval)
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 rcvwin, rcvbuf, cap, oldval;
 	u32 rtt_threshold, rtt_us;
+	u32 window_clamp;
 	u64 grow;
 
 	oldval = tp->rcvq_space.space;
@@ -949,8 +950,9 @@ void tcp_rcvbuf_grow(struct sock *sk, u32 newval)
 	if (rcvbuf > sk->sk_rcvbuf) {
 		WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
 		/* Make the window clamp follow along. */
-		WRITE_ONCE(tp->window_clamp,
-			   tcp_win_from_space(sk, rcvbuf));
+		window_clamp = tcp_win_from_space(sk, rcvbuf);
+		window_clamp = min_t(u32, U16_MAX << tp->rx_opt.rcv_wscale, window_clamp);
+		WRITE_ONCE(tp->window_clamp, window_clamp);
 	}
 }
 /*
-- 
2.53.0

