Hi Simon,

It all makes sense to me at a quick look. I have just some nits and one
more substantial worry, below:

On Fri, 20 Feb 2026 00:55:14 +0100
Simon Baatz via B4 Relay <[email protected]> wrote:

> From: Simon Baatz <[email protected]>
>
> By default, the Linux TCP implementation does not shrink the
> advertised window (RFC 7323 calls this "window retraction") with the
> following exceptions:
>
> - When an incoming segment cannot be added due to the receive buffer
>   running out of memory. Since commit 8c670bdfa58e ("tcp: correct
>   handling of extreme memory squeeze") a zero window will be
>   advertised in this case. It turns out that reaching the required
>   "memory pressure" is very easy when window scaling is in use. In the
>   simplest case, sending a sufficient number of segments smaller than
>   the scale factor to a receiver that does not read data is enough.
>
>   Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
>   happens much earlier than before, leading to regressions (the test
>   suite of the Valkey project does not pass because of a TCP
>   connection that is no longer bi-directional).

Ouch. By the way, that same commit helped us unveil an issue (at least
in the sense of RFC 9293, 3.8.6) we fixed in passt:

  https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3

> - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
>   allowing the tcp window to shrink") addressed the "eating memory"
>   problem by introducing a sysctl knob that allows shrinking the
>   window before running out of memory.
>
> However, RFC 7323 does not only state that shrinking the window is
> necessary in some cases, it also formulates requirements for TCP
> implementations when doing so (Section 2.4).
>
> This commit addresses the receiver-side requirements: After retracting
> the window, the peer may have a snd_nxt that lies within a previously
> advertised window but is now beyond the retracted window.
> This means
> that all incoming segments (including pure ACKs) will be rejected
> until the application happens to read enough data to let the peer's
> snd_nxt be in window again (which may be never).
>
> To comply with RFC 7323, the receiver MUST honor any segment that
> would have been in window for any ACK sent by the receiver and, when
> window scaling is in effect, SHOULD track the maximum window sequence
> number it has advertised. This patch tracks that maximum window
> sequence number throughout the connection and uses it in
> tcp_sequence() when deciding whether a segment is acceptable.
> Acceptability of data is not changed.
>
> Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> Signed-off-by: Simon Baatz <[email protected]>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst         |  1 +
>  include/linux/tcp.h                                          |  1 +
>  include/net/tcp.h                                            | 14 ++++++++++++++
>  net/ipv4/tcp_fastopen.c                                      |  1 +
>  net/ipv4/tcp_input.c                                         |  6 ++++--
>  net/ipv4/tcp_minisocks.c                                     |  1 +
>  net/ipv4/tcp_output.c                                        | 12 ++++++++++++
>  .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt         |  2 +-
>  8 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
>  u32 rate_delivered read_mostly tcp_rate_gen
>  u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
>  u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> +u32 rcv_mwnd_seq read_write tcp_select_window
>  u32 write_seq read_write
> tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
>  u32 notsent_lowat read_mostly tcp_stream_memory_free
>  u32 pushed_seq read_write tcp_mark_push,forced_push
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -271,6 +271,7 @@ struct tcp_sock {
>  	u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>  	u32 mdev_us; /* medium deviation */
>  	u32 rtt_seq; /* sequence number to update rttvar */
> +	u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */

Nit: tab between ; and /* for consistency (I would personally prefer
the comment style as you see on 'highest_sack' but I don't think it's
enforced anymore).

Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
here because the relevant paragraph there covers a very specific case
of window retraction, caused by quantisation error from window scaling,
which is not the most common case here. I couldn't quickly find a
better reference though.

More importantly: do we need to restore this on a connection that's
being dumped and recreated using TCP_REPAIR, or will things still work
(even though sub-optimally) if we lose this value?

Other window values that *need* to be dumped and restored are currently
available via the TCP_REPAIR_WINDOW socket option, and they are listed
in do_tcp_getsockopt(), net/ipv4/tcp.c:

	opt.snd_wl1 = tp->snd_wl1;
	opt.snd_wnd = tp->snd_wnd;
	opt.max_window = tp->max_window;
	opt.rcv_wnd = tp->rcv_wnd;
	opt.rcv_wup = tp->rcv_wup;

CRIU uses it to checkpoint and restore established connections, and
passt uses it to migrate them to a different host:

  https://criu.org/TCP_connection

  https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063

If it's strictly needed to preserve functionality, we would need to add
it to struct tcp_repair_window, notify CRIU maintainers (or send them a
patch), and add this in passt as well (I can take care of it).

Strictly speaking, in that case, this could be considered a breaking
change for userspace, but I don't see how to avoid it, so I'd just make
sure it doesn't impact users as TCP_REPAIR has just a couple of
(known!) projects relying on it.

An alternative would be to have a special, initial value representing
the fact that this value was lost, but it looks really annoying to not
be able to use a u32 for it.

Disregard all this if the correct value is not strictly needed for
functionality, of course. I haven't tested things (not yet, at least).

>  	u64 tcp_wstamp_ns; /* departure time for next sent data packet */
>  	u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
>  	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
>  	return (u32) win;
>  }
>
> +/* Compute the maximum receive window we ever advertised.
> + * Rcv_nxt can be after the window if our peer push more data

s/push/pushes/

s/Rcv_nxt/rcv_nxt/ (useful for grepping)

> + * than the offered window.
> + */
> +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> +{
> +	s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> +
> +	if (win < 0)
> +		win = 0;

I must be missing something but... if the sequence is about to wrap,
we'll return 0 here. Is that intended?

Doing the subtraction unsigned would have looked more natural to me,
but I didn't really think it through.

> +	return (u32) win;

Kernel coding style doesn't usually include a space between cast and
identifier.

> +}
> +
> +
>  /* Choose a new window, without checks for shrinking, and without
>   * scaling applied to the result. The caller does these things
>   * if necessary. This is a "raw" window selection.
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index b30090cff3cf7d925dc46694860abd3ca5516d70..f034ef6e3e7b54bf73c77fd2bf1d3090c75dbfc6 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -377,6 +377,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
>
>  	tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
>  	tp->rcv_wup = tp->rcv_nxt;
> +	tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>  	/* tcp_conn_request() is sending the SYNACK,
>  	 * and queues the child into listener accept queue.
>  	 */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e7b41abb82aad33d8cab4fcfa989cc4771149b41..af9dd51256b01fd31d9e390d69dcb1d1700daf1b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4865,8 +4865,8 @@ static enum skb_drop_reason tcp_sequence(const struct sock *sk,
>  	if (before(end_seq, tp->rcv_wup))
>  		return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
>
> -	if (after(end_seq, tp->rcv_nxt + tcp_receive_window(tp))) {
> -		if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> +	if (after(end_seq, tp->rcv_nxt + tcp_max_receive_window(tp))) {
> +		if (after(seq, tp->rcv_nxt + tcp_max_receive_window(tp)))
>  			return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>
>  		/* Only accept this packet if receive queue is empty. */
> @@ -6959,6 +6959,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>  		 */
>  		WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
>  		tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> +		tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
>  		/* RFC1323: The window in SYN & SYN/ACK segments is
>  		 * never scaled.
> @@ -7071,6 +7072,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>  		WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
>  		WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
>  		tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> +		tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
>  		/* RFC1323: The window in SYN & SYN/ACK segments is
>  		 * never scaled.
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index ec128865f5c029c971eb00cb9ee058b742efafd1..df95d8b6dce5c746e5e34545aa75a96080cc752d 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -604,6 +604,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>  	newtp->window_clamp = req->rsk_window_clamp;
>  	newtp->rcv_ssthresh = req->rsk_rcv_wnd;
>  	newtp->rcv_wnd = req->rsk_rcv_wnd;
> +	newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
>  	newtp->rx_opt.wscale_ok = ireq->wscale_ok;
>  	if (newtp->rx_opt.wscale_ok) {
>  		newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118d02fc396753d56f210f9d3007c7f..50774443f6ae0ca83f360c7fc3239184a1523e1b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -274,6 +274,15 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
>  }
>  EXPORT_IPV6_MOD(tcp_select_initial_window);
>
> +/* Check if we need to update the maximum window sequence number */
> +static inline void tcp_update_max_wnd_seq(struct tcp_sock *tp)
> +{
> +	u32 wre = tp->rcv_wup + tp->rcv_wnd;
> +
> +	if (after(wre, tp->rcv_mwnd_seq))
> +		tp->rcv_mwnd_seq = wre;
> +}
> +
>  /* Chose a new window to advertise, update state in tcp_sock for the
>   * socket, and return result with RFC1323 scaling applied. The return
>   * value can be stuffed directly into th->window for an outgoing
> @@ -293,6 +302,7 @@ static u16 tcp_select_window(struct sock *sk)
>  		tp->pred_flags = 0;
>  		tp->rcv_wnd = 0;
>  		tp->rcv_wup = tp->rcv_nxt;
> +		tcp_update_max_wnd_seq(tp);
>  		return 0;
>  	}
>
> @@ -316,6 +326,7 @@ static u16 tcp_select_window(struct sock *sk)
>
>  	tp->rcv_wnd = new_win;
>  	tp->rcv_wup = tp->rcv_nxt;
> +	tcp_update_max_wnd_seq(tp);
>
>  	/* Make sure we do not exceed the maximum possible
>  	 * scaled window.
> @@ -4169,6 +4180,7 @@ static void tcp_connect_init(struct sock *sk)
>  	else
>  		tp->rcv_tstamp = tcp_jiffies32;
>  	tp->rcv_wup = tp->rcv_nxt;
> +	tp->rcv_mwnd_seq = tp->rcv_nxt + tp->rcv_wnd;
>  	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
>
>  	inet_csk(sk)->icsk_rto = tcp_timeout_init(sk);
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> index 3848b419e68c3fc895ad736d06373fc32f3691c1..1a86ee5093696deb316c532ca8f7de2bbf5cd8ea 100644
> --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> @@ -36,7 +36,7 @@
>
>  +0 read(4, ..., 100000) = 4000
>
> -// If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
> +// If queue is empty, accept a packet even if its end_seq is above rcv_mwnd_seq
>  +0 < P. 4001:54001(50000) ack 1 win 257
>  +0 > . 1:1(0) ack 54001 win 0
>

--
Stefano
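
On the wrap question for tcp_max_receive_window(): a minimal userspace
sketch, not the kernel code, assuming u32 sequence numbers and the same
s32 difference as in the quoted hunk. The names max_receive_window()
and wrap_examples() and the sample values are made up for illustration.
Under these assumptions the signed difference stays correct across a
wrap of the sequence space, as long as the two values are less than
2^31 apart, and it clamps to 0 only when rcv_nxt is past rcv_mwnd_seq:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the quoted helper: same arithmetic,
 * plain integer arguments instead of struct tcp_sock.
 */
static uint32_t max_receive_window(uint32_t rcv_mwnd_seq, uint32_t rcv_nxt)
{
	/* The unsigned subtraction wraps modulo 2^32. Reading the result as
	 * a signed 32-bit value gives the signed distance between the two
	 * sequence numbers (the out-of-range conversion is
	 * implementation-defined in ISO C, two's complement in practice).
	 */
	int32_t win = (int32_t)(rcv_mwnd_seq - rcv_nxt);

	if (win < 0)
		win = 0;
	return (uint32_t)win;
}

/* Illustrative values only. */
static void wrap_examples(void)
{
	/* rcv_nxt just before the wrap, rcv_mwnd_seq just after it:
	 * 0x100 - 0xffffff00 = 0x200 modulo 2^32, so the window is 512.
	 */
	assert(max_receive_window(0x00000100u, 0xffffff00u) == 512u);

	/* rcv_nxt already past rcv_mwnd_seq: negative distance, clamped. */
	assert(max_receive_window(0xffffff00u, 0x00000100u) == 0u);

	/* No wrap involved. */
	assert(max_receive_window(5000u, 1000u) == 4000u);
}
```

A purely unsigned subtraction would give the same 512 and 4000, but it
would turn the second case into a window of 0xfffffe00 bytes instead of
0, which is presumably what the clamp on the signed value is there to
avoid.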
