On Tue, 24 Feb 2026 19:07:45 +0100
Simon Baatz <[email protected]> wrote:
> Hi Stefano,
> 
> On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > Hi Simon,
> > 
> > It all makes sense to me at a quick look, I have just some nits and one
> > more substantial worry, below:
> > 
> > On Fri, 20 Feb 2026 00:55:14 +0100
> > Simon Baatz via B4 Relay <[email protected]> wrote:
> > 
> > > From: Simon Baatz <[email protected]>
> > > 
> > > By default, the Linux TCP implementation does not shrink the
> > > advertised window (RFC 7323 calls this "window retraction") with the
> > > following exceptions:
> > > 
> > > - When an incoming segment cannot be added due to the receive buffer
> > >   running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > >   handling of extreme memory squeeze") a zero window will be
> > >   advertised in this case. It turns out that reaching the required
> > >   "memory pressure" is very easy when window scaling is in use. In the
> > >   simplest case, sending a sufficient number of segments smaller than
> > >   the scale factor to a receiver that does not read data is enough.
> > > 
> > >   Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > >   happens much earlier than before, leading to regressions (the test
> > >   suite of the Valkey project does not pass because of a TCP
> > >   connection that is no longer bi-directional).
> > 
> > Ouch. By the way, that same commit helped us unveil an issue (at least
> > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> > 
> > https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> 
> This looks concerning: It seems as if just filling the advertised
> window triggered the out of memory condition(?).

Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory). Together
with commit e2142825c120 ("net: tcp: send zero-window ACK when no
memory"), we will even get zero-window updates in that case.
Jon raised the issue here:

https://lore.kernel.org/r/[email protected]/

but it was not really fixed. Anyway:

> Am I right in
> assuming that this happened with the original 1d2fbaad7cd8, not the
> relaxed version of tcp_can_ingest() from f017c1f768b?

...you're right. I wasn't even aware of f017c1f768b, thanks for
pointing that out. That seems to make things saner, and I don't expect
further issues at this point.

Speaking of which, passt struggled talking to applications entirely
written in the 21st century. That's socat, which I think started in
2001, being used in Podman tests, and its only SO_RCVBUF-related fault
is that it uses the default 208 KiB value (from rmem_default) as a
starting value by... not doing anything.

Applications can set SO_RCVBUF and SO_SNDBUF to bigger values
(depending on rmem_max and wmem_max), but if they do, automatic tuning
of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is
disabled. We used to do that in passt itself, and I eventually dropped
it here:

https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06

because we might really need automatic tuning and the resulting big
buffers for high-latency, high-throughput connections.

> > > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> > >   allowing the tcp window to shrink") addressed the "eating memory"
> > >   problem by introducing a sysctl knob that allows shrinking the
> > >   window before running out of memory.
> > > 
> > > However, RFC 7323 does not only state that shrinking the window is
> > > necessary in some cases, it also formulates requirements for TCP
> > > implementations when doing so (Section 2.4).
> > > 
> > > This commit addresses the receiver-side requirements: After retracting
> > > the window, the peer may have a snd_nxt that lies within a previously
> > > advertised window but is now beyond the retracted window.
> > > This means that all incoming segments (including pure ACKs) will
> > > be rejected until the application happens to read enough data to
> > > let the peer's snd_nxt be in window again (which may be never).
> > > 
> > > To comply with RFC 7323, the receiver MUST honor any segment that
> > > would have been in window for any ACK sent by the receiver and, when
> > > window scaling is in effect, SHOULD track the maximum window sequence
> > > number it has advertised. This patch tracks that maximum window
> > > sequence number throughout the connection and uses it in
> > > tcp_sequence() when deciding whether a segment is acceptable.
> > > Acceptability of data is not changed.
> > > 
> > > Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> > > Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> > > Signed-off-by: Simon Baatz <[email protected]>
> > > ---
> > >  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
> > >  include/linux/tcp.h                                  |  1 +
> > >  include/net/tcp.h                                    | 14 ++++++++++++++
> > >  net/ipv4/tcp_fastopen.c                              |  1 +
> > >  net/ipv4/tcp_input.c                                 |  6 ++++--
> > >  net/ipv4/tcp_minisocks.c                             |  1 +
> > >  net/ipv4/tcp_output.c                                | 12 ++++++++++++
> > >  .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt |  2 +-
> > >  8 files changed, 35 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> > > --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> > > +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> > >  u32 rate_delivered read_mostly tcp_rate_gen
> > >  u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> > >  u32 rcv_wnd read_write
> > >  read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> > > +u32 rcv_mwnd_seq read_write tcp_select_window
> > >  u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> > >  u32 notsent_lowat read_mostly tcp_stream_memory_free
> > >  u32 pushed_seq read_write tcp_mark_push,forced_push
> > > 
> > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > > index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> > > --- a/include/linux/tcp.h
> > > +++ b/include/linux/tcp.h
> > > @@ -271,6 +271,7 @@ struct tcp_sock {
> > >  	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
> > >  	u32	mdev_us;	/* medium deviation */
> > >  	u32	rtt_seq;	/* sequence number to update rttvar */
> > > +	u32	rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
> > 
> > Nit: tab between ; and /* for consistency (I would personally prefer
> > the comment style as you see on 'highest_sack' but I don't think it's
> > enforced anymore).
> 
> Thanks, I missed that.
> 
> > Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
> > here because the relevant paragraph there covers a very specific case of
> > window retraction, caused by quantisation error from window scaling,
> > which is not the most common case here. I couldn't quickly find a better
> > reference though.
> 
> I agree, but there is a part that, I think, is more generally
> applicable:
> 
> 2.4. Addressing Window Retraction
> 
> [ specific window retraction case introduction removed ]
> ... Implementations MUST ensure that they handle a shrinking
> window, as specified in Section 4.2.2.16 of [RFC1122].
> 
> For the receiver, this implies that:
> 
> 1) The receiver MUST honor, as in window, any segment that would
>    have been in window for any <ACK> sent by the receiver.
> 2) When window scaling is in effect, the receiver SHOULD track the
>    actual maximum window sequence number (which is likely to be
>    greater than the window announced by the most recent <ACK>, if
>    more than one segment has arrived since the application consumed
>    any data in the receive buffer).
> 
> There is no "When window scaling is in effect," on the first
> requirement. And it "happens" to be implementable by the second
> requirement (with or without window scaling).

Right, I saw that, but the first requirement doesn't mention the
"actual maximum sequence number" which this new field represents.

> I think an improvement could be to refer to the receiver requirements
> specifically here.

Ah, yes, that sounds like a good idea.

> > More importantly: do we need to restore this on a connection that's
> > being dumped and recreated using TCP_REPAIR, or will things still work
> > (even though sub-optimally) if we lose this value?
> > 
> > Other window values that *need* to be dumped and restored are currently
> > available via TCP_REPAIR_WINDOW socket option, and they are listed in
> > do_tcp_getsockopt(), net/ipv4/tcp.c:
> > 
> > 	opt.snd_wl1	= tp->snd_wl1;
> > 	opt.snd_wnd	= tp->snd_wnd;
> > 	opt.max_window	= tp->max_window;
> > 	opt.rcv_wnd	= tp->rcv_wnd;
> > 	opt.rcv_wup	= tp->rcv_wup;
> > 
> > CRIU uses it to checkpoint and restore established connections, and
> > passt uses it to migrate them to a different host:
> > 
> > https://criu.org/TCP_connection
> > 
> > https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
> > 
> > If it's strictly needed to preserve functionality, we would need to add
> > it to struct tcp_repair_window, notify CRIU maintainers (or send them a
> > patch), and add this in passt as well (I can take care of it).
> 
> Thanks for the pointer, I missed that tp->rcv_wnd update. Could the
> following happen when checkpointing/restoring?
> 
> 1.
>    A client app opens a connection and writes (blocking) a specific
>    amount of data before doing any reads. (Not very clever, but this
>    is supposed to work; this is what caused the problem in the Valkey
>    tests.)
> 2. The traffic pattern causes an out-of-memory condition for the
>    receive buffer; we see the RWIN 0 segments that do not ack the
>    last data segment(s).
> 3. The TCP connection is checkpointed and restored (on the client
>    side) without restoring rcv_mwnd_seq.
> 4. If the receive buffer is still full at the new location, the
>    acceptable sequence numbers in the receive window will not change
>    (the restored client is still blocked on write) and we no longer
>    have the larger maximum receive window -> the client's kernel will
>    reject all incoming packets and the connection is stuck.
> 
> If this scenario is possible, I'd argue that rcv_mwnd_seq is
> necessary.

It really sounds like a corner case, especially 1. in combination with
2., but the outcome would be pretty bad, and I think it's possible.

Typically, once the connection is restored (with TCP_REPAIR_OFF, not
with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a
window probe / keepalive, but as far as I understand that wouldn't be
enough to fix the situation. And even if it did, we still have the
TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.

> > Strictly speaking, in that case, this could be considered a breaking
> > change for userspace, but I don't see how to avoid it, so I'd just
> > make sure it doesn't impact users, as TCP_REPAIR has just a couple of
> > (known!) projects relying on it.
> > 
> > An alternative would be to have a special, initial value representing
> > the fact that this value was lost, but it looks really annoying to not
> > be able to use a u32 for it.
> 
> Do we need a dedicated value indicating that rcv_mwnd_seq is not
> present, or is it enough to choose an initial rcv_mwnd_seq based on
> the size of the struct passed?
Both seem doable to me:

> Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
> leading to the problem described above, of course)

Well, but if we might run into the problem described above, we need to
dump / restore rcv_mwnd_seq in any case, so we wouldn't have an issue
at all. Except for a compatibility issue, but what you describe looks
like a reasonable fallback.

> Default value 0: Store how much we retracted the window, i.e.
> rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
> retracted and could double as the "we don't know" value.
> 
> For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
> rcv_wnd in tcp_repair_set_window() to keep the status quo. Of course,
> I am happy to discuss enhancements.

That makes sense to me at a glance, but I should still review / test
it as a whole.

> > Disregard all this if the correct value is not strictly needed for
> > functionality, of course. I haven't tested things (not yet, at least).
> > 
> > >  	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */
> > >  	u64	accecn_opt_tstamp;	/* Last AccECN option sent timestamp */
> > >  	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> > > 
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> > >  	return (u32) win;
> > >  }
> > > 
> > > +/* Compute the maximum receive window we ever advertised.
> > > + * Rcv_nxt can be after the window if our peer push more data
> > 
> > s/push/pushes/
> > 
> > s/Rcv_nxt/rcv_nxt/ (useful for grepping)
> 
> tcp_max_receive_window() is an adapted copy of tcp_receive_window()
> above. But it makes sense to improve it.

Ah, sorry, I didn't notice.

> > > + * than the offered window.
> > > + */
> > > +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> > > +{
> > > +	s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> > > +
> > > +	if (win < 0)
> > > +		win = 0;
> > 
> > I must be missing something but... if the sequence is about to wrap,
> > we'll return 0 here. Is that intended?
> > 
> > Doing the subtraction unsigned would have looked more natural to me,
> > but I didn't really think it through.
> 
> The subtraction is unsigned and the outcome is interpreted as
> signed. And as mentioned, it is copied with pride ;-)

Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride,
you say. :)

...but sure, if it's taken from there, it makes sense to keep it like
that, I guess.

> > > +	return (u32) win;
> > 
> > Kernel coding style doesn't usually include a space between cast and
> > identifier.
> 
> Yes, same reason as above, and I will change it.

-- 
Stefano

