On Tue, 24 Feb 2026 19:07:45 +0100
Simon Baatz <[email protected]> wrote:

> Hi Stefano,
> 
> On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > Hi Simon,
> > 
> > It all makes sense to me at a quick look, I have just some nits and one
> > more substantial worry, below:
> > 
> > On Fri, 20 Feb 2026 00:55:14 +0100
> > Simon Baatz via B4 Relay <[email protected]> wrote:
> >   
> > > From: Simon Baatz <[email protected]>
> > > 
> > > By default, the Linux TCP implementation does not shrink the
> > > advertised window (RFC 7323 calls this "window retraction") with the
> > > following exceptions:
> > > 
> > > - When an incoming segment cannot be added due to the receive buffer
> > >   running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > >   handling of extreme memory squeeze") a zero window will be
> > >   advertised in this case. It turns out that reaching the required
> > >   "memory pressure" is very easy when window scaling is in use. In the
> > >   simplest case, sending a sufficient number of segments smaller than
> > >   the scale factor to a receiver that does not read data is enough.
> > > 
> > >   Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > >   happens much earlier than before, leading to regressions (the test
> > >   suite of the Valkey project does not pass because of a TCP
> > >   connection that is no longer bi-directional).  
> > 
> > Ouch. By the way, that same commit helped us unveil an issue (at least
> > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> > 
> >   https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> >  
> 
> This looks concerning: It seems as if just filling the advertised
> window triggered the out of memory condition(?).

Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory).

Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), we will even get zero-window updates in that case. Jon
raised the issue here:

  https://lore.kernel.org/r/[email protected]/

but it was not really fixed. Anyway:

> Am I right in
> assuming that this happened with the original 1d2fbaad7cd8, not the
> relaxed version of tcp_can_ingest() from f017c1f768b?

...you're right. I wasn't even aware of f017c1f768b, thanks for
pointing that out. That seems to make things saner, and I don't expect
further issues at this point.

Speaking of which, passt struggled talking to applications entirely
written in the 21st century. That's socat, started in 2001 I think, as
used in Podman tests, and its only SO_RCVBUF-related fault is that it
uses the default 208 KiB value (from rmem_default) as a starting value
by... not doing anything.

Applications can set SO_RCVBUF and SO_SNDBUF to bigger values
(depending on rmem_max and wmem_max), but if they do, automatic tuning
of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is
disabled. We used to do that in passt itself, and I eventually dropped
it here:

  https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06

because we might really need automatic tuning and the resulting big
buffers for high latency, high throughput connections.

> > > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> > >   allowing the tcp window to shrink") addressed the "eating memory"
> > >   problem by introducing a sysctl knob that allows shrinking the
> > >   window before running out of memory.
> > > 
> > > However, RFC 7323 does not only state that shrinking the window is
> > > necessary in some cases, it also formulates requirements for TCP
> > > implementations when doing so (Section 2.4).
> > > 
> > > This commit addresses the receiver-side requirements: After retracting
> > > the window, the peer may have a snd_nxt that lies within a previously
> > > advertised window but is now beyond the retracted window. This means
> > > that all incoming segments (including pure ACKs) will be rejected
> > > until the application happens to read enough data to let the peer's
> > > snd_nxt be in window again (which may be never).
> > > 
> > > To comply with RFC 7323, the receiver MUST honor any segment that
> > > would have been in window for any ACK sent by the receiver and, when
> > > window scaling is in effect, SHOULD track the maximum window sequence
> > > number it has advertised. This patch tracks that maximum window
> > > sequence number throughout the connection and uses it in
> > > tcp_sequence() when deciding whether a segment is acceptable.
> > > Acceptability of data is not changed.
> > > 
> > > Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> > > Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> > > Signed-off-by: Simon Baatz <[email protected]>
> > > ---
> > >  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
> > >  include/linux/tcp.h                                   |  1 +
> > >  include/net/tcp.h                                     | 14 ++++++++++++++
> > >  net/ipv4/tcp_fastopen.c                               |  1 +
> > >  net/ipv4/tcp_input.c                                  |  6 ++++--
> > >  net/ipv4/tcp_minisocks.c                              |  1 +
> > >  net/ipv4/tcp_output.c                                 | 12 ++++++++++++
> > >  .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt  |  2 +-
> > >  8 files changed, 35 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> > > --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> > > +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > @@ -121,6 +121,7 @@ u64                           delivered_mstamp        read_write
> > >  u32                           rate_delivered                              read_mostly         tcp_rate_gen
> > >  u32                           rate_interval_us                            read_mostly         rate_delivered,rate_app_limited
> > >  u32                           rcv_wnd                 read_write          read_mostly         tcp_select_window,tcp_receive_window,tcp_fast_path_check
> > > +u32                           rcv_mwnd_seq            read_write                              tcp_select_window
> > >  u32                           write_seq               read_write                              tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> > >  u32                           notsent_lowat           read_mostly                             tcp_stream_memory_free
> > >  u32                           pushed_seq              read_write                              tcp_mark_push,forced_push
> > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > > index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> > > --- a/include/linux/tcp.h
> > > +++ b/include/linux/tcp.h
> > > @@ -271,6 +271,7 @@ struct tcp_sock {
> > >  u32     lsndtime;       /* timestamp of last sent data packet (for restart window) */
> > >  u32     mdev_us;        /* medium deviation                     */
> > >  u32     rtt_seq;        /* sequence number to update rttvar     */
> > > + u32     rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */  
> > 
> > Nit: tab between ; and /* for consistency (I would personally prefer
> > the comment style as you see on 'highest_sack' but I don't think it's
> > enforced anymore).  
> 
> Thanks, I missed that.
>  
> > Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
> > here because the relevant paragraph there covers a very specific case of
> > window retraction, caused by quantisation error from window scaling,
> > which is not the most common case here. I couldn't quickly find a better
> > reference though.  
> 
> I agree, but there is a part that, I think, is more generally
> applicable:
> 
> 2.4.  Addressing Window Retraction
> 
>    [ specific window retraction case introduction removed ]
>    ... Implementations MUST ensure that they handle a shrinking
>    window, as specified in Section 4.2.2.16 of [RFC1122].
> 
>    For the receiver, this implies that:
> 
>    1)  The receiver MUST honor, as in window, any segment that would
>        have been in window for any <ACK> sent by the receiver.
> 
>    2)  When window scaling is in effect, the receiver SHOULD track the
>        actual maximum window sequence number (which is likely to be
>        greater than the window announced by the most recent <ACK>, if
>        more than one segment has arrived since the application consumed
>        any data in the receive buffer).
> 
> There is no "When window scaling is in effect," on the first
> requirement. And it "happens" to be implementable by the second
> requirement (with or without window scaling).

Right, I saw that, but the first requirement doesn't mention the
"actual maximum sequence number" which this new field represents.

> I think an improvement could be to refer to the receiver requirements
> specifically here.

Ah, yes, that sounds like a good idea.

> > More importantly: do we need to restore this on a connection that's
> > being dumped and recreated using TCP_REPAIR, or will things still work
> > (even though sub-optimally) if we lose this value?
> > 
> > Other window values that *need* to be dumped and restored are currently
> > available via TCP_REPAIR_WINDOW socket option, and they are listed in
> > do_tcp_getsockopt(), net/ipv4/tcp.c:
> > 
> >             opt.snd_wl1     = tp->snd_wl1;
> >             opt.snd_wnd     = tp->snd_wnd;
> >             opt.max_window  = tp->max_window;
> >             opt.rcv_wnd     = tp->rcv_wnd;
> >             opt.rcv_wup     = tp->rcv_wup;
> > 
> > CRIU uses it to checkpoint and restore established connections, and
> > passt uses it to migrate them to a different host:
> > 
> >   https://criu.org/TCP_connection
> > 
> >   https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
> > 
> > If it's strictly needed to preserve functionality, we would need to add
> > it to struct tcp_repair_window, notify CRIU maintainers (or send them a
> > patch), and add this in passt as well (I can take care of it).  
> 
> Thanks for the pointer, I missed that tp->rcv_wnd update.  Could the
> following happen when checkpointing/restoring?
> 
> 1. A client app opens a connection and writes (blocking) a specific amount
>    of data before doing any reads.  (Not very clever, but this is
>    supposed to work; this is what caused the problem in the Valkey
>    tests.)
> 2. The traffic pattern causes an out-of-memory condition for the
>    receive buffer; we see the RWIN 0 segments that do not ack the
>    last data segment(s).
> 3. TCP connection is checkpointed and restored (on the client side) without
>    restoring rcv_mwnd_seq.
> 4. If the receive buffer is still full at the new location, the
>    acceptable sequence numbers in the receive window will not change
>    (restored client is still blocked on write) and we no longer have
>    the larger max receive window -> the client's kernel will reject
>    all incoming packets and the connection is stuck.
> 
> If this scenario is possible, I'd argue that rcv_mwnd_seq is
> necessary.

It really sounds like a corner case, especially 1. in combination with
2., but the outcome would be pretty bad, and I think it's possible.

Typically, once the connection is restored (with TCP_REPAIR_OFF, not
with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a
window probe / keepalive, but as far as I understand that wouldn't be
enough to fix the situation. And even if it did, we still have the
TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.

> > Strictly speaking, in case, this could be considered a breaking change
> > for userspace, but I don't see how to avoid it, so I'd just make sure
> > it doesn't impact users as TCP_REPAIR has just a couple of (known!)
> > projects relying on it.
> > 
> > An alternative would be to have a special, initial value representing
> > the fact that this value was lost, but it looks really annoying to not
> > be able to use a u32 for it.  
> 
> Do we need a dedicated value indicating that rcv_mwnd_seq is not
> present, or is it enough to choose an initial rcv_mwnd_seq based on
> the size of the struct passed?  Both seem doable to me:
> 
> Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
> leading to the problem described above, of course)

Well, but if we can run into the problem described above, we need to
dump / restore rcv_mwnd_seq in any case, and then we wouldn't have the
issue at all.

Except for a compatibility issue, but what you describe looks like a
reasonable fallback.

> Default value 0: Store how much we retracted the window, i.e. 
> rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
> retracted and could double as the "we don't know" value.
> 
> For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
> rcv_wnd in tcp_repair_set_window() to keep status quo. Of course,
> I am happy to discuss enhancements.

That makes sense to me at a glance, but I should still review / test it
as a whole.

> > Disregard all this if the correct value is not strictly needed for
> > functionality, of course. I haven't tested things (not yet, at least).
> >   
> > >   u64     tcp_wstamp_ns;  /* departure time for next sent data packet */
> > >   u64     accecn_opt_tstamp;      /* Last AccECN option sent timestamp */
> > >  struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> > >   return (u32) win;
> > >  }
> > >  
> > > +/* Compute the maximum receive window we ever advertised.
> > > + * Rcv_nxt can be after the window if our peer push more data  
> > 
> > s/push/pushes/
> > 
> > s/Rcv_nxt/rcv_nxt/ (useful for grepping)  
> 
> tcp_max_receive_window() is an adapted copy of
> tcp_receive_window() above. But it makes sense to improve it.

Ah, sorry, I didn't notice.

> > > + * than the offered window.
> > > + */
> > > +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> > > +{
> > > + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> > > +
> > > + if (win < 0)
> > > +         win = 0;  
> > 
> > I must be missing something but... if the sequence is about to wrap,
> > we'll return 0 here. Is that intended?
> > 
> > Doing the subtraction unsigned would have looked more natural to me,
> > but I didn't really think it through.  
> 
> The subtraction is unsigned and the outcome is interpreted as
> signed. And as mentioned, it is copied with pride ;-)

Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride,
you say. :) ...but sure, if it's taken from there, it makes sense to
keep it like that I guess.

> > > + return (u32) win;  
> > 
> > Kernel coding style doesn't usually include a space between cast and
> > identifier.  
> 
> Yes, same reason as above and I will change it.

-- 
Stefano

