On Tue, May 19, 2026 at 9:12 AM Stefano Brivio <[email protected]> wrote: > > On Tue, 19 May 2026 08:36:27 -0700 > Andrei Vagin <[email protected]> wrote: > > > On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <[email protected]> wrote: > > > > > > On Mon, 18 May 2026 00:57:16 -0700 > > > Eric Dumazet <[email protected]> wrote: > > > > > > > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <[email protected]> > > > > wrote: > > > > > > > > > > On Sun, 17 May 2026 12:05:45 -0700 > > > > > Kuniyuki Iwashima <[email protected]> wrote: > > > > > > > > > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with > > > > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence > > > > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue > > > > > > > itself > > > > > > > (using TCP_REPAIR_QUEUE to select it). > > > > > > > > > > > > > > If we receive data after the application fetched the sequence > > > > > > > number > > > > > > > or saved the contents of the queue, though, the application will > > > > > > > now > > > > > > > have outdated information, which defeats the whole functionality, > > > > > > > because this leads to gaps in sequence and data once they're > > > > > > > restored > > > > > > > by the target instance of the application, resulting in a hanging > > > > > > > or > > > > > > > otherwise non-functional TCP connection. > > > > > > > > > > > > > > This type of race condition was discovered in the KubeVirt > > > > > > > integration > > > > > > > of passt(1), using a remote iperf3 client connected to an iperf3 > > > > > > > server running in the guest which is being migrated. The setup > > > > > > > allows > > > > > > > traffic to reach the origin node hosting the guest during the > > > > > > > migration. > > > > > > > > > > > > > > If passt dumps sequence number and contents of the queue *before* > > > > > > > further data is received and acknowledged to the peer by the > > > > > > > kernel, > > > > > > > once the TCP data connection is migrated to the target node, the > > > > > > > remote client becomes unable to continue sending, because a > > > > > > > portion > > > > > > > of the data it sent *and received an acknowledgement for* is now > > > > > > > lost. > > > > > > > > > > > > > > Schematically: > > > > > > > > > > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> > > > > > > > server > > > > > > > > > > > > > > 2. client <--ACK: 100-- origin host > > > > > > > > > > > > > > 3. migration starts, > > > > > > > > > > > > Here, a netfilter rule or bpf prog must be installed to > > > > > > drop packets temporarily until migration completes. > > > > > > > > > > Thanks for the review. > > > > > > > > > > I have to say it's rather unexpected to have to work around obvious > > > > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen, > > > > > and this is handled correctly on the sending side (see > > > > > tcp_write_xmit()), but was clearly forgotten on the receiving side. > > > > > > > > Have you contacted TCP repair authors for best practices? > > > > > > I Cc'ed them here, Pavel is the author (but as far as I understand not > > > active in kernel development anymore), and I know that Andrei did some > > > substantial work on it in the past, so he's Cc'ed here as well. > > > > > > But we are following what CRIU (the userspace reference implementation) > > > does, and CRIU would be affected by this issue as well (depending on > > > usage). > > > > Before extracting the socket state, CRIU uses netfilter (iptables or > > nftables) to block all traffic associated with the specific TCP > > connection or, in the case of a container, the entire network namespace. > > ...by default, yes, by "depending on usage" I actually meant > '--network-lock skip', but I have to admit I'm not sure how commonly > that is used. > > > This approach provides two main benefits: > > > > During the dump, we don't need to leave all sockets in repair mode for > > the entire duration of the dump. We enable and disable repair mode just > > to grab the state. It's simplify a roll back if something goes wrong. > > Thanks for clarifying this aspect, I wasn't sure whether it was done to > make rollback easier.
I would rephrase that: it likely wasn't implemented originally because Pavel may not have considered it necessary. When checkpointing a container with a separate network namespace, blocking all traffic within that namespace is straightforward. If CRIU crashes or is killed during a dump, we simply unblock the traffic, and everything continues to function without needing to manage individual connections. As I mentioned before, I have no objections to the idea itself, moreover it looks quite rational and will make the code safer regarding undesired side effects/race conditions. Thanks for working on that. Thanks, Andrei

