On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <[email protected]> wrote: > > On Sun, 17 May 2026 12:05:45 -0700 > Kuniyuki Iwashima <[email protected]> wrote: > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <[email protected]> wrote: > > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself > > > (using TCP_REPAIR_QUEUE to select it). > > > > > > If we receive data after the application fetched the sequence number > > > or saved the contents of the queue, though, the application will now > > > have outdated information, which defeats the whole functionality, > > > because this leads to gaps in sequence and data once they're restored > > > by the target instance of the application, resulting in a hanging or > > > otherwise non-functional TCP connection. > > > > > > This type of race condition was discovered in the KubeVirt integration > > > of passt(1), using a remote iperf3 client connected to an iperf3 > > > server running in the guest which is being migrated. The setup allows > > > traffic to reach the origin node hosting the guest during the > > > migration. > > > > > > If passt dumps sequence number and contents of the queue *before* > > > further data is received and acknowledged to the peer by the kernel, > > > once the TCP data connection is migrated to the target node, the > > > remote client becomes unable to continue sending, because a portion > > > of the data it sent *and received an acknowledgement for* is now lost. > > > > > > Schematically: > > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server > > > > > > 2. client <--ACK: 100-- origin host > > > > > > 3. migration starts, > > > > Here, a netfilter rule or bpf prog must be installed to > > drop packets temporarily until migration completes. > > Thanks for the review. > > I have to say it's rather unexpected to have to work around obvious > kernel issues in userspace: TCP_REPAIR implies that queues are frozen, > and this is handled correctly on the sending side (see > tcp_write_xmit()), but was clearly forgotten on the receiving side. > > TCP_REPAIR also allows to dump queues, not just sequence numbers, so > this really is a bad race condition making the whole functionality > unreliable. > > But anyway, even looking for a practical workaround of the kind you > suggested, I see two issues with it: > > 1. we would still have a race condition, because userspace doesn't have > a way to synchronise application of nftables rules (or even a BPF > program) with the effects of TCP_REPAIR.
Note that setsockopt(TCP_REPAIR) is under lock_sock(), so the backlog is always cleared before returning to userspace, and the following getsockopt() will have stable view. > We could apply nftables > rules "a while before" just to be sure, but this is severely going to > impact migration downtime So it's "just before", not "a while before". > > 2. passt(1) runs unprivileged and uses a very simple helper to set > TCP_REPAIR on the socket. I guess it uses userns ? setsockopt(TCP_REPAIR) requires CAP_NET_ADMIN in the userns tied to the socket's netns (in tcp_can_repair_sock()), so you should be able to use netfilter anyway. > Expanding helpers of this kind to directly > manipulate nftables rules, or installing BPF programs, looks like a > substantial security drawback > > > We do not want unlikely tests in the fast path. > > I understand, but note that this doesn't really add a branch to the > fast path: there's already a list of (more expensive) conditions under > which we need to fall back to slow path, with 'tp' definitely > prefetched at that point, so I don't expect any fast path cost from > doing this. Everything else is handled in the slow path. > > To be sure, I also checked throughput with delivery to local sockets as > that's the only case affected (veth setup similar to the one from the > selftests from patch 2/2) and there's no visible difference. > > > You can find a similar issue: > > https://lore.kernel.org/netdev/[email protected]/ > > That one is not a kernel issue: in that case, the socket is closed, so > it's actually expected that the kernel will reset the connection. As > Jakub pointed out, that patch introduces a race condition on its own, > and it's a hack rather than a fix. > > We happened to have that kind of issue in passt as well (the > implementation is inspired by CRIU), but that's something entirely > different which userspace clearly needs to take care of, so we fixed it > here: > > https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351 > > > > passt enables repair mode, dumps the sequence > > > number (101) and sends it to the target node of the guest migration > > > > > > 4. client --seq 101:201--> origin host (passt not receiving anymore) > > > > > > 5. client <--ACK: 201-- origin host > > > > > > 6. migration completes, and passt restores sequence number 101 on the > > > migrated socket > > > > > > 7. client --seq 201:301--> target host (now seeing a sequence jump) > > > > > > 8. client <--ACK: 100-- target host > > > > > > ...and the connection can't recover anymore, because the client can't > > > resend data that was already (erroneously) acknowledged. We need to > > > avoid step 5. above. > > > > > > This would equally affect CRIU (the other known user of TCP_REPAIR), > > > should data be received while the original container is frozen: the > > > sequence dumped and the contents of the saved incoming queue would > > > then depend on the timing. > > > > > > The race condition is also illustrated in the kselftests introduced > > > by the next patch. > > > > > > To prevent this issue, discard data received for a socket in repair > > > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR. > > > > > > Fixes: ee9952831cfd ("tcp: Initial repair mode") > > > Tested-by: Laurent Vivier <[email protected]> > > > Signed-off-by: Stefano Brivio <[email protected]> > > > --- > > > include/net/dropreason-core.h | 3 +++ > > > net/ipv4/tcp_input.c | 14 +++++++++++++- > > > 2 files changed, 16 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h > > > index 2f312d1f67d6..19ab9e6ffc33 100644 > > > --- a/include/net/dropreason-core.h > > > +++ b/include/net/dropreason-core.h > > > @@ -9,6 +9,7 @@ > > > FN(SOCKET_CLOSE) \ > > > FN(SOCKET_FILTER) \ > > > FN(SOCKET_RCVBUFF) \ > > > + FN(SOCKET_REPAIR) \ > > > FN(UNIX_DISCONNECT) \ > > > FN(UNIX_SKIP_OOB) \ > > > FN(PKT_TOO_SMALL) \ > > > @@ -158,6 +159,8 @@ enum skb_drop_reason { > > > SKB_DROP_REASON_SOCKET_FILTER, > > > /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full > > > */ > > > SKB_DROP_REASON_SOCKET_RCVBUFF, > > > + /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */ > > > + SKB_DROP_REASON_SOCKET_REPAIR, > > > /** > > > * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when > > > SOCK_DGRAM > > > * or SOCK_SEQPACKET socket re-connect()s to another socket or > > > notices > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > > index d5c9e65d9760..6eca34274f97 100644 > > > --- a/net/ipv4/tcp_input.c > > > +++ b/net/ipv4/tcp_input.c > > > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, > > > struct sk_buff *skb, > > > * or pure receivers (this means either the sequence number or the > > > ack > > > * value must stay constant) > > > * - Unexpected TCP option. > > > + * - Socket is in repair mode. > > > * > > > * When these conditions are not satisfied it drops into a standard > > > * receive procedure patterned after RFC793 to handle all cases. > > > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct > > > sk_buff *skb) > > > > > > if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags && > > > TCP_SKB_CB(skb)->seq == tp->rcv_nxt && > > > - !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) { > > > + !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) && > > > + !tp->repair) { > > > int tcp_header_len = tp->tcp_header_len; > > > s32 delta = 0; > > > int flag = 0; > > > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct > > > sk_buff *skb) > > > goto discard; > > > } > > > > > > + if (tp->repair) { > > > + reason = SKB_DROP_REASON_SOCKET_REPAIR; > > > + goto discard; > > > + } > > > + > > > /* > > > * Standard slow path. > > > */ > > > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct > > > sk_buff *skb) > > > int queued = 0; > > > SKB_DR(reason); > > > > > > + if (tp->repair) { > > > + SKB_DR_SET(reason, SOCKET_REPAIR); > > > + goto discard; > > > + } > > > + > > > switch (sk->sk_state) { > > > case TCP_CLOSE: > > > SKB_DR_SET(reason, TCP_CLOSE); > > > -- > > > 2.43.0 > > -- > Stefano >

