On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <[email protected]> wrote:
>
> On Sun, 17 May 2026 12:05:45 -0700
> Kuniyuki Iwashima <[email protected]> wrote:
>
> > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <[email protected]> wrote:
> > >
> > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > (using TCP_REPAIR_QUEUE to select it).
> > >
> > > If we receive data after the application fetched the sequence number
> > > or saved the contents of the queue, though, the application will now
> > > have outdated information, which defeats the whole functionality,
> > > because this leads to gaps in sequence and data once they're restored
> > > by the target instance of the application, resulting in a hanging or
> > > otherwise non-functional TCP connection.
> > >
> > > This type of race condition was discovered in the KubeVirt integration
> > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > server running in the guest which is being migrated. The setup allows
> > > traffic to reach the origin node hosting the guest during the
> > > migration.
> > >
> > > If passt dumps sequence number and contents of the queue *before*
> > > further data is received and acknowledged to the peer by the kernel,
> > > once the TCP data connection is migrated to the target node, the
> > > remote client becomes unable to continue sending, because a portion
> > > of the data it sent *and received an acknowledgement for* is now lost.
> > >
> > > Schematically:
> > >
> > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > >
> > > 2. client <--ACK: 100-- origin host
> > >
> > > 3. migration starts,
> >
> > Here, a netfilter rule or bpf prog must be installed to
> > drop packets temporarily until migration completes.
>
> Thanks for the review.
>
> I have to say it's rather unexpected to have to work around obvious
> kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> and this is handled correctly on the sending side (see
> tcp_write_xmit()), but was clearly forgotten on the receiving side.
>
> TCP_REPAIR also allows to dump queues, not just sequence numbers, so
> this really is a bad race condition making the whole functionality
> unreliable.
>
> But anyway, even looking for a practical workaround of the kind you
> suggested, I see two issues with it:
>
> 1. we would still have a race condition, because userspace doesn't have
>    a way to synchronise application of nftables rules (or even a BPF
>    program) with the effects of TCP_REPAIR.

Note that setsockopt(TCP_REPAIR) is under lock_sock(), so the
backlog is always cleared before returning to userspace, and the
following getsockopt() will have stable view.


> We could apply nftables
>    rules "a while before" just to be sure, but this is severely going to
>    impact migration downtime

So it's "just before", not "a while before".


>
> 2. passt(1) runs unprivileged and uses a very simple helper to set
>    TCP_REPAIR on the socket.

I guess it uses userns ?

setsockopt(TCP_REPAIR) requires CAP_NET_ADMIN in the
userns tied to the socket's netns (in tcp_can_repair_sock()),
so you should be able to use netfilter anyway.


> Expanding helpers of this kind to directly
>    manipulate nftables rules, or installing BPF programs, looks like a
>    substantial security drawback
>
> > We do not want unlikely tests in the fast path.
>
> I understand, but note that this doesn't really add a branch to the
> fast path: there's already a list of (more expensive) conditions under
> which we need to fall back to slow path, with 'tp' definitely
> prefetched at that point, so I don't expect any fast path cost from
> doing this. Everything else is handled in the slow path.
>
> To be sure, I also checked throughput with delivery to local sockets as
> that's the only case affected (veth setup similar to the one from the
> selftests from patch 2/2) and there's no visible difference.
>
> > You can find a similar issue:
> > https://lore.kernel.org/netdev/[email protected]/
>
> That one is not a kernel issue: in that case, the socket is closed, so
> it's actually expected that the kernel will reset the connection. As
> Jakub pointed out, that patch introduces a race condition on its own,
> and it's a hack rather than a fix.
>
> We happened to have that kind of issue in passt as well (the
> implementation is inspired by CRIU), but that's something entirely
> different which userspace clearly needs to take care of, so we fixed it
> here:
>
>   https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351
>
> > > passt enables repair mode, dumps the sequence
> > >    number (101) and sends it to the target node of the guest migration
> > >
> > > 4. client --seq 101:201--> origin host (passt not receiving anymore)
> > >
> > > 5. client <--ACK: 201-- origin host
> > >
> > > 6. migration completes, and passt restores sequence number 101 on the
> > >    migrated socket
> > >
> > > 7. client --seq 201:301--> target host (now seeing a sequence jump)
> > >
> > > 8. client <--ACK: 100-- target host
> > >
> > > ...and the connection can't recover anymore, because the client can't
> > > resend data that was already (erroneously) acknowledged. We need to
> > > avoid step 5. above.
> > >
> > > This would equally affect CRIU (the other known user of TCP_REPAIR),
> > > should data be received while the original container is frozen: the
> > > sequence dumped and the contents of the saved incoming queue would
> > > then depend on the timing.
> > >
> > > The race condition is also illustrated in the kselftests introduced
> > > by the next patch.
> > >
> > > To prevent this issue, discard data received for a socket in repair
> > > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
> > >
> > > Fixes: ee9952831cfd ("tcp: Initial repair mode")
> > > Tested-by: Laurent Vivier <[email protected]>
> > > Signed-off-by: Stefano Brivio <[email protected]>
> > > ---
> > >  include/net/dropreason-core.h |  3 +++
> > >  net/ipv4/tcp_input.c          | 14 +++++++++++++-
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > > index 2f312d1f67d6..19ab9e6ffc33 100644
> > > --- a/include/net/dropreason-core.h
> > > +++ b/include/net/dropreason-core.h
> > > @@ -9,6 +9,7 @@
> > >         FN(SOCKET_CLOSE)                \
> > >         FN(SOCKET_FILTER)               \
> > >         FN(SOCKET_RCVBUFF)              \
> > > +       FN(SOCKET_REPAIR)               \
> > >         FN(UNIX_DISCONNECT)             \
> > >         FN(UNIX_SKIP_OOB)               \
> > >         FN(PKT_TOO_SMALL)               \
> > > @@ -158,6 +159,8 @@ enum skb_drop_reason {
> > >         SKB_DROP_REASON_SOCKET_FILTER,
> > >         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full 
> > > */
> > >         SKB_DROP_REASON_SOCKET_RCVBUFF,
> > > +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> > > +       SKB_DROP_REASON_SOCKET_REPAIR,
> > >         /**
> > >          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when 
> > > SOCK_DGRAM
> > >          * or SOCK_SEQPACKET socket re-connect()s to another socket or 
> > > notices
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index d5c9e65d9760..6eca34274f97 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, 
> > > struct sk_buff *skb,
> > >   *       or pure receivers (this means either the sequence number or the 
> > > ack
> > >   *       value must stay constant)
> > >   *     - Unexpected TCP option.
> > > + *     - Socket is in repair mode.
> > >   *
> > >   *     When these conditions are not satisfied it drops into a standard
> > >   *     receive procedure patterned after RFC793 to handle all cases.
> > > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct 
> > > sk_buff *skb)
> > >
> > >         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
> > >             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> > > -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> > > +           !tp->repair) {
> > >                 int tcp_header_len = tp->tcp_header_len;
> > >                 s32 delta = 0;
> > >                 int flag = 0;
> > > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct 
> > > sk_buff *skb)
> > >                 goto discard;
> > >         }
> > >
> > > +       if (tp->repair) {
> > > +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> > > +               goto discard;
> > > +       }
> > > +
> > >         /*
> > >          *      Standard slow path.
> > >          */
> > > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct 
> > > sk_buff *skb)
> > >         int queued = 0;
> > >         SKB_DR(reason);
> > >
> > > +       if (tp->repair) {
> > > +               SKB_DR_SET(reason, SOCKET_REPAIR);
> > > +               goto discard;
> > > +       }
> > > +
> > >         switch (sk->sk_state) {
> > >         case TCP_CLOSE:
> > >                 SKB_DR_SET(reason, TCP_CLOSE);
> > > --
> > > 2.43.0
>
> --
> Stefano
>

Reply via email to