Joe Stringer <j...@wand.net.nz> wrote: > As discussed during LSFMM, I've been looking at adding something like > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can > be implemented with integration into other BPF logic, however > currently any attempts to do so are blocked by the skb_orphan() call > in ip_rcv_core() (which will effectively ignore any socket assign > decision made by the TC BPF program). > > Recently I was attempting to remove the skb_orphan() call, and I've > been trying different things but there seems to be some context I'm > missing. Here's the core of the patch: > > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c > index ed97724c5e33..16aea980318a 100644 > --- a/net/ipv4/ip_input.c > +++ b/net/ipv4/ip_input.c > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff > *skb, struct net *net) > memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); > IPCB(skb)->iif = skb->skb_iif; > > - /* Must drop socket now because of tproxy. */ > - skb_orphan(skb); > > return skb; > > The statement that the socket must be dropped because of tproxy > doesn't make sense to me, because the PRE_ROUTING hook is hit after > this, which will call into the tproxy logic and eventually > nf_tproxy_assign_sock() which already does the skb_orphan() itself.
in comment: s/tproxy/skb_steal_sock/ at least thats what I concluded a few years ago when I looked into the skb_oprhan() need. IIRC some device drivers use skb->sk for backpressure, so without this non-tcp socket would be stolen by skb_steal_sock. We also recently removed skb orphan when crossing netns: commit 9c4c325252c54b34d53b3d0ffd535182b744e03d Author: Flavio Leitner <f...@redhat.com> skbuff: preserve sock reference when scrubbing the skb. So thats another case where this orphan is needed. What could be done is adding some way to delay/defer the orphaning further, but we would need at the very least some annotation for skb_steal_sock to know when the skb->sk is really from TPROXY or if it has to orphan. Same for the safety check in the forwarding path. Netfilter modules need o be audited as well, they might make assumptions wrt. skb->sk being inet sockets (set by local stack or early demux). > However, if I drop these lines then I end up causing sockets to > release references too many times. Seems like if we don't orphan the > skb here, then later logic assumes that we have one more reference > than we actually have, and decrements the count when it shouldn't > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems > to assume we always have a reference to the socket?) We might be calling the wrong destructor (i.e., the one set by tcp receive instead of the one set at tx time)?