Jakub Kicinski <jakub.kicin...@netronome.com> writes: > On Mon, 04 Mar 2019 12:58:51 +0100, Toke Høiland-Jørgensen wrote: >> >> diff --git a/include/net/netns/xdp.h b/include/net/netns/xdp.h >> >> index e5734261ba0a..4935dfe1cf43 100644 >> >> --- a/include/net/netns/xdp.h >> >> +++ b/include/net/netns/xdp.h >> >> @@ -4,10 +4,21 @@ >> >> >> >> #include <linux/rculist.h> >> >> #include <linux/mutex.h> >> >> +#include <linux/atomic.h> >> >> + >> >> +struct bpf_dtab; >> >> + >> >> +struct bpf_dtab_container { >> >> + struct bpf_dtab __rcu *dtab; >> >> + atomic_t refcnt; >> > >> > refcount_t ? I'm not sure what the rules are for non-performance >> > critical stuff are.. >> >> I based the use of atomic_t on what bpf/syscall.c is doing. The problem >> with using refcount_t is that it's a bit of a weird refcount because >> it's not strictly tied to the object lifetime (as the objects are only >> allocated if *both* the globals and the per-namespace refcnt are >0). >> >> >> +static void __dev_map_release_default_map(struct bpf_dtab_container >> >> *cont) >> >> +{ >> >> + struct bpf_dtab *dtab = NULL; >> >> + >> >> + lockdep_assert_held(&dev_map_lock); >> >> + >> >> + dtab = rcu_dereference(cont->dtab); >> >> + if (dtab) { >> > >> > How can it be NULL if it's refcounted? hm.. >> >> See above; it's not actually the object that is refcounted, it's a count >> of the number of active XDP programs in the namespace, which still needs >> to be kept so we can allocate the devmaps when a REDIRECT program is >> loaded. > > I think I got it now. I wonder if anything could be done to improve > the readability of the dual-refcnting :S Maybe use the "needed"/"used" > terms throughout?
Yeah, good idea, I'll try something along those lines. >> >> + list_del_rcu(&dtab->list); >> >> + rcu_assign_pointer(cont->dtab, NULL); >> >> + bpf_clear_redirect_map(&dtab->map); >> >> + call_rcu(&dtab->rcu, __dev_map_free); >> >> + } >> >> +} >> >> + >> >> +void dev_map_put_default_map(struct net *net) >> >> +{ >> >> + if (atomic_dec_and_test(&net->xdp.default_map.refcnt)) { >> >> + spin_lock(&dev_map_lock); >> >> + __dev_map_release_default_map(&net->xdp.default_map); >> >> + spin_unlock(&dev_map_lock); >> >> + } >> >> +} >> >> + >> >> +static int __init_default_map(struct bpf_dtab_container *cont) >> >> +{ >> >> + struct net *net = bpf_default_map_to_net(cont); >> >> + struct bpf_dtab *dtab, *old_dtab; >> >> + int size = DEV_MAP_DEFAULT_SIZE; >> >> + struct net_device *netdev; >> >> + union bpf_attr attr = {}; >> >> + u32 idx; >> >> + int err; >> >> + >> >> + lockdep_assert_held(&dev_map_lock); >> >> + >> >> + if (!atomic_read(&global_redirect_use)) >> >> + return 0; >> >> + >> >> + for_each_netdev(net, netdev) >> >> + if (netdev->ifindex >= size) >> >> + size <<= 1; >> >> + >> >> + old_dtab = rcu_dereference(cont->dtab); >> >> + if (old_dtab && old_dtab->map.max_entries == size) >> >> + return 0; >> >> + >> >> + dtab = kzalloc(sizeof(*dtab), GFP_USER); >> > >> > You are calling this with a spin lock held :( >> >> Yeah, not just this, there are more allocations in dev_map_init_map(). >> Hmm, I guess I need to think about the concurrency bits some more; it is >> made a bit tricky by the refcounts being outside the objects. >> >> Basically what I'm trying to protect against is the case where the >> global refcount goes to zero and then immediately back to one (or vice >> versa), which results in a deallocation immediately followed by an >> allocation. I was worried about the allocation and deallocation racing >> with each other (because there's a window after the refcnt is checked >> before the (de)allocation begins). It's not quite clear to me whether >> RCU is enough to protect against this... Thoughts? > > Hm. I think you'll still need a lock (mutex?) on the alloc path, but > the free path should be fine as long as you load the map pointer before > looking at the refcnt (atomic op ensuring the barrier there). Yeah, for the per-namespace refcnt it's pretty straight forward, the trouble is the global count that needs to iterate over all namespaces; probably need to put that all behind a (non-spin)lock, right? >> > Could you please add tests for the replacement combinations? >> >> Sure. But, erm, where? selftests? :) > > Yes :) selftests/bpf/ Look around at the scripts, AFAIK we don't have > much in the way of standard infra for system interaction tests like > this :S Right. I'll shamelessly copy from the tests already in there and see if I can't get something decent working :) -Toke