On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <c...@neo-zeon.de> wrote:
> >
> > Hello,
> >
> > Somewhere along the way I got the impression that it generally takes
> > those affected hours before their systems lock up. I'm (generally) able
> > to reproduce this issue much faster than that. Regardless, I can help test.
> >
> > Are there any patches that need testing or is this all still pending
> > discussion around the best way to resolve the issue?
>
> Yes. I come up with a (hopefully) much better patch in the attachment.
> Can you help to test it? You need to unapply the previous patch before
> applying this one.
>
> (Just in case of any confusion: I still believe we should check NULL on
> top of this refcnt fix. But it should be a separate patch.)
>
> Thank you!
Not opposing the patch, but the Fixes tag is still confusing me.
Do we have an explanation for what's wrong with 4bfc0bb2c60e?
It looks like we have cgroup_bpf_get()/put() exactly where we have
cgroup_get()/put(), so it would be nice to understand what's different
if the problem is bpf-related.

Thanks!

> commit 259150604c0b77c717fdaab057da5722e2dfd922
> Author: Cong Wang <xiyou.wangc...@gmail.com>
> Date:   Sat Jun 13 12:34:40 2020 -0700
>
>     cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
>
>     When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>     copied, so the cgroup refcnt must be taken too. And, unlike the
>     sk_alloc() path, sock_update_netprioidx() is not called here.
>     Therefore, it is safe and necessary to grab the cgroup refcnt
>     even when cgroup_sk_alloc is disabled.
>
>     sk_clone_lock() is in BH context anyway, the in_interrupt()
>     would terminate this function if called there. And for sk_alloc()
>     skcd->val is always zero. So it's safe to factor out the code
>     to make it more readable.
>
>     Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>     Reported-by: Cameron Berkenpas <c...@neo-zeon.de>
>     Reported-by: Peter Geis <pgwipe...@gmail.com>
>     Reported-by: Lu Fengqi <lufq.f...@cn.fujitsu.com>
>     Reported-by: Daniël Sonck <dsonc...@gmail.com>
>     Tested-by: Cameron Berkenpas <c...@neo-zeon.de>
>     Cc: Daniel Borkmann <dan...@iogearbox.net>
>     Cc: Zefan Li <lize...@huawei.com>
>     Cc: Tejun Heo <t...@kernel.org>
>     Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 52661155f85f..4f1cd0edc57d 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -790,7 +790,8 @@ struct sock_cgroup_data {
>  	union {
>  #ifdef __LITTLE_ENDIAN
>  		struct {
> -			u8	is_data;
> +			u8	is_data : 1;
> +			u8	no_refcnt : 1;
>  			u8	padding;
>  			u16	prioidx;
>  			u32	classid;
> @@ -800,7 +801,8 @@ struct sock_cgroup_data {
>  			u32	classid;
>  			u16	prioidx;
>  			u8	padding;
> -			u8	is_data;
> +			u8	no_refcnt : 1;
> +			u8	is_data : 1;
>  		} __packed;
>  #endif
>  		u64	val;
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..618838c48313 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
>
>  void cgroup_sk_alloc_disable(void);
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd);
>  void cgroup_sk_free(struct sock_cgroup_data *skcd);
>
>  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> @@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  	 */
>  	v = READ_ONCE(skcd->val);
>
> -	if (v & 1)
> +	if (v & 3)
>  		return &cgrp_dfl_root.cgrp;
>
>  	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> @@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  #else	/* CONFIG_CGROUP_DATA */
>
>  static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
> +static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
>  static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
>
>  #endif	/* CONFIG_CGROUP_DATA */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1ea181a58465..dd247747ec14 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
>
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  {
> -	if (cgroup_sk_alloc_disabled)
> -		return;
> -
> -	/* Socket clone path */
> -	if (skcd->val) {
> -		/*
> -		 * We might be cloning a socket which is left in an empty
> -		 * cgroup and the cgroup might have already been rmdir'd.
> -		 * Don't use cgroup_get_live().
> -		 */
> -		cgroup_get(sock_cgroup_ptr(skcd));
> -		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	if (cgroup_sk_alloc_disabled) {
> +		skcd->no_refcnt = 1;
>  		return;
>  	}
>
> @@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  	rcu_read_unlock();
>  }
>
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd)
> +{
> +	if (skcd->val) {
> +		if (skcd->no_refcnt)
> +			return;
> +		/*
> +		 * We might be cloning a socket which is left in an empty
> +		 * cgroup and the cgroup might have already been rmdir'd.
> +		 * Don't use cgroup_get_live().
> +		 */
> +		cgroup_get(sock_cgroup_ptr(skcd));
> +		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	}
> +}
> +
>  void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  {
>  	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
>
> +	if (skcd->no_refcnt)
> +		return;
>  	cgroup_bpf_put(cgrp);
>  	cgroup_put(cgrp);
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d832c650287c..2e5b7870e5d3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1926,7 +1926,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		/* sk->sk_memcg will be populated at accept() time */
>  		newsk->sk_memcg = NULL;
>
> -		cgroup_sk_alloc(&newsk->sk_cgrp_data);
> +		cgroup_sk_clone(&newsk->sk_cgrp_data);
>
>  		rcu_read_lock();
>  		filter = rcu_dereference(sk->sk_filter);
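
For readers following the refcounting argument above, the sketch below is a
minimal user-space C model (not kernel code; the names fake_cgroup,
skcd_model and model_sk_alloc/clone/free are invented for illustration) of
the imbalance the commit message describes: a socket cloned while
cgroup_sk_alloc is disabled used to skip the get in the clone path but still
hit the put at free time, and a no_refcnt-style flag keeps the get/put pair
balanced.

/*
 * Illustrative user-space model only -- a simplified sketch, not the
 * kernel's actual cgroup_sk_alloc()/cgroup_sk_clone()/cgroup_sk_free().
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_cgroup {
	int refcnt;
};

struct skcd_model {
	struct fake_cgroup *cgrp;
	bool no_refcnt;		/* models the new skcd->no_refcnt bit */
};

static bool sk_alloc_disabled;	/* models cgroup_sk_alloc_disabled */

/* sk_alloc() path: take a reference unless allocation is disabled. */
static void model_sk_alloc(struct skcd_model *skcd, struct fake_cgroup *cgrp)
{
	skcd->cgrp = cgrp;
	if (sk_alloc_disabled) {
		skcd->no_refcnt = true;	/* remember: no reference was taken */
		return;
	}
	cgrp->refcnt++;
}

/* sk_clone_lock() path: the clone copied skcd, so take another reference. */
static void model_sk_clone(struct skcd_model *newskcd,
			   const struct skcd_model *oldskcd)
{
	*newskcd = *oldskcd;
	if (newskcd->no_refcnt)
		return;		/* parent never took a reference either */
	newskcd->cgrp->refcnt++;
}

/* sk_free() path: only drop what was actually taken. */
static void model_sk_free(struct skcd_model *skcd)
{
	if (skcd->no_refcnt)
		return;
	skcd->cgrp->refcnt--;
}

int main(void)
{
	struct fake_cgroup cgrp = { .refcnt = 1 };	/* base reference */
	struct skcd_model parent = { 0 }, child = { 0 };

	sk_alloc_disabled = true;	/* e.g. net_prio/net_cls in use */
	model_sk_alloc(&parent, &cgrp);
	model_sk_clone(&child, &parent);
	model_sk_free(&child);
	model_sk_free(&parent);

	/* Without the no_refcnt flag the two frees would drive this to -1. */
	assert(cgrp.refcnt == 1);
	printf("refcnt balanced: %d\n", cgrp.refcnt);
	return 0;
}

Removing the no_refcnt checks from the model makes the two frees drive the
modeled refcount negative, which mirrors the underflow being fixed for
sockets cloned while cgroup_sk_alloc is disabled.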