On Mon, Jun 15, 2020 at 6:06 AM Daniël Sonck <dsonc...@gmail.com> wrote:
>
> On Sun, Jun 14, 2020 at 10:43 PM Daniël Sonck <dsonc...@gmail.com> wrote:
> >
> > Hello,
> >
> > On Sun, Jun 14, 2020 at 8:29 PM Cong Wang <xiyou.wangc...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > On Sun, Jun 14, 2020 at 5:39 AM Daniël Sonck <dsonc...@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I found on the archive that this bug I encountered also happened to
> > > > others. I too have a very similar stacktrace. The issue I'm
> > > > experiencing is:
> > > >
> > > > Whenever I fully boot my cluster, in some time, the host crashes with
> > > > the __cgroup_bpf_run_filter_skb NULL pointer dereference. This has
> > > > been sporadic enough before not to cause real issues. However, as of
> > > > lately, the bug is triggered much more frequently. I've changed my
> > > > server hardware so I could capture serial output in order to get the
> > > > trace. This trace looked very similar as reported by Lu Fengqi. As it
> > > > currently stands, I cannot run the cluster as it's almost instantly
> > > > crashing the host.
> > >
> > > This has been reported for multiple times. Are you able to test the
> > > attached patch? And let me know if everything goes fine with it.
> >
> > I will try out the patch. Since the host reliably crashed each time as
> > I booted up the cluster VMs I will be able to tell whether it has any
> > positive effect.
> > >
> > > I suspect we may still leak some cgroup refcnt even with the patch,
> > > but it might be much harder to trigger with this patch applied.
> >
> > Currently applying the patch to the kernel and compiling so I should
> > know in a few hours
>
> The compilation with the patch has finished and I've since rebooted to the
> new kernel about 12 hours ago, so far this bug did not trigger whereas without
> the patch, by this time it would have triggered. Regardless, I will keep my
> serial connection in case something pops up.

That is great. Please keep it running as this is a race condition which is not easy to trigger reliably. Thanks for testing!