On 12/10/2018 06:27 PM, Michael Roth wrote: > Quoting Daniel Borkmann (2018-12-10 08:26:31) >> On 12/07/2018 04:36 PM, Michael Roth wrote: >>> Quoting Michael Ellerman (2018-12-07 06:31:13) >>>> Michael Roth <mdr...@linux.vnet.ibm.com> writes: >>>> >>>>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF >>>>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000, >>>>> and is adjusted again at init time if MODULES_VADDR is defined. >>>>> >>>>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with >>>> >>>> But maybe it should be, I don't know why we don't define it. >>>> >>>>> the compile-time default at boot-time, which is 0x9c400000 when >>>>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit >>>>> value: >>>>> >>>>> root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit >>>>> -1673527296 >>>>> >>>>> and can cause various unexpected failures throughout the network >>>>> stack. In one case `strace dhclient eth0` reported: >>>>> >>>>> setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, >>>>> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524) >>>>> >>>>> and similar failures can be seen with tools like tcpdump. This doesn't >>>>> always reproduce however, and I'm not sure why. The more consistent >>>>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host >>>>> would time out on systemd/netplan configuring a virtio-net NIC with no >>>>> noticeable errors in the logs. >>>>> >>>>> Fix this by limiting the compile-time default for bpf_jit_limit to >>>>> INT_MAX. >>>> >>>> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on >>>> whether we should be using PAGE_SIZE here at all. I guess each BPF >>>> program uses at least one page is the thinking? >>> >>> That seems to be the case, at least, the max number of minimum-sized >>> allocations would be less on ppc64 since the allocations are always at >>> least PAGE_SIZE in size. The init-time default also limits to INT_MAX, >>> so it seemed consistent to do that here too. >>> >>>> >>>> Thanks for tracking this down. For some reason none of my ~10 test boxes >>>> have hit this, perhaps I don't have new enough userspace? >>> >>> I'm not too sure, I would've thought things like the dhclient case in >>> the commit log would fail every time, but sometimes I need to reboot the >>> guest before I start seeing the behavior. Maybe there's something special >>> about when JIT allocations are actually done that can affect >>> reproducibility? >>> >>> In my case at least the virtio-net networking timeout was consistent >>> enough for a bisect, but maybe it depends on the specific network >>> configuration (single NIC, basic DHCP through netplan/systemd in my case). >>> >>>> >>>> You don't mention why you needed to add BPF_MIN(), I assume because the >>>> kernel version of min() has gotten too complicated to work here? >>> >>> I wasn't sure if it was safe here or not, so I tried looking at other >>> users and came across: >>> >>> mm/vmalloc.c:777:#define VMAP_MIN(x, y) ((x) < (y) ? (x) : >>> (y)) /* can't use min() */ >>> >>> I'm not sure what the reasoning was (or whether it still applies), but I >>> figured it was safer to do the same here. Maybe Nick still recalls? >>> >>>> >>>> Daniel I assume you'll merge this via your tree? >>>> >>>> cheers >>>> >>>>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv >>>>> allocations") >>>>> Cc: linuxppc-...@ozlabs.org >>>>> Cc: Daniel Borkmann <dan...@iogearbox.net> >>>>> Cc: Sandipan Das <sandi...@linux.ibm.com> >>>>> Cc: Alexei Starovoitov <a...@kernel.org> >>>>> Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com> >> >> Thanks for the reports / fixes and sorry for my late reply (bit too >> swamped last week), some more thoughts below. >> >>>>> kernel/bpf/core.c | 3 ++- >>>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >>>>> index b1a3545d0ec8..55de4746cdfd 100644 >>>>> --- a/kernel/bpf/core.c >>>>> +++ b/kernel/bpf/core.c >>>>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp) >>>>> } >>>>> >>>>> #ifdef CONFIG_BPF_JIT >>>>> -# define BPF_JIT_LIMIT_DEFAULT (PAGE_SIZE * 40000) >>>>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y)) >>>>> +# define BPF_JIT_LIMIT_DEFAULT BPF_MIN((PAGE_SIZE * 40000), >>>>> INT_MAX) >>>>> >>>>> /* All BPF JIT sysctl knobs here. */ >>>>> int bpf_jit_enable __read_mostly = >>>>> IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON); >> >> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT >> define also given for 4.21 arm64 will have its own dedicated area for >> JIT allocations where neither the above limit nor the MODULES_END/ >> MODULES_VADDR one would fit and I don't want to make this even more >> ugly with adding further cases into the core. Would the below variant >> work for you? > > Looks good to me. My one concern (which is probably a separate > issue) is that the INT_MAX limit is a bit more punishing for larger > page sizes since the minimum allocations seem to be 1 page. Are there > reasonable workloads that could actually push this (INT_MAX >> > 64K_PAGE_SHIFT) limit, or is that pretty generous in practice?
I don't think there are today, but it definitely doesn't hurt to change the interface to be on safe side which I just did for the sent out one. Thanks, Daniel