On 12/10/2018 06:27 PM, Michael Roth wrote:
> Quoting Daniel Borkmann (2018-12-10 08:26:31)
>> On 12/07/2018 04:36 PM, Michael Roth wrote:
>>> Quoting Michael Ellerman (2018-12-07 06:31:13)
>>>> Michael Roth <mdr...@linux.vnet.ibm.com> writes:
>>>>
>>>>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>>>>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
>>>>> and is adjusted again at init time if MODULES_VADDR is defined.
>>>>>
>>>>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>>>>
>>>> But maybe it should be, I don't know why we don't define it.
>>>>
>>>>> the compile-time default at boot-time, which is 0x9c400000 when
>>>>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>>>>> value:
>>>>>
>>>>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>>>>>   -1673527296
>>>>>
>>>>> and can cause various unexpected failures throughout the network
>>>>> stack. In one case `strace dhclient eth0` reported:
>>>>>
>>>>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, 
>>>>> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
>>>>>
>>>>> and similar failures can be seen with tools like tcpdump. This doesn't
>>>>> always reproduce however, and I'm not sure why. The more consistent
>>>>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
>>>>> would time out on systemd/netplan configuring a virtio-net NIC with no
>>>>> noticeable errors in the logs.
>>>>>
>>>>> Fix this by limiting the compile-time default for bpf_jit_limit to
>>>>> INT_MAX.
>>>>
>>>> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
>>>> whether we should be using PAGE_SIZE here at all. I guess each BPF
>>>> program uses at least one page is the thinking?
>>>
>>> That seems to be the case, at least, the max number of minimum-sized
>>> allocations would be less on ppc64 since the allocations are always at
>>> least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
>>> so it seemed consistent to do that here too.
>>>
>>>>
>>>> Thanks for tracking this down. For some reason none of my ~10 test boxes
>>>> have hit this, perhaps I don't have new enough userspace?
>>>
>>> I'm not too sure, I would've thought things like the dhclient case in
>>> the commit log would fail every time, but sometimes I need to reboot the
>>> guest before I start seeing the behavior. Maybe there's something special
>>> about when JIT allocations are actually done that can affect
>>> reproducibility?
>>>
>>> In my case at least the virtio-net networking timeout was consistent
>>> enough for a bisect, but maybe it depends on the specific network
>>> configuration (single NIC, basic DHCP through netplan/systemd in my case).
>>>
>>>>
>>>> You don't mention why you needed to add BPF_MIN(), I assume because the
>>>> kernel version of min() has gotten too complicated to work here?
>>>
>>> I wasn't sure if it was safe here or not, so I tried looking at other
>>> users and came across:
>>>
>>> mm/vmalloc.c:777:#define VMAP_MIN(x, y)               ((x) < (y) ? (x) : 
>>> (y)) /* can't use min() */
>>>
>>> I'm not sure what the reasoning was (or whether it still applies), but I
>>> figured it was safer to do the same here. Maybe Nick still recalls?
>>>
>>>>
>>>> Daniel I assume you'll merge this via your tree?
>>>>
>>>> cheers
>>>>
>>>>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
>>>>> allocations")
>>>>> Cc: linuxppc-...@ozlabs.org
>>>>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>>>>> Cc: Sandipan Das <sandi...@linux.ibm.com>
>>>>> Cc: Alexei Starovoitov <a...@kernel.org>
>>>>> Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com>
>>
>> Thanks for the reports / fixes and sorry for my late reply (bit too
>> swamped last week), some more thoughts below.
>>
>>>>>  kernel/bpf/core.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>>> index b1a3545d0ec8..55de4746cdfd 100644
>>>>> --- a/kernel/bpf/core.c
>>>>> +++ b/kernel/bpf/core.c
>>>>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>>>>>  }
>>>>>  
>>>>>  #ifdef CONFIG_BPF_JIT
>>>>> -# define BPF_JIT_LIMIT_DEFAULT       (PAGE_SIZE * 40000)
>>>>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
>>>>> +# define BPF_JIT_LIMIT_DEFAULT       BPF_MIN((PAGE_SIZE * 40000), 
>>>>> INT_MAX)
>>>>>  
>>>>>  /* All BPF JIT sysctl knobs here. */
>>>>>  int bpf_jit_enable   __read_mostly = 
>>>>> IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
>>
>> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
>> define also given for 4.21 arm64 will have its own dedicated area for
>> JIT allocations where neither the above limit nor the MODULES_END/
>> MODULES_VADDR one would fit and I don't want to make this even more
>> ugly with adding further cases into the core. Would the below variant
>> work for you?
> 
> Looks good to me. My one concern (which is probably a separate
> issue) is that the INT_MAX limit is a bit more punishing for larger
> page sizes since the minimum allocations seem to be 1 page. Are there
> reasonable workloads that could actually push this (INT_MAX >>
> 64K_PAGE_SHIFT) limit, or is that pretty generous in practice?

I don't think there are today, but it definitely doesn't hurt to change
the interface to be on safe side which I just did for the sent out one.

Thanks,
Daniel

Reply via email to