> On Jul 23, 2019, at 6:40 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>
>> On Jul 23, 2019, at 3:56 PM, Song Liu <songliubrav...@fb.com> wrote:
>>
>>> On Jul 23, 2019, at 8:11 AM, Andy Lutomirski <l...@kernel.org> wrote:
>>>
>>> On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubrav...@fb.com> wrote:
>>>>
>>>> Hi Andy, Lorenz, and all,
>>>>
>>>>> On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <l...@kernel.org> wrote:
>>>>>
>>>>> On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keesc...@chromium.org> wrote:
>>>>>>
>>>>>>> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
>>>>>>> I think I'm understanding your motivation. You're not trying to make
>>>>>>> bpf() generically usable without privilege -- you're trying to create
>>>>>>> a way to allow certain users to access dangerous bpf functionality
>>>>>>> within some limits.
>>>>>>>
>>>>>>> That's a perfectly fine goal, but I think you're reinventing the
>>>>>>> wheel, and the wheel you're reinventing is quite complicated and
>>>>>>> already exists. I think you should teach bpftool to be secure when
>>>>>>> installed setuid root or with fscaps enabled and put your policy in
>>>>>>> bpftool. If you want to harden this a little bit, it would seem
>>>>>>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
>>>>>>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
>>>>>>> capabilities that they currently check.
>>>>>>
>>>>>> If finer grained controls are wanted, it does seem like the /dev/bpf
>>>>>> path makes the most sense. open, request abilities, use fd. The open can
>>>>>> be mediated by DAC and LSM. The request can be mediated by LSM. This
>>>>>> provides a way to add policy at the LSM level and at the tool level.
>>>>>> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
>>>>>> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
>>>>>> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
>>>>>>
>>>>>> With only a new CAP, you don't get the fine-grained controls. (The
>>>>>> "request abilities" part is the key there.)
>>>>>
>>>>> Sure you do: the effective set. It has somewhat bizarre defaults, but
>>>>> I don't think that's a real problem. Also, this wouldn't be like
>>>>> CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
>>>>>
>>>>> I think that a /dev capability-like object isn't totally nuts, but I
>>>>> think we should do it well, and this patch doesn't really achieve
>>>>> that. But I don't think bpf wants fine-grained controls like this at
>>>>> all -- as I pointed out upthread, a fine-grained solution really wants
>>>>> different treatment for the different capable() checks, and a bunch of
>>>>> them won't resemble capabilities or /dev/bpf at all.
>>>>
>>>> With 5.3-rc1 out, I am back on this. :)
>>>>
>>>> How about we modify the set as:
>>>> 1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
>>>
>>> I'm fine with this in principle, but:
>>>
>>>> 2. Better handling of capable() calls through bpf code. I guess the
>>>> biggest problem here is is_priv in verifier.c:bpf_check().
>>>
>>> I think it would be good to understand exactly what /dev/bpf will
>>> enable one to do. Without some care, it would just become the next
>>> CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
>>> intercept network traffic, modify cgroup behavior, and do plenty of
>>> other things, any of which can probably be used to completely take
>>> over the system.
>>
>> Well, yes. sys_bpf() is pretty powerful.
>>
>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
>> the meanwhile, such users should not take down the whole system easily
>> by accident, e.g., with rm -rf /.
>
> That’s easy, though -- bpftool could learn to read /etc/bpfusers before
> allowing ruid != 0.

This is a great idea! fscaps + /etc/bpfusers should do the trick.

>>
>> It is similar to CAP_BPF_ADMIN, without really adding the CAP_.
>>
>> I think adding new CAP_ requires much more effort.
>>
>
> A new CAP_ is straightforward -- add the definition and change the max cap.
>
>>>
>>> It would also be nice to understand why you can't do what you need to
>>> do entirely in user code using setuid or fscaps.
>>
>> It is not very easy to achieve the same control: only certain users can
>> run certain tools (bpftool, etc.).
>>
>> The closest approach I can find is:
>> 1. use libcap (pam_cap) to give CAP_SETUID to certain users;
>> 2. add setuid(0) to bpftool.
>>
>> The difference between this approach and /dev/bpf is that certain users
>> would be able to run other tools that call setuid(). Though I am not
>> sure how many tools call setuid(), and how risky they are.
>
> I think you’re misunderstanding me. Install bpftool with either the setuid
> (S_ISUID) mode or with an appropriate fscap bit -- see the setcap(8) manpage.
>
> The downside of this approach is that it won’t work well in a container, and
> containers are cool these days :)
>
>>
>>>
>>> Finally, at risk of rehashing some old arguments, I'll point out that
>>> the bpf() syscall is an unusual design to begin with. As an example,
>>> consider bpf_prog_attach(). Outside of bpf(), if I want to change the
>>> behavior of a cgroup, I would write to a file in
>>> /sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
>>> apply. With bpf(), however, I just call bpf() to attach a program to
>>> the cgroup. bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
>>> it!". Unless I missed something major, and I just re-read the code,
>>> there is no check that the caller has write or LSM permission to
>>> anything at all in cgroupfs, and the existing API would make it very
>>> awkward to impose any kind of DAC rules here.
>>>
>>> So I think it might actually be time to repay some technical debt and
>>> come up with a real fix. As a less intrusive approach, you could see
>>> about requiring ownership of the cgroup directory instead of
>>> CAP_NET_ADMIN. As a more intrusive but perhaps better approach, you
>>> could invert the logic to make it work like everything outside of
>>> cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
>>> directories, and require a writable fd to *that* to a new improved
>>> attach API. If a user could do:
>>>
>>> int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR); /* usual
>>> DAC and MAC policy applies */
>>> int bpf_fd = setup the bpf stuff; /* no privilege required, unless
>>> the program is huge or needs is_priv */
>>> bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);
>>>
>>> there would be no capabilities or global privilege at all required for
>>> this. It would just work with cgroup delegation, containers, etc.
>>>
>>> I think you could even pull off this type of API change with only
>>> libbpf changes. In particular, there's this code:
>>>
>>> int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>>>                     unsigned int flags)
>>> {
>>>         union bpf_attr attr;
>>>
>>>         memset(&attr, 0, sizeof(attr));
>>>         attr.target_fd = target_fd;
>>>         attr.attach_bpf_fd = prog_fd;
>>>         attr.attach_type = type;
>>>         attr.attach_flags = flags;
>>>
>>>         return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>>> }
>>>
>>> This would instead do something like:
>>>
>>> int specific_target_fd = openat(target_fd, bpf_type_to_target[type],
>>> O_RDWR);
>>> attr.target_fd = specific_target_fd;
>>> ...
>>>
>>> return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));
>>>
>>> Would this solve your problem without needing /dev/bpf at all?
>>
>> This gives fine grain access control. I think it solves the problem.
>> But it also requires a lot of rework to sys_bpf(). And it may also
>> break backward/forward compatibility?
>>
>
> I think the compatibility issue is manageable. The current bpf() interface
> would be supported for at least several years, and libbpf could detect that
> the new interface isn’t supported and fall back to the old interface.

You are right. A new BPF_PROG_IMPROVED_ATTACH helps with compatibility. I
missed that part.

>
>> Personally, I think it is an overkill for the original motivation:
>> call sys_bpf() with special user instead of root.
>
> It’s overkill for your specific use case, but I’m trying to encourage you to
> either solve your problem entirely in userspace or to solve a more general
> problem in the kernel :)

I do like both proposals. Thanks for these invaluable suggestions.

>
> In furtherance of bpf’s goal of world domination, I think it would be great
> if it Just Worked in a container. My proposal does this.

Let me think more about this and discuss with the team.

Thanks again,
Song
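
P.S. To make the userspace option concrete, here is a minimal sketch of the
/etc/bpfusers check discussed above. It is only an illustration, not bpftool
code. The file format (one numeric uid per line, '#' for comments) and the
helper names bpfusers_allows() and bpf_user_permitted() are assumptions;
nothing in this thread defines them.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical format: one numeric uid per line. Lines that do not start
 * with a digit (comments, blanks, signs) and lines with trailing junk are
 * ignored, so a malformed entry never grants access.
 */
bool bpfusers_allows(FILE *f, uid_t ruid)
{
	char line[64];

	while (fgets(line, sizeof(line), f)) {
		size_t len = strlen(line);
		unsigned long val;
		char *end;

		/* Overlong line: drop the rest of it and skip the entry. */
		if ((len == 0 || line[len - 1] != '\n') && !feof(f)) {
			int c;

			while ((c = fgetc(f)) != EOF && c != '\n')
				;
			continue;
		}
		if (!isdigit((unsigned char)line[0]))
			continue;
		val = strtoul(line, &end, 10);
		if (*end != '\n' && *end != '\0')
			continue;
		/* Compare without truncating val to the width of uid_t. */
		if (val == (unsigned long)ruid)
			return true;
	}
	return false;
}

/* Policy from the thread: ruid 0 always passes, others must be listed. */
bool bpf_user_permitted(const char *path, uid_t ruid)
{
	FILE *f;
	bool ok;

	if (ruid == 0)
		return true;
	f = fopen(path, "r");
	if (!f)
		return false;	/* fail closed if the list is missing */
	ok = bpfusers_allows(f, ruid);
	fclose(f);
	return ok;
}
```

A bpftool installed setuid or with fscaps could call
bpf_user_permitted("/etc/bpfusers", getuid()) early in main() and exit when it
returns false. The list itself would need to be writable only by root for the
check to mean anything.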