> On Jul 23, 2019, at 6:40 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> 
> 
> 
>> On Jul 23, 2019, at 3:56 PM, Song Liu <songliubrav...@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 23, 2019, at 8:11 AM, Andy Lutomirski <l...@kernel.org> wrote:
>>> 
>>> On Mon, Jul 22, 2019 at 1:54 PM Song Liu <songliubrav...@fb.com> wrote:
>>>> 
>>>> Hi Andy, Lorenz, and all,
>>>> 
>>>>> On Jul 2, 2019, at 2:32 PM, Andy Lutomirski <l...@kernel.org> wrote:
>>>>> 
>>>>> On Tue, Jul 2, 2019 at 2:04 PM Kees Cook <keesc...@chromium.org> wrote:
>>>>>> 
>>>>>>> On Mon, Jul 01, 2019 at 06:59:13PM -0700, Andy Lutomirski wrote:
>>>>>>> I think I'm understanding your motivation.  You're not trying to make
>>>>>>> bpf() generically usable without privilege -- you're trying to create
>>>>>>> a way to allow certain users to access dangerous bpf functionality
>>>>>>> within some limits.
>>>>>>> 
>>>>>>> That's a perfectly fine goal, but I think you're reinventing the
>>>>>>> wheel, and the wheel you're reinventing is quite complicated and
>>>>>>> already exists.  I think you should teach bpftool to be secure when
>>>>>>> installed setuid root or with fscaps enabled and put your policy in
>>>>>>> bpftool.  If you want to harden this a little bit, it would seem
>>>>>>> entirely reasonable to add a new CAP_BPF_ADMIN and change some, but
>>>>>>> not all, of the capable() checks to check CAP_BPF_ADMIN instead of the
>>>>>>> capabilities that they currently check.
>>>>>> 
>>>>>> If finer grained controls are wanted, it does seem like the /dev/bpf
>>>>>> path makes the most sense. open, request abilities, use fd. The open can
>>>>>> be mediated by DAC and LSM. The request can be mediated by LSM. This
>>>>>> provides a way to add policy at the LSM level and at the tool level.
>>>>>> (i.e. For tool-level controls: leave LSM wide open, make /dev/bpf owned
>>>>>> by "bpfadmin" and bpftool becomes setuid "bpfadmin". For fine-grained
>>>>>> controls, leave /dev/bpf wide open and add policy to SELinux, etc.)
>>>>>> 
>>>>>> With only a new CAP, you don't get the fine-grained controls. (The
>>>>>> "request abilities" part is the key there.)
>>>>> 
>>>>> Sure you do: the effective set.  It has somewhat bizarre defaults, but
>>>>> I don't think that's a real problem.  Also, this wouldn't be like
>>>>> CAP_DAC_READ_SEARCH -- you can't accidentally use your BPF caps.
>>>>> 
>>>>> I think that a /dev capability-like object isn't totally nuts, but I
>>>>> think we should do it well, and this patch doesn't really achieve
>>>>> that.  But I don't think bpf wants fine-grained controls like this at
>>>>> all -- as I pointed upthread, a fine-grained solution really wants
>>>>> different treatment for the different capable() checks, and a bunch of
>>>>> them won't resemble capabilities or /dev/bpf at all.
>>>> 
>>>> With 5.3-rc1 out, I am back on this. :)
>>>> 
>>>> How about we modify the set as:
>>>> 1. Introduce sys_bpf_with_cap() that takes fd of /dev/bpf.
>>> 
>>> I'm fine with this in principle, but:
>>> 
>>>> 2. Better handling of capable() calls through bpf code. I guess the
>>>>   biggest problem here is is_priv in verifier.c:bpf_check().
>>> 
>>> I think it would be good to understand exactly what /dev/bpf will
>>> enable one to do.  Without some care, it would just become the next
>>> CAP_SYS_ADMIN: if you can open it, sure, you're not root, but you can
>>> intercept network traffic, modify cgroup behavior, and do plenty of
>>> other things, any of which can probably be used to completely take
>>> over the system.
>> 
>> Well, yes. sys_bpf() is pretty powerful. 
>> 
>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
>> the meanwhile, such users should not take down the whole system easily
>> by accident, e.g., with rm -rf /.
> 
> That’s easy, though — bpftool could learn to read /etc/bpfusers before 
> allowing ruid != 0.

This is a great idea! fscaps + /etc/bpfusers should do the trick. 

> 
>> 
>> It is similar to CAP_BPF_ADMIN, without really adding the CAP_.  
>> 
>> I think adding new CAP_ requires much more effort. 
>> 
> 
> A new CAP_ is straightforward — add the definition and change the max cap.
> 
>>> 
>>> It would also be nice to understand why you can't do what you need to
>>> do entirely in user code using setuid or fscaps.
>> 
>> It is not very easy to achieve the same control: only certain users can
>> run certain tools (bpftool, etc.). 
>> 
>> The closest approach I can find is:
>> 1. use libcap (pam_cap) to give CAP_SETUID to certain users;
>> 2. add setuid(0) to bpftool.
>> 
>> The difference between this approach and /dev/bpf is that certain users
>> would be able to run other tools that call setuid(). Though I am not 
>> sure how many tools call setuid(), and how risky they are. 
> 
> I think you’re misunderstanding me. Install bpftool with either the setuid 
> (S_ISUID) mode or with an appropriate fscap bit — see the setcap(8) manpage.
> 
> The downside of this approach is that it won’t work well in a container, and 
> containers are cool these days :)
> 
>> 
>>> 
>>> Finally, at risk of rehashing some old arguments, I'll point out that
>>> the bpf() syscall is an unusual design to begin with.  As an example,
>>> consider bpf_prog_attach().  Outside of bpf(), if I want to change the
>>> behavior of a cgroup, I would write to a file in
>>> /sys/kernel/cgroup/unified/whatever/, and normal DAC and MAC rules
>>> apply.  With bpf(), however, I just call bpf() to attach a program to
>>> the cgroup.  bpf() says "oh, you are capable(CAP_NET_ADMIN) -- go for
>>> it!".  Unless I missed something major, and I just re-read the code,
>>> there is no check that the caller has write or LSM permission to
>>> anything at all in cgroupfs, and the existing API would make it very
>>> awkward to impose any kind of DAC rules here.
>>> 
>>> So I think it might actually be time to repay some techincal debt and
>>> come up with a real fix.  As a less intrusive approach, you could see
>>> about requiring ownership of the cgroup directory instead of
>>> CAP_NET_ADMIN.  As a more intrusive but perhaps better approach, you
>>> could invert the logic to to make it work like everything outside of
>>> cgroup: add pseudo-files like bpf.inet_ingress to the cgroup
>>> directories, and require a writable fd to *that* to a new improved
>>> attach API.  If a user could do:
>>> 
>>> int fd = open("/sys/fs/cgroup/.../bpf.inet_attach", O_RDWR);  /* usual
>>> DAC and MAC policy applies */
>>> int bpf_fd = setup the bpf stuff;  /* no privilege required, unless
>>> the program is huge or needs is_priv */
>>> bpf(BPF_IMPROVED_ATTACH, target = fd, program = bpf_fd);
>>> 
>>> there would be no capabilities or global privilege at all required for
>>> this.  It would just work with cgroup delegation, containers, etc.
>>> 
>>> I think you could even pull off this type of API change with only
>>> libbpf changes.  In particular, there's this code:
>>> 
>>> int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>>>                  unsigned int flags)
>>> {
>>>      union bpf_attr attr;
>>> 
>>>      memset(&attr, 0, sizeof(attr));
>>>      attr.target_fd     = target_fd;
>>>      attr.attach_bpf_fd = prog_fd;
>>>      attr.attach_type   = type;
>>>      attr.attach_flags  = flags;
>>> 
>>>      return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>>> }
>>> 
>>> This would instead do something like:
>>> 
>>> int specific_target_fd = openat(target_fd, bpf_type_to_target[type], 
>>> O_RDWR);
>>> attr.target_fd = specific_target_fd;
>>> ...
>>> 
>>> return sys_bpf(BPF_PROG_IMPROVED_ATTACH, &attr, sizeof(attr));
>>> 
>>> Would this solve your problem without needing /dev/bpf at all?
>> 
>> This gives fine grain access control. I think it solves the problem. 
>> But it also requires a lot of rework to sys_bpf(). And it may also 
>> break backward/forward compatibility?
>> 
> 
> I think the compatibility issue is manageable. The current bpf() interface 
> would be supported for at least several years, and libbpf could detect that 
> the new interface isn’t supported and fall back the old interface

You are right. New BPF_PROG_IMPROVED_ATTACH helps compatibility. 
I missed that part. 

> 
>> Personally, I think it is an overkill for the original motivation: 
>> call sys_bpf() with special user instead of root. 
> 
> It’s overkill for your specific use case, but I’m trying to encourage you to 
> either solve your problem entirely in userspace or to solve a more general 
> problem in the kernel :)

I do like both proposals. Thanks for these invaluable suggestions. 

> 
> In furtherance of bpf’s goal of world domination, I think it would be great 
> if it Just Worked in a container. My proposal does this.

Let me think more about this and discuss with the team. 

Thanks again, 
Song

Reply via email to