> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> 
> 
> 
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov 
>> <alexei.starovoi...@gmail.com> wrote:
> 
> 
>>> 
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>> 
>> Hmm. Are you saying you want the kernel community to work towards
>> making containers (namespaces) able to run arbitrary code
>> downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, 
> mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. 
> Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other 
> fairly heavyweight stacks can all run under Sandstorm, with the whole stack 
> (database engine binaries, etc) supplied by entirely untrusted customers.  
> During the time Sandstorm was under active development, I can recall *one* 
> bug that would have allowed a sandbox escape. That’s a pretty good track 
> record.  (Also, Meltdown and Spectre, sigh.)
> 
> To be clear, Sandstorm did not allow creation of a userns by the untrusted 
> code, and Sandstorm would have heavily restricted bpf(), but that should only 
> be necessary because of the possibility of kernel bugs, not because of the 
> overall design.
> 
> Alexei, I’m trying to encourage you to aim for something even better than you 
> have now. Right now, if you grant a user various very strong capabilities, 
> that user’s systemd can use bpf network filters.  Your proposal would allow 
> this with a different, but still very strong, set of capabilities. There’s 
> nothing wrong with this per se, but I think you can aim much higher:
> 
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over 
> the system, *by design*.  I’m suggesting that you engage the security 
> community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for 
> something better: make it so that a normal Linux distro would be willing to 
> relax its settings enough so that normal users can use bpf filtering in the 
> systemd units and maybe eventually use even more bpf() capabilities. And 
> let’s make it so that mainstream container managers (that use userns!) will 
> be willing (as an option) to delegate bpf() to their containers. We’re happy 
> to help design, review, and even write code, but we need you to be willing to 
> work with us to make a design that seems like it will work and then to wait 
> long enough to merge it for us to think about it, try to poke holes in it, 
> and convince ourselves and each other that it has a good chance of being 
> sound.
> 
> Obviously there will be many cases where an unprivileged program should *not* 
> be able to use bpf() IP filtering, but let’s make it so that enabling these 
> advanced features does not automatically give away the keys to the kingdom.
> 
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand different perspectives here. 

Disclaimer: Alexei and I both work for Facebook. But he may disagree 
with everything I am about to say below, because we haven't sync'ed 
about this for a while. :)

I think there are two types of use cases here: 

    1. CAP_BPF_ADMIN: one big key to all sys_bpf(). 
    2. CAP_BPF: subset of sys_bpf() that is safe for containers.

IIUC, CAP_BPF_ADMIN is currently (almost) the same as CAP_SYS_ADMIN,
and there aren't many real-world use cases for CAP_BPF yet.

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN.
On the other hand, Andy would like to introduce CAP_BPF and build
amazing use cases around it (chicken-egg problem). 

Did I misunderstand anything?

If not, I think these two use cases do not really conflict with each
other, and we probably need both of them. The next question, then, is
whether we really need both, or either, of them. Maybe having two
separate discussions would make it easier?


Here are some questions I am trying to answer for the two cases.

For CAP_BPF_ADMIN (or /dev/bpf):
Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and
reusing an existing CAP_* should be easier than introducing a new one.
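
To make the CAP_NET_ADMIN vs. CAP_SYS_ADMIN comparison a bit more
concrete, here is a minimal sketch (not part of any patch) that reports
which of the two a process currently holds in its effective set, by
parsing the CapEff mask from /proc/<pid>/status. The bit numbers come
from <linux/capability.h>; the helper names are mine and purely
illustrative.

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text):
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal number.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if capability bit `cap` is set in `mask`."""
    return bool(mask & (1 << cap))


def read_own_mask(path="/proc/self/status"):
    """Read the calling process's effective capability mask (Linux only)."""
    with open(path) as f:
        return parse_cap_eff(f.read())
```

On a Linux box, has_cap(read_own_mask(), CAP_NET_ADMIN) tells you
whether the caller would pass a CAP_NET_ADMIN check without also
holding the much broader CAP_SYS_ADMIN.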

For CAP_BPF: 
Do we really need it for containers? Is it possible to implement
all container use cases with SUID? At this moment, I think SUID is
the right way to go for this use case, because it is likely to
start with a small set of functionality. We can introduce CAP_BPF
when the container use case becomes too complicated for SUID.


I hope some of these questions and thoughts make sense.

Thanks,
Song
