> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>>
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>>
>> hmm. are you saying you want kernel community to work towards
>> making containers (namespaces) being able to run arbitrary code
>> downloaded from the internet?
>
> Yes.
>
> As an example, Sandstorm uses a combination of namespaces (user, network,
> mount, ipc) and a moderately permissive seccomp policy to run arbitrary code.
> Not just little snippets, either -- node.js, Mongo, MySQL, Meteor, and other
> fairly heavyweight stacks can all run under Sandstorm, with the whole stack
> (database engine binaries, etc) supplied by entirely untrusted customers.
> During the time Sandstorm was under active development, I can recall *one*
> bug that would have allowed a sandbox escape. That’s a pretty good track
> record. (Also, Meltdown and Spectre, sigh.)
>
> To be clear, Sandstorm did not allow creation of a userns by the untrusted
> code, and Sandstorm would have heavily restricted bpf(), but that should only
> be necessary because of the possibility of kernel bugs, not because of the
> overall design.
>
> Alexei, I’m trying to encourage you to aim for something even better than you
> have now. Right now, if you grant a user various very strong capabilities,
> that user’s systemd can use bpf network filters. Your proposal would allow
> this with a different, but still very strong, set of capabilities. There’s
> nothing wrong with this per se, but I think you can aim much higher:
>
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over
> the system, *by design*.
> I’m suggesting that you engage the security
> community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for
> something better: make it so that a normal Linux distro would be willing to
> relax its settings enough so that normal users can use bpf filtering in the
> systemd units and maybe eventually use even more bpf() capabilities. And
> let’s make it so that mainstream container managers (that use userns!) will
> be willing (as an option) to delegate bpf() to their containers. We’re happy
> to help design, review, and even write code, but we need you to be willing to
> work with us to make a design that seems like it will work and then to wait
> long enough to merge it for us to think about it, try to poke holes in it,
> and convince ourselves and each other that it has a good chance of being
> sound.
>
> Obviously there will be many cases where an unprivileged program should *not*
> be able to use bpf() IP filtering, but let’s make it so that enabling these
> advanced features does not automatically give away the keys to the kingdom.
>
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand the different perspectives here.

Disclaimer: Alexei and I both work for Facebook. But he may disagree with
everything I am about to say below, because we haven't synced about this
for a while. :)

I think there are two types of use cases here:

1. CAP_BPF_ADMIN: one big key to all of sys_bpf().
2. CAP_BPF: a subset of sys_bpf() that is safe for containers.

IIUC, currently, CAP_BPF_ADMIN is (almost) the same as CAP_SYS_ADMIN, and
there aren't many real-world use cases for CAP_BPF.

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN. On
the other hand, Andy would like to introduce CAP_BPF and build amazing use
cases around it (chicken-egg problem).

Did I misunderstand anything? If not, I think these two use cases do not
really conflict with each other, and we probably need both of them. Then,
the next question is whether we really need both, or either, of them.
Maybe having two separate discussions would make it easier?

The following are some questions I am trying to understand for the two
cases.

For CAP_BPF_ADMIN (or /dev/bpf):
Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and wouldn't
reusing an existing CAP_ be easier than introducing a new one?

For CAP_BPF:
Do we really need it for the containers? Is it possible to implement all
container use cases with SUID? At this moment, I think SUID is the right
way to go for this use case, because this is likely to start with a small
set of functionalities. We can introduce CAP_BPF when the container use
case becomes too complicated for SUID.

I hope some of these questions and thoughts make sense.

Thanks,
Song