On 15/08/2016 05:09, Sargun Dhillon wrote:
> On Mon, Aug 15, 2016 at 12:57:44AM +0200, Mickaël Salaün wrote:
>> Our approaches have some common points (i.e. use eBPF in an LSM, stacked
>> filters like seccomp) but I'm focused on a kind of unprivileged LSM (i.e. no
>> CAP_SYS_ADMIN), to make standalone sandboxes, which brings more constraints
>> (e.g. no use of unsafe functions like bpf_probe_read(), take care of
>> privacy, SUID exec, stable ABI…). However, I don't want to handle resource
>> limits, which should be the job of cgroups.
>>
> Kind of. Sometimes describing these resource limits is difficult. For
> example, I have a customer who is trying to restrict containers from burning
> up all the ephemeral ports on the machine. In this, they have an incredibly
> elaborate chain of wiring to prevent a given container from connecting to
> the same (proto, destip, destport) more than 1000 times.
>
> I'm unsure of how you'd model that in a cgroup.

This looks like a Netfilter rule. Have you tried applying this limitation
with the connlimit module?

>
>> For now, I'm focusing on file-system access control which is one of the more
>> complex system to properly filter. I also plan to support basic network
>> access control.
>>
>> What you are trying to accomplish seems more related to a Netfilter
>> extension (something like ipset but with eBPF maybe?).
>>
> I don't only want to do network access control, I also want to write to the
> value once it's copied into kernel space. There are lot of benefits of doing
> this at the syscall level, but the two primary ones are performance, and
> capability.
>
> One of the biggest complaints with our current approach to filtering & load
> balancing (iptables) is that it hides information. When people connect
> through the load balancer, they want to find out who they connected to, and
> without some high application-level mechanism, this isn't possible. On the
> other hand, if we just rewrite the destination address in the connect hook,
> we can pretty easily allow them to do getpeername.

What exactly is not doable with Netfilter (e.g. REDIRECT or TPROXY)?

>
> I'm curious about your filesystem access limiter. Do you have a way to make
> it so that a given container can only write, say, 100mb of data to disk?

It's a filesystem access control. It doesn't deal with quotas and is not
focused on containers but on process hierarchies (which is more generic).
What is not doable with a quota mount option? It may be more appropriate to
enhance the VFS (or overlayfs) to apply this kind of limitation, if needed.
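
To make the limit discussed above concrete, here is a minimal user-space sketch, in Python, of the counting policy Sargun describes: at most N connections from one container to the same (proto, destip, destport). It is only a toy model of the bookkeeping, not Netfilter, connlimit, or eBPF code. The names ConnLimiter, try_connect and release are hypothetical, and it assumes the cap applies to simultaneously open connections, which the thread does not state explicitly.

```python
from collections import Counter


class ConnLimiter:
    """Toy model: cap open connections per (container, proto, dest_ip, dest_port)."""

    def __init__(self, limit=1000):
        # Assumption: the limit counts connections that are open at the same time.
        self.limit = limit
        self.active = Counter()

    def try_connect(self, container, proto, dest_ip, dest_port):
        """Return True and count the connection if it is under the cap."""
        key = (container, proto, dest_ip, dest_port)
        if self.active[key] >= self.limit:
            return False  # this container already holds `limit` connections to the target
        self.active[key] += 1
        return True

    def release(self, container, proto, dest_ip, dest_port):
        """Forget one connection when it is closed."""
        key = (container, proto, dest_ip, dest_port)
        if self.active[key] > 0:
            self.active[key] -= 1
```

The key includes the container, so two containers connecting to the same destination are counted separately, which is the per-container behaviour described in the quoted text.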