On 10/21/15 at 05:17pm, Daniel Borkmann wrote:
> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
> >Just FYI: Using a device for this kind of interface is pretty
> >much a non-starter as that quickly gets you into situations where
> >things do not work in containers. If someone gets a version of device
> >namespaces past GregKH it might be up for discussion to use character
> >devices.
>
> Okay, you are referring to this discussion here:
>
>   http://thread.gmane.org/gmane.linux.kernel.containers/26760
>
> What had been mentioned earlier in this thread was to have a namespace
> pass-through facility enforced by device cgroups we have in the kernel,
> which is one out of various means used to enforce policy today by
> deployment systems such as docker, for example. But more below.
>
> I think this all depends on the kind of expectations we have, where all
> this is going. In the original proposal, it was agreed to have the
> operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
> way like most of the rest of eBPF is restricted), and based on the use
> case we distribute such objects to unprivileged applications. But I
> understand that it seems the trend lately to lift eBPF restrictions at
> some point anyway, and thus the CAP_SYS_ADMIN is suddenly irrelevant
> again. Fair enough.
>
> Don't get me wrong, I really don't mind if it will be some version of
> this fs patch or whatever architecture else we find consensus on, I
> think this discussion is merely trying to evaluate/discuss on what seems
> to be a good fit, also in terms of future requirements and integration.
>
> So far, during this discussion, it was proposed to modify the file system
> to a single-mount one and to stick this under /sys/kernel/bpf/. This
> will not have "real" namespace support either, but it was proposed to
> have a following structure:
>
>   /sys/kernel/bpf/username/<optional_dirs_mkdir_by_user>/progX

This would probably work, as you would typically map the eBPF map using
-v like this to give a stable path:

  docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX

> So, the file system will have kind of a user home-directory for each user
> to isolate through permissions, if I understood correctly.
>
> If we really want to go this route, then I think there are no big stones
> in the way for the other model either. It should look roughly drafted like
> the below.
>
> Together with device cgroups for containers, it would allow scenarios where
> you can have:
>
>  * eBPF (map/prog) device pass-through so a map/prog could even be shared out
>    from the initial namespace into individual ones/all (one could possibly
>    extend such maps as read-only for these consumers).
>  * eBPF device creation for unprivileged users with permissions being set
>    accordingly (as in fs case).
>  * Since cgroup controller can also do wildcards on major/minors, we could
>    make that further fine-grained.
>  * eBPF device creation can also be enforced by the cgroup controller to be
>    entirely disallowed for a specific container.
>
> (An admin can determine the dynamically created major f.e. under
> /proc/devices.)

I've read the discussion passively and my takeaway is that, frankly,
I think the differences are somewhat minor. Both architectures can
scale to what we need. Both will do the job. I'm slightly worried about
exposing uAPI as a FS; I think that didn't work too well for sysfs. It's
pretty much a define-the-format-once-and-never-touch-it-again kind of
deal.
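
For illustration only, the /proc/devices lookup and the cgroup wildcard
idea from the quoted text could be sketched roughly as below. This is a
sketch under assumptions, not anything from the proposed patches: the
device name "bpf", the major number 250, and both helper names are
hypothetical. The rule string follows the "type major:minor access"
form that the cgroup v1 devices controller accepts in devices.allow and
devices.deny.

```python
# Hypothetical sample of /proc/devices content; "bpf" and 250 are made up.
SAMPLE_PROC_DEVICES = """Character devices:
  1 mem
  4 tty
250 bpf

Block devices:
  8 sd
250 fakeblk
"""


def find_char_major(proc_devices_text, name):
    """Return the major number listed for character device `name`, or None."""
    in_char_section = False
    for line in proc_devices_text.splitlines():
        line = line.strip()
        if line == "Character devices:":
            in_char_section = True
            continue
        if line == "Block devices:":
            in_char_section = False
            continue
        if not in_char_section or not line:
            continue
        # Each entry is "<major> <name>".
        major, _, dev_name = line.partition(" ")
        if dev_name.strip() == name and major.isdigit():
            return int(major)
    return None


def devices_rule(major, minor="*", access="r"):
    """Format a devices-controller rule for a character device.

    minor="*" is the wildcard over all minors; access is any of r, w, m.
    """
    return f"c {major}:{minor} {access}"


if __name__ == "__main__":
    major = find_char_major(SAMPLE_PROC_DEVICES, "bpf")
    if major is not None:
        # Read-only pass-through of every minor under that major.
        print(devices_rule(major))
```

On a real system the text would come from reading /proc/devices, and
the resulting rule would be written to the container's devices.allow
(or devices.deny, to forbid the device entirely for that container).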