On Tue, Mar 13, 2018 at 11:21:08PM -0700, Eric Dumazet wrote:
>
> If I understand well, strace(1) will not show the real (after modification
> by eBPF) IP/port ?

correct. Just like it won't show anything that happens after syscall entry:
whether lsm acted, seccomp, etc.

> What about selinux and other LSM ?

clearly lsm is not the place to do ip/port enforcement for containers.
lsm in general is missing a post-bind lsm hook and visibility into cgroups.
This patch set is not about policy, but more about connectivity.
That's why sockaddr rewrite is a must-have.

> We have now network namespaces for full isolation. Soon ILA will come.

we're already using a form of ila. That's orthogonal to this feature.

> The argument that it is not convenient (or even possible) to change the
> application or using modern isolation is quite strange, considering the

just like in any other datacenter, there are thousands of third party
applications that we cannot control. Including open source code
written by google. Would golang switch to use glibc? I very much doubt it.
Statically linked apps also don't work with LD_PRELOAD.

> added burden/complexity/bloat to the kernel.

bloat? that's very odd to hear. bpf is very much an anti-bloat technique.
If you were serious with that comment, please argue with the tracing folks
who add thousands upon thousands of lines of code to the kernel to do
hard coded things, while bpf already does all that and more
without any extra kernel code.

> The post hook for sys_bind is clearly a failure of the model, since
> releasing the port might already be too late, another thread might fail to
> get it during a non zero time window.

I suspect the commit log wasn't clear. In the post-bind hook we don't release
the port, we only fail sys_bind, and user space will eventually close
the socket and release the port. I don't think it's safe to call
inet_put_port() here. It is also racy, as you pointed out.

> If you want to provide an alternate port allocation strategy, better provide
> a correct eBPF hook.

right. that's separate work, independent from this feature.
port allocation/free from bpf via a helper is also necessary,
but for a different use case.

> It seems this is exactly the case where a netns would be the correct answer.

Unfortunately that's not the case. That's what I tried to explain
in the cover letter:
"The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable."
To elaborate more on that:
netns is l2 isolation.
vrf is l3 isolation.
whereas to containerize an application we need to punch connectivity holes
in these layered techniques.
We also considered resurrecting Hannes's afnetns work
and even went as far as designing a new namespace for L4 isolation.
Unfortunately none of the hierarchical namespace abstractions work.
To run an application that was not written with containers in mind
inside a cgroup container, we have to create the illusion of running
in a non-containerized environment.
In some cases we remember the port and container id in a bpf map
from the post-bind hook, and when some other task in a different container
tries to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind, and this sockaddr rewrite provides
a connectivity and load balancing feature that you simply cannot build
with hierarchical networking primitives.

btw the per-container policy enforcement of ip+port via these hooks
wasn't a feature we planned. It was requested by other folks and
we had to tweak the api a little bit to satisfy both our requirements
and theirs.
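
To make the post-bind/connect bookkeeping above more concrete, here is a rough
userspace model of it. This is not the bpf program from the patch set and it
does not use any kernel or bpf API: the fixed-size table stands in for the
bpf map, and all names (service_entry, post_bind_record, connect_rewrite,
sockaddr_model, backend_ip) are made up for illustration. Keying the table by
port alone and storing a single backend address are assumptions as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: this array stands in for a bpf map. */
#define MAX_SERVICES 16

struct service_entry {
	uint16_t port;         /* port the service bound to */
	uint32_t container_id; /* hypothetical container identifier */
	uint32_t backend_ip;   /* hypothetical: where the service really runs */
	int in_use;
};

/* Simplified stand-in for the sockaddr a connect hook would see. */
struct sockaddr_model {
	uint32_t ip;
	uint16_t port;
};

static struct service_entry services[MAX_SERVICES];

/*
 * Models the post-bind hook: remember which container owns the port.
 * Returns 0 on success, -1 if the table is full. In the model, -1 is
 * the case where the hook would fail sys_bind; nothing is released here.
 */
int post_bind_record(uint16_t port, uint32_t container_id, uint32_t backend_ip)
{
	size_t i;
	struct service_entry *free_slot = NULL;

	for (i = 0; i < MAX_SERVICES; i++) {
		if (services[i].in_use && services[i].port == port) {
			/* Same port seen again: update the existing entry. */
			services[i].container_id = container_id;
			services[i].backend_ip = backend_ip;
			return 0;
		}
		if (!services[i].in_use && !free_slot)
			free_slot = &services[i];
	}
	if (!free_slot)
		return -1;

	*free_slot = (struct service_entry){ port, container_id, backend_ip, 1 };
	return 0;
}

/*
 * Models the connect hook: if the destination port belongs to a recorded
 * service, rewrite the destination address to where it actually runs.
 * Returns 1 if the address was rewritten, 0 if it was left untouched.
 */
int connect_rewrite(struct sockaddr_model *dst)
{
	size_t i;

	for (i = 0; i < MAX_SERVICES; i++) {
		if (services[i].in_use && services[i].port == dst->port) {
			dst->ip = services[i].backend_ip;
			return 1;
		}
	}
	return 0;
}
```

In this model a client that connects to a port nobody recorded keeps its
original destination, while a client that connects to a recorded port gets
redirected without knowing anything about containers. A real program would
also have to distinguish local from remote services and handle entries going
stale when sockets are closed, which this sketch ignores.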