On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortunately that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately all hierarchical namespace abstractions don't work.
> To run an application inside cgroup container that was not written
> with containers in mind we have to make an illusion of running
> in non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite is providing
> connectivity and load balancing feature that you simply cannot do
> with hierarchical networking primitives.

have to explain this a bit further...

We also considered hacking these 'connectivity holes' into netns and/or vrf,
but that would be a real layering violation, since a clean l2/l3 abstraction
would suddenly support something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches through the layers
and connects the l2 abstraction of netns at the l3 layer, this is not
something the kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.

Instead, bpf programs at bind/connect time _help_ applications discover
and connect to each other. All containers are running in init_netns and
there are no vrfs. After bind/connect the normal fib/neighbor core
networking logic works as it always should.
The whole system is clean from a network point of view.
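
To make the bind/connect idea a bit more concrete, here is a rough
user-space model in Python. It is not the bpf code from the patch set,
only a sketch of the bookkeeping described above: the post-bind step
remembers port and container id, and the connect step rewrites the
destination sockaddr to wherever the service actually runs. The names
ServiceMap, post_bind, connect and the container_ips table are made up
for illustration, and round-robin is just one assumed way to get the
load balancing.

```python
from typing import Dict, List, Tuple


class ServiceMap:
    """Toy stand-in for the bpf map shared by the bind/connect hooks.

    Everything here is hypothetical and runs in user space; it only
    models the lookup logic, not the kernel hooks themselves.
    """

    def __init__(self, container_ips: Dict[int, str]) -> None:
        # Assumed per-container IP assignment (container id -> IP).
        self.container_ips = container_ips
        # port -> container ids that have bound that port.
        self.bound: Dict[int, List[int]] = {}
        # port -> index of the next backend to hand out (round-robin).
        self.next_idx: Dict[int, int] = {}

    def post_bind(self, container_id: int, port: int) -> None:
        """Model of the post-bind hook: remember port and container id."""
        self.bound.setdefault(port, []).append(container_id)

    def connect(self, dst_ip: str, dst_port: int) -> Tuple[str, int]:
        """Model of the connect hook: rewrite the destination sockaddr.

        If no container has registered the port, the address is left
        untouched and normal routing applies (e.g. a remote service).
        """
        backends = self.bound.get(dst_port)
        if not backends:
            return dst_ip, dst_port
        idx = self.next_idx.get(dst_port, 0)
        self.next_idx[dst_port] = (idx + 1) % len(backends)
        return self.container_ips[backends[idx]], dst_port


if __name__ == "__main__":
    m = ServiceMap({1: "10.0.0.1", 2: "10.0.0.2"})
    m.post_bind(1, 8080)
    m.post_bind(2, 8080)
    # A client that only knows "localhost:8080" gets spread over both
    # containers; an unknown port passes through unchanged.
    print(m.connect("127.0.0.1", 8080))
    print(m.connect("127.0.0.1", 8080))
    print(m.connect("192.0.2.7", 443))
```

Note that nothing in the model touches routing. Only the address
handed to bind/connect changes, and that is why the fib/neighbor logic
can stay as it is.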