Hi Daniel,

On Fri, Nov 20, 2020 at 11:56 PM Daniel Borkmann <dan...@iogearbox.net> wrote:
>
> On 11/20/20 6:39 PM, Marcel Apfelbaum wrote:
> > +netdev
> >
> > [Sorry for the second email, I failed to set the text-only mode]
> >
> > On Fri, Nov 20, 2020 at 7:30 PM Marcel Apfelbaum <mapfe...@redhat.com>
> > wrote:
> [...]
> >>> ---------- Forwarded message ----------
> >>> From: Jakub Kicinski <k...@kernel.org>
> >>> To: Paolo Abeni <pab...@redhat.com>
> >>> Cc: Saeed Mahameed <sa...@kernel.org>, netdev@vger.kernel.org,
> >>> Jonathan Corbet <cor...@lwn.net>, "David S. Miller" <da...@davemloft.net>,
> >>> Shuah Khan <sh...@kernel.org>, linux-...@vger.kernel.org,
> >>> linux-kselft...@vger.kernel.org, Marcelo Tosatti <mtosa...@redhat.com>,
> >>> Daniel Borkmann <dan...@iogearbox.net>
> >>> Bcc:
> >>> Date: Wed, 4 Nov 2020 11:42:26 -0800
> >>> Subject: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask
> >>>
> >>> On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
> >>>> On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> >>>>> On Tue, 03 Nov 2020 16:22:07 +0100 Paolo Abeni wrote:
> >>>>>> The relevant use case is a host running containers (with the related
> >>>>>> orchestration tools) in an RT environment. Virtual devices (veths, ovs
> >>>>>> ports, etc.) are created by the orchestration tools at run-time.
> >>>>>> Critical processes are allowed to send packets/generate outgoing
> >>>>>> network traffic - but any interrupt is moved away from the related
> >>>>>> cores, so that the usual incoming network traffic processing does not
> >>>>>> happen there.
> >>>>>>
> >>>>>> Still, an xmit operation on a virtual device may be transmitted via ovs
> >>>>>> or veth, with the relevant forwarding operation happening in a softirq
> >>>>>> on the same CPU originating the packet.
> >>>>>>
> >>>>>> RPS is configured (even) on such virtual devices to move the
> >>>>>> forwarding away from the relevant CPUs.
> >>>>>>
> >>>>>> As Saeed noted, such configuration could possibly be performed via some
> >>>>>> user-space daemon monitoring network device and network namespace
> >>>>>> creation. That will anyway be prone to a race: the orchestration tool
> >>>>>> may create and enable the netns and virtual devices before the daemon
> >>>>>> has properly set the RPS mask.
> >>>>>>
> >>>>>> In the latter scenario some packet forwarding could still slip onto the
> >>>>>> relevant CPU, causing measurable latency. In all non-RT scenarios the
> >>>>>> above will likely be irrelevant, but in the RT context it is not
> >>>>>> acceptable - e.g. in real environments it causes latency above the
> >>>>>> defined limits, while the proposed patches avoid the issue.
> >>>>>>
> >>>>>> Do you see any other simple way to avoid the above race?
> >>>>>>
> >>>>>> Please let me know if the above answers your doubts.
> >>>>>
> >>>>> Thanks, that makes it clearer now.
> >>>>>
> >>>>> Depending on how RT-aware your container management is, it may or may
> >>>>> not be the right place to configure this, as it creates the veth
> >>>>> interface. Presumably it's the container management which does the
> >>>>> placement of the tasks to cores, so why is it not setting other
> >>>>> attributes, like RPS?
> >>
> >> The CPU isolation is done statically at system boot by setting Linux
> >> kernel parameters, so the container management component - in this case
> >> the Machine Configuration Operator (for OpenShift) or the K8s
> >> counterpart - can't really help. (Actually it could help if a global
> >> RPS mask existed.)
> >>
> >> I tried to tweak the rps_cpus mask using the container management stack,
> >> but there is no sane way to do it; please let me get a little into the
> >> details.
> >>
> >> The k8s orchestration component that deals with injecting the network
> >> device(s) into the container is CNI, which is interface based and
> >> implemented by a lot of plugins, making it hardly feasible to go over
> >> all the existing plugins and change them. Also, what about the 3rd
> >> party ones?
> >>
> >> Writing a new CNI plugin and chaining it into the existing one is also
> >> not an option AFAIK: the plugins work at the network level and do not
> >> have access to sysfs (they handle the network namespaces). Even if it
> >> were possible (I don't have a deep CNI understanding), it would require
> >> a cluster-global configuration that is actually needed only on some of
> >> the cluster nodes.
>
> CNI chaining would be ugly, agree, but in a typical setting you'd have the
> CNI plugin itself which is responsible for setting up the Pod for
> communication to the outside world; part of it would be creation of devices
> and moving them into the target netns, and then you also typically have an
> agent running in the kube-system namespace in the hostns to which the CNI
> plugin talks via IPC, e.g. to set up IPAM and other state. Such an agent
> usually sets up all sorts of knobs for the networking layer upon bootstrap.
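For reference, this is roughly what such a host-side agent (or daemon) would
have to do for every freshly created virtual device - a minimal sketch only,
assuming the standard per-rx-queue rps_cpus sysfs knob; the device name and
the mask below are just examples:

#!/usr/bin/env python3
# Minimal sketch (not part of the proposed patches): apply an RPS mask to
# every rx queue of a given device by writing the standard per-queue
# /sys/class/net/<dev>/queues/rx-*/rps_cpus knob. Needs root on the host.
import glob
import sys

EXAMPLE_MASK = "f0"  # example hex mask: steer RPS to CPUs 4-7, keep 0-3 quiet

def set_rps_mask(dev: str, mask: str = EXAMPLE_MASK) -> None:
    """Write `mask` into all rx-*/rps_cpus files of `dev`."""
    queues = glob.glob(f"/sys/class/net/{dev}/queues/rx-*/rps_cpus")
    if not queues:
        raise FileNotFoundError(f"no rx queues found for {dev}")
    for path in queues:
        with open(path, "w") as f:
            f.write(mask)

if __name__ == "__main__":
    # e.g. invoked by the agent right after the CNI plugin created "veth0";
    # any packet forwarded before this write still lands on the isolated
    # CPUs, which is exactly the race described above.
    set_rps_mask(sys.argv[1] if len(sys.argv) > 1 else "veth0")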
The main issue is that CNI is networking related, but the way to set RPS is
by writing to /sys, which is not considered network-namespace related and is
read-only inside the containers.

> Assuming you have a cluster where only some of the nodes have an RT kernel,
> these would likely have special node annotations in K8s so you could select
> them to run certain workloads on them.. couldn't such an agent be taught to
> be RT-aware and set up all the needed knobs?

I do agree this part may be doable; sadly it is by far not the biggest problem.

> Agree it's a bit ugly to change the relevant CNI plugins to be RT-aware,
> but what if you also need other settings in future aside from RPS mask for
> RT? At some point you'd likely end up having to extend these anyway, no?

All networking changes are fair play, however setting the RPS mask is related
to networking but is not a networking operation per se - it is a cross-domain
operation (network namespace/mount namespace).

Thank you for your response,
Marcel

[...]
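P.S. For completeness, the contrast with what this series proposes: with a
global default the RT nodes would set the mask exactly once at boot, before
the orchestration tools create any virtual device, so nothing has to chase
device creation afterwards. A minimal sketch, assuming the rps_default_mask
sysctl from the patches is exposed under /proc/sys/net/core/ and using an
example mask:

#!/usr/bin/env python3
# Sketch only: write the global default RPS mask once at boot, before any
# veth/ovs device exists; devices created later inherit it, closing the race
# window. The path assumes the net.core.rps_default_mask sysctl proposed in
# this series; "f0" (CPUs 4-7) is just an example mask.
with open("/proc/sys/net/core/rps_default_mask", "w") as f:
    f.write("f0")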