Hi Daniel,

On Fri, Nov 20, 2020 at 11:56 PM Daniel Borkmann <dan...@iogearbox.net> wrote:
>
> On 11/20/20 6:39 PM, Marcel Apfelbaum wrote:
> > +netdev
> > [Sorry for the second email, I failed to set the text-only mode]
> > On Fri, Nov 20, 2020 at 7:30 PM Marcel Apfelbaum <mapfe...@redhat.com> 
> > wrote:
> [...]
> >>> ---------- Forwarded message ----------
> >>> From: Jakub Kicinski <k...@kernel.org>
> >>> To: Paolo Abeni <pab...@redhat.com>
> >>> Cc: Saeed Mahameed <sa...@kernel.org>, netdev@vger.kernel.org, Jonathan 
> >>> Corbet <cor...@lwn.net>, "David S. Miller" <da...@davemloft.net>, Shuah 
> >>> Khan <sh...@kernel.org>, linux-...@vger.kernel.org, 
> >>> linux-kselft...@vger.kernel.org, Marcelo Tosatti <mtosa...@redhat.com>, 
> >>> Daniel Borkmann <dan...@iogearbox.net>
> >>> Bcc:
> >>> Date: Wed, 4 Nov 2020 11:42:26 -0800
> >>> Subject: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask
> >>> On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
> >>>> On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> >>>>> On Tue, 03 Nov 2020 16:22:07 +0100 Paolo Abeni wrote:
> >>>>>> The relevant use case is a host running containers (with the related
> >>>>>> orchestration tools) in an RT environment. Virtual devices (veths, ovs
> >>>>>> ports, etc.) are created by the orchestration tools at run-time.
> >>>>>> Critical processes are allowed to send packets/generate outgoing
> >>>>>> network traffic - but any interrupt is moved away from the related
> >>>>>> cores, so that usual incoming network traffic processing does not
> >>>>>> happen there.
> >>>>>>
> >>>>>> Still, an xmit operation on a virtual device may be transmitted via ovs
> >>>>>> or veth, with the relevant forwarding operation happening in a softirq
> >>>>>> on the same CPU originating the packet.
> >>>>>>
> >>>>>> RPS is configured (even) on such virtual devices to move away the
> >>>>>> forwarding from the relevant CPUs.
> >>>>>>
> >>>>>> As Saeed noted, such configuration could possibly be performed via some
> >>>>>> user-space daemon monitoring network device and network namespace
> >>>>>> creation. That will anyway be prone to some race: the orchestration tool
> >>>>>> may create and enable the netns and virtual devices before the daemon
> >>>>>> has properly set the RPS mask.
> >>>>>>
> >>>>>> In the latter scenario some packet forwarding could still slip onto the
> >>>>>> relevant CPU, causing measurable latency. In all non-RT scenarios the
> >>>>>> above will likely be irrelevant, but in the RT context that is not
> >>>>>> acceptable - e.g. in real environments it causes latency above the
> >>>>>> defined limits, while the proposed patches avoid the issue.
> >>>>>>
> >>>>>> Do you see any other simple way to avoid the above race?
> >>>>>>
> >>>>>> Please let me know if the above answers your doubts,
> >>>>>
> >>>>> Thanks, that makes it clearer now.
> >>>>>
> >>>>> Depending on how RT-aware your container management is, it may or may not
> >>>>> be the right place to configure this, as it creates the veth interface.
> >>>>> Presumably it's the container management which does the placement of
> >>>>> the tasks to cores; why is it not setting other attributes, like RPS?
> >>
> >> The CPU isolation is done statically at system boot by setting Linux
> >> kernel parameters, so the container management component, in this case
> >> the Machine Configuration Operator (for OpenShift) or the K8s
> >> counterpart, can't really help. (Actually, they would help if a global
> >> RPS mask existed.)
> >>
> >> I tried to tweak the rps_cpus mask using the container management
> >> stack, but there is no sane way to do it; let me go into the details a
> >> little.
> >>
> >> The k8s orchestration component that deals with injecting the network
> >> device(s) into the container is CNI, which is interface-based and
> >> implemented by a lot of plugins, making it hardly feasible to go over
> >> all the existing plugins and change them. And what about the 3rd-party
> >> ones?
> >>
> >> Writing a new CNI plugin and chaining it into the existing one is also
> >> not an option AFAIK; the plugins work at the network level and do not
> >> have access to sysfs (they handle the network namespaces). Even if it
> >> were possible (I don't have a deep CNI understanding), it would require
> >> a cluster-global configuration that is actually needed only on some of
> >> the cluster nodes.
>
> CNI chaining would be ugly, agree, but in a typical setting you'd have
> the CNI plugin itself, which is responsible for setting up the Pod for
> communication to the outside world; part of it would be the creation of
> devices and moving them into the target netns. Then you also typically
> have an agent running in the kube-system namespace in the hostns, to
> which the CNI plugin talks via IPC, e.g. to set up IPAM and other state.
> Such an agent usually sets up all sorts of knobs for the networking
> layer upon bootstrap.

The main issue is that CNI is networking-related, but the way to set
the RPS mask is by writing to /sys, which is not considered network
namespace related and is read-only inside the containers.
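
Just to make the knob concrete: the file in question is the per-rx-queue
rps_cpus entry under sysfs. A rough Python sketch of what a host-side
helper would have to do (the device name and mask below are placeholders,
this is not part of the proposed patches):

    # Sketch only: write the RPS CPU mask (a hex bitmap) into every rx
    # queue of a device via sysfs. The device name and mask are
    # placeholders; this needs a writable sysfs, i.e. it cannot run from
    # inside the container where /sys is read-only.
    import glob

    def set_rps_mask(dev, mask_hex):
        for path in glob.glob("/sys/class/net/%s/queues/rx-*/rps_cpus" % dev):
            with open(path, "w") as f:
                f.write(mask_hex)

    set_rps_mask("veth0", "fffffff0")

And it has to be repeated for every newly created device/queue, which is
exactly where the race Paolo described comes from.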

> Assuming you have a cluster where only some of the nodes have an RT
> kernel, these would likely have special node annotations in K8s so you
> could select them to run certain workloads on... couldn't such an agent
> be taught to be RT-aware and set up all the needed knobs?

I do agree this part may be doable; sadly, it is by far not the biggest problem.

> Agree it's a bit ugly to change the relevant CNI plugins to be RT-aware,
> but what if you also need other settings in the future aside from the
> RPS mask for RT? At some point you'd likely end up having to extend
> these anyway, no?
>

All networking changes are fair play; however, setting the RPS mask
is related to networking but is not a networking operation per se - it
is a cross-domain operation (network namespace/mount namespace).
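
To illustrate what I mean by cross-domain: for a device living inside a
container, a host-side agent would have to join the container's network
namespace and also get hold of a sysfs instance bound to it, i.e. touch
the mount namespace as well. A rough sketch (the PID, device name and
mask are made up, it needs CAP_SYS_ADMIN, and the final write still fails
with EROFS when that sysfs is mounted read-only, which is exactly the
problem):

    import ctypes, os

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    CLONE_NEWNET = 0x40000000
    CLONE_NEWNS  = 0x00020000

    def enter_ns(pid, name, flag):
        # join one namespace of the target (container) process
        fd = os.open("/proc/%d/ns/%s" % (pid, name), os.O_RDONLY)
        try:
            if libc.setns(fd, flag) != 0:
                raise OSError(ctypes.get_errno(), "setns(%s) failed" % name)
        finally:
            os.close(fd)

    enter_ns(4242, "net", CLONE_NEWNET)  # see the container's veth
    enter_ns(4242, "mnt", CLONE_NEWNS)   # see a sysfs bound to that netns
    with open("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w") as f:
        f.write("fffffff0")              # EROFS if that sysfs is read-only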

Thank you for your response,
Marcel

[...]
