On 6/26/2020 5:48 AM, Maxim Mikityanskiy wrote:
Thanks a lot for your reply! It was really helpful. I have a few
comments, please see below.
On 2020-06-24 23:21, Samudrala, Sridhar wrote:
On 6/17/2020 6:15 AM, Maxim Mikityanskiy wrote:
Hi,
I discovered the Intel ADQ feature [1], which makes it possible to boost
performance by picking dedicated queues for application traffic. We did
some research, and I have some level of understanding of how it works,
but I still have some questions, and I hope you can answer them.
1. SO_INCOMING_NAPI_ID usage. In my understanding, every connection
has a key (sk_napi_id) that is unique to the NAPI where this
connection is handled, and the application uses that key to choose a
handler thread from the thread pool. If we have a one-to-one
relationship between application threads and NAPI IDs of connections,
each application thread will handle only traffic from a single NAPI.
Is my understanding correct?
Yes. It is correct and recommended with the current implementation.
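For illustration, here is a minimal user-space sketch of that pattern.
Only SO_INCOMING_NAPI_ID itself is the kernel interface; the worker
mapping and pick_worker() are made-up application logic, not part of
any ADQ or kernel API:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* asm-generic value; older libc headers may lack it */
#endif

/* Returns the NAPI ID of the RX queue that delivered this connection's
 * packets, or 0 if nothing has been received on the socket yet. */
static unsigned int get_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
		perror("SO_INCOMING_NAPI_ID");
	return napi_id;
}

/* Made-up dispatch helper: keep connections from the same NAPI on the
 * same application thread.  A real application would maintain an
 * explicit napi_id -> worker table built as new NAPI IDs appear. */
static int pick_worker(unsigned int napi_id, int num_workers)
{
	return (int)(napi_id % (unsigned int)num_workers);
}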
1.1. I wonder how the application thread gets scheduled on the same
core that NAPI runs on. It currently only works with busy_poll, so
when the application initiates busy polling (calls epoll), does the
Linux scheduler move the thread to the right CPU? Do we have to have a
strict one-to-one relationship between threads and NAPIs, or can one
thread handle multiple NAPIs? When the data arrives, does the
scheduler run the application thread on the same CPU that NAPI ran on?
The app thread can do busy polling from any core, and there is no
requirement that the scheduler move the thread to a specific CPU.
If the NAPI processing happens via interrupts, the scheduler could move
the app thread to the same CPU that NAPI ran on.
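For reference, a minimal sketch of the knobs involved (the value is
just an example; on current kernels epoll_wait()-driven busy polling is
additionally gated by the net.core.busy_poll sysctl, and raising
SO_BUSY_POLL may require CAP_NET_ADMIN):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* asm-generic value */
#endif

/* Ask the kernel to busy poll the device queue for up to 'usec'
 * microseconds on blocking receives/poll on this socket.  For
 * epoll_wait()-based busy polling, the global net.core.busy_poll
 * sysctl (also in microseconds) must be non-zero as well. */
static int enable_busy_poll(int fd, int usec)
{
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec))) {
		perror("SO_BUSY_POLL");	/* increasing it may need CAP_NET_ADMIN */
		return -1;
	}
	return 0;
}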
1.2. I see that SO_INCOMING_NAPI_ID is tightly coupled with busy_poll.
It is enabled only if CONFIG_NET_RX_BUSY_POLL is set. Is there a real
reason why it can't be used without busy_poll? In other words, if we
modify the kernel to drop this requirement, will the kernel still
schedule the application thread on the same CPU as NAPI when busy_poll
is not used?
It should be OK to remove this restriction, but it requires enabling
this in skb_mark_napi_id() and sk_mark_napi_id() too.
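For context, here is (approximately) what those helpers look like
today; note that sk_napi_id itself only exists in struct sock under the
same #ifdef, so lifting the restriction means touching the field as
well. This is paraphrased from include/net/busy_poll.h, not verbatim
kernel code:

/* Paraphrased from include/net/busy_poll.h (approximate) */
static inline void skb_mark_napi_id(struct sk_buff *skb,
				    struct napi_struct *napi)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	skb->napi_id = napi->napi_id;
#endif
}

/* Called from the protocol handlers to propagate the NAPI ID
 * from the skb to the socket. */
static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	sk->sk_napi_id = skb->napi_id;
#endif
	sk_rx_queue_set(sk, skb);
}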
2. Can you compare ADQ to aRFS+XPS? aRFS provides a way to steer
traffic to the application's CPU in an automatic fashion, and xps_rxqs
can be used to transmit from the corresponding queues. This setup
doesn't need manual configuration of TCs and is not limited to 4
applications. The difference with ADQ is that (in my understanding) it
moves the application to the RX CPU, while aRFS steers the traffic to
the RX queue handled by the application's CPU. Is there any advantage
of ADQ over aRFS that I failed to find?
aRFS+XPS ties app thread to a cpu,
Well, not exactly. To pin the app thread to a CPU, one uses
taskset/sched_setaffinity, while aRFS+XPS picks a queue that
corresponds to that CPU.
Yes. I should have said that XPS-cpus ties the app thread to a CPU and
aRFS maps that CPU to an RX queue.
whereas ADQ ties the app thread to a NAPI ID, which in turn ties to a
queue (or queues).
So, basically, both technologies result in making NAPI and the app run
on the same CPU. The difference that I see is that ADQ forces NAPI
processing (in busy polling) on the app's CPU, while aRFS steers the
traffic to a queue whose NAPI runs on the app's CPU. The effect is the
same, but ADQ requires busy polling. Is my understanding correct?
'busy polling' is not a requirement. It is possible to use ADQ receive
filters with XPS based on rx queues to make NAPI and the app run on the
same CPU without busy polling.
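For completeness, both flavours of XPS are plain sysfs writes; a rough
sketch follows (the device name eth0, the tx-0 queue and the masks are
made-up values for the example; xps_rxqs is the variant that pairs with
ADQ queue sets, xps_cpus the one used alongside aRFS):

#include <stdio.h>

/* Write a mask string into one XPS sysfs attribute of eth0's tx-0
 * queue.  xps_cpus takes a CPU bitmask; xps_rxqs takes an RX-queue
 * bitmask. */
static int set_xps(const char *attr, const char *mask)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/eth0/queues/tx-0/%s", attr);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", mask);
	return fclose(f);
}

int main(void)
{
	/* In practice you would configure one mode or the other. */
	set_xps("xps_cpus", "2");	/* transmit via tx-0 from CPU 1 */
	set_xps("xps_rxqs", "1");	/* transmit via tx-0 for flows received on rx-0 */
	return 0;
}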
ADQ also provides two levels of filtering compared to aRFS+XPS. The
first-level filter selects the queue set associated with the
application, and the second-level filter (or RSS) selects a queue
within that queue set associated with an app thread.
This difference looks important. So, ADQ reserves a dedicated set of
queues solely for the application's use.
The current interface to configure ADQ limits us to supporting up to
16 application-specific queue sets (TC_MAX_QUEUE).
From the commit message:
https://patchwork.ozlabs.org/project/netdev/patch/20180214174539.11392-5-jeffrey.t.kirs...@intel.com/
I got that i40e supports up to 4 groups. Has this limitation been
lifted, or are you saying that 16 is the limitation of mqprio, while the
driver may support fewer? Or is it different for different Intel drivers?
Yes. That patch enables support for ADQ in a VF, and it is currently
limited to 4 queue groups. But 16 is the tc mqprio interface
limitation.
3. At [1], you mention that ADQ can be used to create separate RSS
sets. Could you elaborate on the API used? Does the tc mqprio
configuration also affect RSS? Can it be turned on/off?
Yes. tc mqprio allows creating queue sets per application, and the
driver configures RSS per queue set.
4. How is tc flower used in the context of ADQ? Does the user need to
reflect the configuration in both the mqprio qdisc (for TX) and tc flower
(for RX)? It looks like tc flower maps incoming traffic to TCs, but
what is the mechanism of mapping TCs to RX queues?
tc mqprio is used to map TCs to RX queues.
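For illustration, the mqprio options carry per-TC queue counts and
offsets ("queues A@B" means count A at offset B), and the driver
publishes that mapping and mirrors it on its RX queues. A rough,
simplified sketch of the driver-facing part (not taken from any
specific driver; real ADQ drivers such as i40e do this inside their
channel setup, with error handling and the RX filter/RSS programming
omitted here):

/* Simplified sketch: publish the TC -> queue-range mapping received
 * from tc mqprio.  num_tc, counts[] and offsets[] come from the
 * mqprio options. */
static void publish_tc_map(struct net_device *netdev, u8 num_tc,
			   const u16 *counts, const u16 *offsets)
{
	u8 tc;

	netdev_set_num_tc(netdev, num_tc);
	for (tc = 0; tc < num_tc; tc++)
		netdev_set_tc_queue(netdev, tc, counts[tc], offsets[tc]);

	/* On the RX side the driver programs its own queue-group
	 * filters and per-group RSS so that traffic classified into a
	 * TC lands in the same queue range; that part is hardware
	 * specific and not shown. */
}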
OK, I got how the configuration works now, thanks! Though I'm not sure
mqprio is the best API to configure the RX side. I thought it was
supposed to configure the TX queues. It looks more like a hack to me.
The ethtool RSS context API (look for "context" in man ethtool) seems
more appropriate for the RX side for this purpose.
Thanks, we will explore whether ethtool will work for us. We went with
mqprio so that we can configure both TX and RX queue sets together
rather than splitting the configuration into two steps.
Thanks,
Max
tc flower is used to configure the first-level filter that redirects
packets to the queue set associated with an application.
I really hope you will be able to shed more light on this feature so I
can better understand how to use it and how it compares with aRFS.
Hope this helps; we will go over it in more detail in our netdev
session.
Thanks,
Max
[1]:
https://netdevconf.info/0x14/session.html?talk-ADQ-for-system-level-network-io-performance-improvements