On 22.08.2020 at 03:49, Jakub Kicinski wrote:
On Fri, 21 Aug 2020 21:01:50 +0200 Felix Fietkau wrote:
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.
Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.
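
As a rough illustration of the idea only (not the patch itself), the toy
userspace model below moves budgeted polling out of the caller's context and
into a dedicated worker thread fed by a queue. Every name in it (ToyNapi,
napi_worker, run_threaded, the budget of 64) is hypothetical and chosen for the
sketch; none of this is kernel API, and the real implementation is built on a
kernel workqueue.

```python
import queue
import threading


class ToyNapi:
    """Hypothetical stand-in for a NAPI context: a packet backlog plus poll()."""

    def __init__(self, packets):
        self.backlog = list(packets)
        self.processed = []
        self.polled_by = set()  # names of the threads that ran poll()

    def poll(self, budget):
        """Process at most `budget` packets and return how many were handled."""
        batch, self.backlog = self.backlog[:budget], self.backlog[budget:]
        self.processed.extend(batch)
        self.polled_by.add(threading.current_thread().name)
        return len(batch)


def napi_worker(work_queue, budget):
    """Poll each scheduled context; requeue it if it used its whole budget."""
    while True:
        napi = work_queue.get()
        if napi is None:  # sentinel: stop the worker
            work_queue.task_done()
            break
        done = napi.poll(budget)
        if done == budget:
            # Budget exhausted, so more work may be pending: schedule again.
            work_queue.put(napi)
        work_queue.task_done()


def run_threaded(napis, budget=64):
    """Schedule every context onto one worker thread and wait until all drain."""
    work_queue = queue.Queue()
    worker = threading.Thread(
        target=napi_worker, args=(work_queue, budget), name="napi-worker"
    )
    worker.start()
    for napi in napis:
        work_queue.put(napi)  # the caller only schedules, it never polls
    work_queue.join()  # returns once every queued item has been handled
    work_queue.put(None)
    worker.join()
```

The point of the model is that the thread which schedules the work is not the
thread that performs it, so where the poll loop runs is no longer tied to
where it was scheduled from.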
In my tests with mt76 on MT7621, using threaded NAPI + a thread for tx
scheduling improves LAN->WLAN bridging throughput by 10-50%. Throughput without
threaded NAPI is wildly inconsistent, depending on the CPU that runs the tx
scheduling thread.
With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).
Based on a patch by Hillf Danton
I've tested this patch on a non-NUMA system with a moderately
network-heavy workload (roughly 1:6 network to compute cycles).
It provides a ~2.5% speedup in terms of RPS, but P50/P99/P999
latency is 1%/6%/10% worse.
I started working on a counter-proposal which uses a pool of threads
dedicated to NAPI polling. It's not unlike the workqueue code but
trying to be a little more clever. It gives me ~6.5% more RPS but at
the same time lowers the p99 latency by 35% without impacting other
percentiles. (I only started testing this afternoon, so hopefully the
numbers will improve further).
I'm happy for this patch to be merged; it's quite nice. But I wanted
to give a heads-up that I may have something that would replace it...
The extremely rough PoC, less than half-implemented code which is really
too broken to share:
https://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git/log/?h=tapi
Looks interesting. Keep going.
Sebastian