On Fri, 21 Aug 2020 21:01:50 +0200 Felix Fietkau wrote:
> For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
> poll function does not perform well. Since NAPI poll is bound to the CPU it
> was scheduled from, we can easily end up with a few very busy CPUs spending
> most of their time in softirq/ksoftirqd and some idle ones.
>
> Introduce threaded NAPI for such drivers based on a workqueue. The API is the
> same except for using netif_threaded_napi_add instead of netif_napi_add.
>
> In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx
> scheduling improves LAN->WLAN bridging throughput by 10-50%. Throughput
> without threaded NAPI is wildly inconsistent, depending on the CPU that runs
> the tx scheduling thread.
>
> With threaded NAPI, throughput seems stable and consistent (and higher than
> the best results I got without it).
>
> Based on a patch by Hillf Danton
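
To illustrate the imbalance described in the quoted text, here is a toy
user-space model in Python. It is not kernel code and not taken from the
patch. The instance names, work units, CPU count and the greedy
"least-loaded CPU" placement are all assumptions made for the example; the
real scheduler is far more involved. It only shows why pinning poll work to
the scheduling CPU can leave some CPUs busy and others idle.

```python
# Toy model only: names, numbers and the placement policy are hypothetical.

# Poll work per NAPI instance, in arbitrary units.
NAPI_WORK = {"rx0": 40, "rx1": 30, "tx": 20, "rx2": 10}

# CPU each instance happened to be scheduled from (e.g. via IRQ affinity).
SCHED_CPU = {"rx0": 0, "rx1": 0, "tx": 1, "rx2": 1}

N_CPUS = 4


def pinned_load(napi_work, sched_cpu, n_cpus):
    """Per-CPU load when every poll runs on the CPU that scheduled it."""
    load = [0] * n_cpus
    for name, work in napi_work.items():
        load[sched_cpu[name]] += work
    return load


def threaded_load(napi_work, n_cpus):
    """Per-CPU load when poll work may run on any CPU.

    Greedy placement on the least-loaded CPU stands in for the scheduler.
    """
    load = [0] * n_cpus
    for work in sorted(napi_work.values(), reverse=True):
        cpu = load.index(min(load))  # first CPU with the lowest load
        load[cpu] += work
    return load


if __name__ == "__main__":
    print("pinned:  ", pinned_load(NAPI_WORK, SCHED_CPU, N_CPUS))
    print("threaded:", threaded_load(NAPI_WORK, N_CPUS))
```

In this made-up setup the pinned case leaves two CPUs idle while one carries
most of the work; the threaded case spreads the same work over all four.
Whether that translates into better throughput or latency on real hardware
is exactly what has to be measured.
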
I've tested this patch on a non-NUMA system with a moderately
network-heavy workload (roughly 1:6 network to compute cycles). It
provides a ~2.5% speedup in terms of RPS, but P50/P99/P999 latency is
1%/6%/10% worse.

I started working on a counter-proposal which uses a pool of threads
dedicated to NAPI polling. It's not unlike the workqueue code, but it
tries to be a little more clever. It gives me ~6.5% more RPS but at
the same time lowers P99 latency by 35% without impacting the other
percentiles. (I only started testing this afternoon, so hopefully the
numbers will improve further.)

I'm happy for this patch to be merged, it's quite nice, but I wanted
to give a heads-up that I may have something that would replace it...

The extremely rough PoC, less than half-implemented code which is
really too broken to share:

https://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git/log/?h=tapi