On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:
> On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski <k...@kernel.org> wrote:
> > On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:
> > > Wei, this is a very nice work.
> > >
> > > Please re-send it without the RFC tag, so that we can hopefully merge it
> > > ASAP.
> >
> > The problem is for the application I'm testing with this implementation
> > is significantly slower (in terms of RPS) than Felix's code:
> >
> >                 |          L A T E N C Y           |  App   |     C P U     |
> >         |  RPS  |  AVG   |  P50  |  P99   |  P999  | Overld |  busy |  PSI  |
> > thread  |  1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
> > work q  |  4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
> > TAPI    |  4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |
> >
> > thread is this code, "work q" is Felix's code, TAPI is my hacks.
> >
> > The numbers are comparing performance to normal NAPI.
> >
> > In all cases (but not the baseline) I configured timer-based polling
> > (defer_hard_irqs), with around 100us timeout. Without deferring hard
> > IRQs threaded NAPI is actually slower for this app. Also I'm not
> > modifying niceness, this again causes application performance
> > regression here.
> >
>
> If I remember correctly, Felix's workqueue code uses HIGHPRIO flag
> which by default uses -20 as the nice value for the workqueue threads.
> But the kthread implementation leaves nice level as 20 by default.
> This could be 1 difference.
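
For reference, the difference being discussed looks roughly like the toy
module below, written against the stock workqueue/kthread APIs (the ex_*
names are made up; this is not the actual code from Felix's series or
from this set):

#include <linux/err.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

static struct workqueue_struct *ex_wq;
static struct work_struct ex_work;
static struct task_struct *ex_task;

/* WQ_HIGHPRI work runs in the highpri worker pool, whose workers are
 * created at nice -20 (HIGHPRI_NICE_LEVEL == MIN_NICE). */
static void ex_work_fn(struct work_struct *work)
{
        /* NAPI poll loop would go here. */
}

/* A plain kthread keeps the default nice level (0) unless someone
 * calls set_user_nice() on it explicitly. */
static int ex_kthread_fn(void *data)
{
        while (!kthread_should_stop())
                msleep_interruptible(1000);     /* poll loop placeholder */
        return 0;
}

static int __init ex_init(void)
{
        ex_wq = alloc_workqueue("ex_napi_wq", WQ_HIGHPRI | WQ_UNBOUND, 0);
        if (!ex_wq)
                return -ENOMEM;
        INIT_WORK(&ex_work, ex_work_fn);
        queue_work(ex_wq, &ex_work);

        ex_task = kthread_run(ex_kthread_fn, NULL, "ex_napi_poll");
        if (IS_ERR(ex_task)) {
                destroy_workqueue(ex_wq);
                return PTR_ERR(ex_task);
        }
        /* set_user_nice(ex_task, MIN_NICE);  <- what the HIGHPRI WQ gets for free */
        return 0;
}

static void __exit ex_exit(void)
{
        kthread_stop(ex_task);
        destroy_workqueue(ex_wq);
}

module_init(ex_init);
module_exit(ex_exit);
MODULE_LICENSE("GPL");

IOW the WQ_HIGHPRI pool gives its workers nice -20 for free, while a
plain kthread stays at the default nice level unless set_user_nice() is
called on it explicitly.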
FWIW this is the data based on which I concluded the nice -20 actually
makes things worse here:

  threaded:        -1.50%
  threaded p-20:   -5.67%
  thr poll:         2.93%
  thr poll p-20:    2.22%

Annoyingly relative performance change varies day to day and this test
was run a while back (over the weekend I was getting < 2% improvement
with this set).

> I am not sure what the benchmark is doing

Not a benchmark, real workload :)

> but one thing to try is to limit the CPUs that run the kthreads to a
> smaller # of CPUs. This could bring up the kernel cpu usage to a
> higher %, e.g. > 80%, so the scheduler is less likely to schedule
> user threads on these CPUs, thus providing isolations between
> kthreads and the user threads, and reducing the scheduling overhead.

Yeah... If I do pinning or isolation I can get to 15% RPS improvement
for this application... no threaded NAPI needed. The point for me is to
not have to do such tuning per app x platform x workload of the day.

> This could help if the throughput drop is caused by higher scheduling
> latency for the user threads. Another thing to try is to raise the
> scheduling class of the kthread from SCHED_OTHER to SCHED_FIFO. This
> could help if the throughput drop is caused by the kthreads
> experiencing higher scheduling latency.

Isn't the fundamental problem that the scheduler works at ms scale while
here we're talking about 100us at most? And AFAICT the scheduler doesn't
have a knob to adjust migration cost per process? :(

I just reached out to the kernel experts @FB for their input. Also let
me re-run with a normal prio WQ. (A rough sketch of the kind of
per-kthread pinning / SCHED_FIFO tuning suggested above is appended at
the end of this mail, for reference.)

> > 1 NUMA node. 18 NAPI instances each is around 25% of a single CPU.
> >
> > I was initially hoping that TAPI would fit nicely as an extension
> > of this code, but I don't think that will be the case.
> >
> > Are there any assumptions you're making about the configuration that
> > I should try to replicate?
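
As promised above, here is a rough userspace sketch of the per-kthread
pinning / SCHED_FIFO tuning Wei suggests (how to find the kthread's PID
is left out; this is purely illustrative and not part of either series):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <napi-kthread-pid> <cpu>\n", argv[0]);
                return 1;
        }

        pid_t pid = (pid_t)atoi(argv[1]);
        int cpu = atoi(argv[2]);

        /* Pin the kthread to a single CPU (the "limit the CPUs" idea). */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(pid, sizeof(mask), &mask)) {
                perror("sched_setaffinity");
                return 1;
        }

        /* Switch it from SCHED_OTHER to SCHED_FIFO with a modest priority;
         * this needs CAP_SYS_NICE / root. */
        struct sched_param sp = { .sched_priority = 1 };
        if (sched_setscheduler(pid, SCHED_FIFO, &sp)) {
                perror("sched_setscheduler");
                return 1;
        }

        return 0;
}

Which, to restate the point above, is exactly the kind of per-app /
per-platform tuning I'd rather not be doing by hand.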