On Wed, 30 Sep 2020 10:58:00 +0200 Paolo Abeni wrote:
> On Tue, 2020-09-29 at 14:48 -0700, Jakub Kicinski wrote:
> > On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:
> > > On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski <k...@kernel.org> wrote:
> > > > On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:
> > > > > Wei, this is a very nice work.
> > > > >
> > > > > Please re-send it without the RFC tag, so that we can hopefully merge
> > > > > it ASAP.
> > > >
> > > > The problem is for the application I'm testing with this implementation
> > > > is significantly slower (in terms of RPS) than Felix's code:
> > > >
> > > >         |      |       L  A  T  E  N  C  Y        |  App   |     C P U     |
> > > >         | RPS  |  AVG   |  P50  |  P99   |  P999  | Overld | busy  |  PSI  |
> > > >  thread | 1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
> > > >  work q | 4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
> > > >  TAPI   | 4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |
> > > >
> > > > thread is this code, "work q" is Felix's code, TAPI is my hacks.
> > > >
> > > > The numbers are comparing performance to normal NAPI.
> > > >
> > > > In all cases (but not the baseline) I configured timer-based polling
> > > > (defer_hard_irqs), with around 100us timeout. Without deferring hard
> > > > IRQs threaded NAPI is actually slower for this app. Also I'm not
> > > > modifying niceness, this again causes application performance
> > > > regression here.
> > >
> > > If I remember correctly, Felix's workqueue code uses HIGHPRIO flag
> > > which by default uses -20 as the nice value for the workqueue threads.
> > > But the kthread implementation leaves nice level as 20 by default.
> > > This could be 1 difference.
> >
> > FWIW this is the data based on which I concluded the nice -20 actually
> > makes things worse here:
> >
> > threded: -1.50%
> > threded p-20: -5.67%
> > thr poll: 2.93%
> > thr poll p-20: 2.22%
> >
> > Annoyingly relative performance change varies day to day and this test
> > was run a while back (over the weekend I was getting < 2% improvement
> > with this set).
>
> I'm assuming your application uses UDP as the transport protocol - raw
> IP or packet socket should behave in the same way. I observed similar
> behavior - that is unstable figures, and end-to-end tput decrease when
> network stack get more cycles (or become faster) - when the bottle-neck
> was in user-space processing[1].
>
> You can double check you are hitting the same scenario observing the
> UDP protocol stats (you should see higher drops figures with threaded
> and even more with threded p-20, compared to the other impls).
>
> If you are hitting such scenario, you should be able to improve things
> setting nice-20 to the user-space process, increasing the UDP socket
> receive buffer size or enabling socket busy polling
> (/proc/sys/net/core/busy_poll, I mean).

It's not UDP. The application has some logic to tell the load balancer to back off whenever it feels like it's not processing requests fast enough (App Overld in the table 2 emails back). That statistic is higher with p-20. Application latency suffers, too.
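
For reference, the socket-side version of what Paolo suggests (bigger
receive buffer plus busy polling) would look roughly like the below on a
plain AF_INET UDP socket - only a sketch, since this app isn't UDP, and
the values are placeholders, not tuned numbers:

/* Sketch only: per-socket equivalents of the suggested knobs - a larger
 * receive buffer and busy polling - on a plain AF_INET UDP socket.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46		/* from asm-generic/socket.h, if libc lacks it */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int rcvbuf = 4 * 1024 * 1024;	/* kernel doubles this for skb overhead */
	int busy_usec = 50;		/* per-socket busy-poll budget, in usec */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)))
		perror("SO_RCVBUF");
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usec, sizeof(busy_usec)))
		perror("SO_BUSY_POLL");	/* values above sysctl busy_read need CAP_NET_ADMIN */

	return 0;
}

(net.core.busy_poll covers poll()/select(); SO_BUSY_POLL, or the
net.core.busy_read default, is the per-socket budget for blocking reads.)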