On Mon, May 18, 2020 at 7:16 AM Václav Zindulka <vaclav.zindu...@tlapnet.cz> wrote:
>
> On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangc...@gmail.com> wrote:
> >
> > On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> > <vaclav.zindu...@tlapnet.cz> wrote:
> > > > > >
> > > > > > I tried to emulate your test case in my VM, here is the script I
> > > > > > use:
> > > > > >
> > > > > > ====
> > > > > > ip li set dev dummy0 up
> > > > > > tc qd add dev dummy0 root handle 1: htb default 1
> > > > > > for i in `seq 1 1000`
> > > > > > do
> > > > > > tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit
> > > > > > ceil 1.5mbit
> > > > > > tc qd add dev dummy0 parent 1:$i fq_codel
> > > > > > done
> > > > > >
> > > > > > time tc qd del dev dummy0 root
> > > > > > ====
> > > > > >
> > > > > > And this is the result:
> > > > > >
> > > > > > Before my patch:
> > > > > > real 0m0.488s
> > > > > > user 0m0.000s
> > > > > > sys 0m0.325s
> > > > > >
> > > > > > After my patch:
> > > > > > real 0m0.180s
> > > > > > user 0m0.000s
> > > > > > sys 0m0.132s
> > > > >
> > > > > My results with your test script.
> > > > >
> > > > > before patch:
> > > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > > real 1.63
> > > > > user 0.00
> > > > > sys 1.63
> > > > >
> > > > > after patch:
> > > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > > real 1.55
> > > > > user 0.00
> > > > > sys 1.54
> > > > > >
> > > > > > This is an obvious improvement, so I have no idea why you didn't
> > > > > > catch any difference.
> > > > >
> > > > > We use hfsc instead of htb. I don't know whether it may cause any
> > > > > difference. I can provide you with my test scripts if necessary.
> > > >
> > > > Yeah, you can try to replace the htb with hfsc in my script,
> > > > I didn't spend time to figure out hfsc parameters.
> > >
> > > class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
> > > 13107200 ul m1 0 d 0 m2 13107200
> > >
> > > but it behaves the same as htb...
> > >
> > > > My point here is, if I can see the difference with merely 1000
> > > > tc classes, you should see a bigger difference with hundreds
> > > > of thousands classes in your setup. So, I don't know why you
> > > > saw a relatively smaller difference.
> > >
> > > I saw a relatively big difference. It was about 1.5s faster on my huge
> > > setup which is a lot. Yet maybe the problem is caused by something
> >
> > What percentage? IIUC, without patch it took you about 11s, so
> > 1.5s faster means 13% improvement for you?
>
> My whole setup needs 22.17 seconds to delete with an unpatched kernel.
> With your patches applied it is 21.08. So it varies between 1 - 1.5s.
> Improvement is about 5 - 6%.
Good to know.

>
> > > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > > queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
> > > queues. I've tried to increase the number of queues using ethtool from
> > > 4 to 8 and decreased to 2. But there was no difference. It was about
> > > 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> > > patches applied. I've tried it for ifb and RJ45 interfaces where it
> > > took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> > > patches applied, which is strange, but it may be caused by the fact it
> > > was very fast even before.
> >
> > That is odd. In fact, this is highly related to number of TX queues,
> > because the existing code resets the qdisc's once for each TX
> > queue, so the more TX queues you have, the more resets kernel
> > will do, that is the more time it will take.
>
> Can't the problem be caused that reset is done on active and inactive
> queues every time? It would explain why it had no effect in decreasing
> and increasing the number of active queues. Yet it doesn't explain why
> Intel card (82599ES) with 64 possible queues has exactly the same
> problem as Mellanox (ConnectX-4 LX) with 8 possible queues.

Regardless of these queues, the qdisc's should be only reset once,
because all of these queues point to the same instance of root qdisc
in your case.

[...]

> With the attached patch I'm down to 1.7 seconds - more than 90%
> improvement :-) Can you please check it and pass it to proper places?
> According to debugging printk messages it empties only active queues.

You can't change netdev_for_each_tx_queue(), it would certainly
at least break netif_alloc_netdev_queues().

Let me think how to fix this properly, I have some ideas and will
provide you some patch(es) to test soon.

Thanks!
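
P.S. To make the per-queue reset issue above concrete, here is a toy
user-space model of the behaviour being discussed. The types and helpers
are hypothetical stand-ins, not the kernel's struct Qdisc or struct
netdev_queue: when every TX queue points at the same root qdisc, a
per-queue reset loop walks and purges that one qdisc tree once for every
allocated queue, even though a single reset would suffice.

====
#include <stdio.h>

/* hypothetical stand-ins, NOT the real kernel structures */
struct toy_qdisc {
	const char *name;
	unsigned int reset_count;
};

struct toy_netdev_queue {
	struct toy_qdisc *qdisc;
};

#define NUM_TX_QUEUES 8	/* e.g. a ConnectX-4 LX port */

static void toy_qdisc_reset(struct toy_qdisc *q)
{
	/* in the kernel this walks and purges the whole class tree,
	 * which is the expensive part with hundreds of thousands of
	 * classes; here we just count how often it happens */
	q->reset_count++;
}

int main(void)
{
	struct toy_qdisc root = { .name = "root hfsc", .reset_count = 0 };
	struct toy_netdev_queue txq[NUM_TX_QUEUES];
	unsigned int i;

	/* every TX queue points at the same root qdisc instance */
	for (i = 0; i < NUM_TX_QUEUES; i++)
		txq[i].qdisc = &root;

	/* per-queue reset: the shared root is reset once per queue */
	for (i = 0; i < NUM_TX_QUEUES; i++)
		toy_qdisc_reset(txq[i].qdisc);

	printf("%s was reset %u times for %u TX queues (1 would do)\n",
	       root.name, root.reset_count, (unsigned int)NUM_TX_QUEUES);
	return 0;
}
====

Compiled and run, it reports 8 resets for 8 queues, which is why the
teardown time scales with the number of TX queues rather than staying
constant for a single shared root qdisc.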
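On the last point, the iterator in question is roughly the following
static inline (a simplified rendering of include/linux/netdevice.h;
details may differ between kernel versions). It deliberately walks
dev->num_tx_queues, i.e. every allocated queue, not just the queues
currently enabled via ethtool (real_num_tx_queues), and
netif_alloc_netdev_queues() relies on exactly that to initialize each
queue at allocation time, so restricting the iterator to active queues
would break allocation rather than merely skip idle queues during reset.

====
static inline void netdev_for_each_tx_queue(struct net_device *dev,
					    void (*f)(struct net_device *,
						      struct netdev_queue *,
						      void *),
					    void *arg)
{
	unsigned int i;

	/* every allocated queue, active or not */
	for (i = 0; i < dev->num_tx_queues; i++)
		f(dev, netdev_get_tx_queue(dev, i), arg);
}
====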