On Thu, 2016-05-19 at 11:03 -0700, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput"))
> >
> > Now that kernel spinlocks are MCS-based, this busylock thing is no
> > longer relevant. It is slowing things down a bit.
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a
> > host with 48 hyperthreads.
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:44    eth0 798951.25 798951.75  52276.22  52275.26  0.00  0.00  0.50
> > 10:05:48    eth0 798576.00 798572.75  52251.24  52250.39  0.00  0.00  0.75
> > 10:05:52    eth0 798746.00 798748.75  52262.89  52262.13  0.00  0.00  0.50
> > 10:05:56    eth0 798303.25 798291.50  52235.22  52233.10  0.00  0.00  0.50
> > Average:    eth0 798644.12 798641.19  52256.39  52255.22  0.00  0.00  0.56
> >
> > Disabling busylock (by using a local sysctl):
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:14    eth0 864085.75 864091.50  56538.09  56537.46  0.00  0.00  0.50
> > 10:05:18    eth0 864734.75 864729.25  56580.35  56579.05  0.00  0.00  0.75
> > 10:05:22    eth0 864366.00 864361.50  56556.74  56555.00  0.00  0.00  0.50
> > 10:05:26    eth0 864246.50 864248.75  56549.19  56547.65  0.00  0.00  0.50
> > Average:    eth0 864358.25 864357.75  56556.09  56554.79  0.00  0.00  0.56
> >
> > That would be an 8% increase.
>
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case, which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down. It was the fairness between the threads that was killing us,
> because the dequeue needs to have priority.
>
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system, and the result was that performance would drop off a cliff and
> you would drop almost all the packets for almost all the streams,
> because the qdisc never had a chance to drain: it would be (CPUs - 1)
> enqueues, followed by 1 dequeue.
>
> What we need, if we are going to get rid of busy lock, would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion, because the qdiscs just overflow without the busy lock in
> place.
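For reference, the heuristic we are discussing sits in __dev_xmit_skb().
Heavily trimmed (the bypass and deactivated paths are omitted), so take
this as a sketch of the pattern rather than the exact code:

static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
				 struct net_device *dev,
				 struct netdev_queue *txq)
{
	spinlock_t *root_lock = qdisc_lock(q);
	bool contended;
	int rc;

	/* If another cpu currently owns the qdisc (__QDISC___STATE_RUNNING),
	 * make would-be enqueuers serialize on busylock first, so that at
	 * most one of them competes with the dequeuing cpu for root_lock.
	 */
	contended = qdisc_is_running(q);
	if (unlikely(contended))
		spin_lock(&q->busylock);

	spin_lock(root_lock);
	rc = q->enqueue(skb, q) & NET_XMIT_MASK;
	qdisc_run(q);
	spin_unlock(root_lock);

	if (unlikely(contended))
		spin_unlock(&q->busylock);
	return rc;
}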
Removing busylock helped in all cases I tested (at least on x86, as
David pointed out).

As I said, we need to revisit busylock now that spinlocks are
different.

In one case (20 concurrent UDP netperf), I even got a 500% increase.

With busylock:

lpaa5:~# sar -n DEV 4 4|grep eth0
11:33:34    eth0     9.00 115057.00   1.60  38426.92  0.00  0.00  0.50
11:33:38    eth0    13.50 113237.75   2.04  37819.69  0.00  0.00  0.75
11:33:42    eth0    13.50 111492.25   1.76  37236.58  0.00  0.00  0.75
11:33:46    eth0    12.75 111401.50   2.40  37205.93  0.00  0.00  0.75
Average:    eth0    12.19 112797.12   1.95  37672.28  0.00  0.00  0.69

Packets are dropped in HTB because we hit its limit of 1000 packets there:

-  100.00%  netperf  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 100.00% htb_enqueue
           __dev_queue_xmit
           dev_queue_xmit
           ip_finish_output2
           ip_finish_output
           ip_output
           ip_local_out
         + ip_send_skb

Presumably it would tremendously help if the actual kfree_skb() was done
after the qdisc lock is released, i.e. not from the qdisc->enqueue()
method (see the sketch at the end of this mail).

Without busylock:

lpaa5:~# sar -n DEV 4 4|grep eth0
11:41:12    eth0    11.00 669053.50   1.99 223452.30  0.00  0.00  0.75
11:41:16    eth0     8.50 669513.25   2.27 223605.55  0.00  0.00  0.75
11:41:20    eth0     3.50 669426.50   0.90 223577.19  0.00  0.00  0.50
11:41:24    eth0     8.25 669284.00   1.42 223529.79  0.00  0.00  0.50
Average:    eth0     7.81 669319.31   1.65 223541.21  0.00  0.00  0.62
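Here is roughly the shape the deferred free could take. This is only a
sketch, and the third ->enqueue() argument is hypothetical, not an
existing API: the idea is that enqueue chains any packet it decides to
drop onto a caller-provided list instead of calling kfree_skb() under
root_lock, and the caller frees the chain after unlocking.

/* Hypothetical: ->enqueue() gains a third parameter, and its drop
 * path does
 *	skb->next = *to_free;
 *	*to_free = skb;
 *	return NET_XMIT_DROP;
 * instead of kfree_skb(skb).  The caller then does the actual freeing
 * once the qdisc root lock has been released:
 */
static int enqueue_skb_deferred_free(struct sk_buff *skb, struct Qdisc *q,
				     spinlock_t *root_lock)
{
	struct sk_buff *to_free = NULL;
	int rc;

	spin_lock(root_lock);
	rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
	spin_unlock(root_lock);

	while (to_free) {	/* the expensive part, now outside the lock */
		struct sk_buff *next = to_free->next;

		kfree_skb(to_free);
		to_free = next;
	}
	return rc;
}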