On 16-05-19 01:39 PM, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>>
>>> Removing busylock helped in all cases I tested. (at least on x86 as
>>> David pointed out)
>>>
>>> As I said, we need to revisit busylock now that spinlocks are different.
>>>
>>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>>
>>> With busylock :
>>>
>>> lpaa5:~# sar -n DEV 4 4|grep eth0
>>> Average:    eth0    12.19  112797.12     1.95  37672.28     0.00     0.00     0.69
>>
>> Hmpf, my sysctl logic was inverted. Really these results made little
>> sense.
>>
>> Sorry for the noise. At least we have 8% confirmed gain with this
>> stuff ;)
>
> Unfortunately I see a 21% regression when you place the qdisc under
> load from multiple CPUs/nodes. In my testing I dropped from around
> 1.05 Mpps to around 827 Kpps when I removed the busylock.
>
> My test setup is pretty straight forward and the results I have seen
> are consistent between runs. I have a system with 32 threads / 16
> cores spread over 2 NUMA nodes. I reduce the number of queues on a
> 10Gb/s NIC to 1. I kill irqbalance and disable CPU power states. I
> then start a quick "for" loop that will schedule one UDP_STREAM
> netperf session on each CPU using a small payload size like 32.
>
> On a side note, if I move everything onto one node I can push about
> 2.4 Mpps and the busylock doesn't seem to really impact things, but if
> I go to both nodes then I see the performance regression as described
> above. I was thinking about it and I don't think the MCS type locks
> would have that much of an impact. If anything I think that xmit_more
> probably has a bigger effect given that it allows us to grab multiple
> frames with each fetch and thereby reduce the lock contention on the
> dequeue side.
>
>>> Presumably it would tremendously help if the actual kfree_skb()
>>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>>> method.
>>
>> This part is still valid.
>>
>> We could have a per cpu storage of one skb pointer, so that we do not
>> have to change all ->enqueue() prototypes.
>
> I fully agree with that.
>
> One thought I had is if we could move to a lockless dequeue path for
> the qdisc then we could also get rid of the busy lock. I know there
> has been talk about doing away with qdisc locks or changing the inner
> mechanics of the qdisc itself in the past, I'm CCing Jesper and John
> for that reason. Maybe it is time for us to start looking into that
> so we can start cleaning out some of the old cruft we have in this
> path.
>
> - Alex
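
Regarding the deferred kfree_skb() / per-cpu skb pointer idea above, a minimal
sketch of how that could look (hypothetical and untested; the helper names are
invented here, and the flush would be called from __dev_xmit_skb() after the
qdisc lock is dropped):

#include <linux/percpu.h>
#include <linux/skbuff.h>

/* One deferred-drop slot per cpu, so ->enqueue() prototypes stay unchanged. */
static DEFINE_PER_CPU(struct sk_buff *, deferred_drop_skb);

/* Stand-in for kfree_skb() inside ->enqueue(); runs with the qdisc lock held
 * and BHs disabled, so we cannot migrate to another cpu here. */
static void qdisc_defer_drop(struct sk_buff *skb)
{
        struct sk_buff *old = this_cpu_xchg(deferred_drop_skb, skb);

        /* Only one slot: if something is already parked there, free it now. */
        if (old)
                kfree_skb(old);
}

/* Called after spin_unlock(qdisc_lock(q)), outside the contended section. */
static void qdisc_flush_deferred_drop(void)
{
        struct sk_buff *skb = this_cpu_xchg(deferred_drop_skb, NULL);

        if (skb)
                kfree_skb(skb);
}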
FWIW, I plan to start looking at this again in June when I have some more
time. The last set of RFCs I sent out bypassed both the qdisc lock and the
busylock. I remember thinking it was a net win at the time, but I only did
very basic testing, e.g. firing up n sessions of pktgen.

And since we are talking about cruft, I have always thought the gso_skb
requeue logic could be done away with as well. As far as I can tell it is
left over from some historic code that was refactored or deleted in pre-git
days. I think (no data) it would be much better to use byte queue limits, or
some other way to ensure the driver can consume the packet, than to keep
popping and pushing skbs around; a rough sketch of the driver-side BQL hooks
is below.

.John
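
For reference, the driver-side BQL hooks mentioned above look roughly like
this (a sketch only; the foo_* names and the single-queue assumption are made
up for illustration):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* In the driver's ndo_start_xmit(), after the descriptor is posted. */
static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        /* ... post skb to the hardware ring ... */

        /* Account the bytes; once the dynamic limit is hit BQL stops the
         * queue, so the stack stops handing us packets we cannot consume. */
        netdev_tx_sent_queue(txq, skb->len);
        return NETDEV_TX_OK;
}

/* In the TX completion path. */
static void foo_tx_clean(struct net_device *dev, unsigned int pkts,
                         unsigned int bytes)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        /* Report completed work; BQL re-wakes the queue when there is room. */
        netdev_tx_completed_queue(txq, pkts, bytes);
}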