On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0...@gmail.com> wrote:
> I know this might be an old topic so bear with me - what we are
> facing is that applications are sending small packets using hundreds
> of threads, so the contention on the spin lock in __dev_xmit_skb
> increases the latency of dev_queue_xmit significantly. We're building
> a network QoS solution to avoid interference between different
> applications, using HTB.

Yes, as you have noticed, with HTB there is a single qdisc root lock,
and lock contention obviously happens :-)

It is possible, with various tricks, to make it scale. I believe
Google is using a variant of HTB, and it scales for them. They have
not open-sourced their modifications to HTB (which likely also involve
a great deal of setup tricks).

If your purpose is to limit traffic/bandwidth per "cloud" instance,
then you can just use another TC setup structure. Like using MQ and
assigning an HTB per MQ queue (where the MQ queues are bound to each
CPU/HW queue)... But you have to figure out this setup yourself (a
rough sketch follows at the end of this mail).

> But in this case when some applications send massive amounts of small
> packets in parallel, the application to be protected will get its
> throughput affected (because it's doing synchronous network
> communication using multiple threads, and throughput is sensitive to
> the increased latency).
>
> Here is the profiling from perf:
>
>   - 67.57% iperf  [kernel.kallsyms]  [k] _spin_lock
>      - 99.94% dev_queue_xmit
>         - 96.91% _spin_lock
>         - 2.62% __qdisc_run
>            - 98.98% sch_direct_xmit
>               - 99.98% _spin_lock
>
> As far as I understand, the design of TC is to simplify the locking
> scheme and minimize the work in __qdisc_run so that throughput won't
> be affected, especially with large packets. However, if the scenario
> is that multiple classes in the queueing discipline only have a
> shaping limit, there isn't really any necessary correlation between
> the different classes. The only synchronization point should be when
> the packet is dequeued from the qdisc queue and enqueued to the
> transmit queue of the device. My question is: is it worth investing
> in avoiding the locking contention by partitioning the queue/lock so
> that this scenario is addressed with relatively smaller latency?

Yes, there is a lot to gain, but it is not easy ;-)

> I must have oversimplified a lot of details since I'm not familiar
> with the TC implementation at this point - I just want to get your
> input on whether this is a worthwhile effort, or whether there is
> something fundamental that I'm not aware of. If this is just a matter
> of quite some additional work, I would also appreciate help with
> outlining the required work here.
>
> Also, I would appreciate any information about the latest status of
> this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

This article seems to be of very low quality... spelling errors, only
5 pages, no real code, etc.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
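
PS: To make the MQ + HTB-per-queue idea above concrete, here is a
rough, untested sketch. The device name "eth0", the handles, and the
rates are made-up placeholders; you would repeat the HTB attachment
for each of your NIC's TX queues:

  # Replace the root qdisc with MQ; MQ exposes one class per HW TX queue
  tc qdisc add dev eth0 root handle 1: mq

  # Attach an independent HTB instance (each with its own root lock)
  # under the first two MQ classes:
  tc qdisc add dev eth0 parent 1:1 handle 10: htb default 10
  tc class add dev eth0 parent 10: classid 10:10 htb rate 250mbit ceil 1gbit

  tc qdisc add dev eth0 parent 1:2 handle 20: htb default 10
  tc class add dev eth0 parent 20: classid 20:10 htb rate 250mbit ceil 1gbit

  # ... repeat for the remaining TX queues ...

Be aware that each HTB instance then shapes its own TX queue
independently, so the aggregate limit is roughly rate * nr_queues
unless you also steer flows onto specific queues (e.g. with XPS, or by
pinning application threads to CPUs). Getting that steering right is
part of the "setup tricks" mentioned above.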