2017-11-13 18:05 GMT-08:00 Michael Ma <make0...@gmail.com>:
> 2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0...@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash, so packets from
>>> >>> > the same flow will keep their order when they enter the qdisc/txq,
>>> >>> > which avoids the out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of the QoS algorithm we plan to have
>>> >>> > multiple per-cpu queues for a single TC class and do busy polling
>>> >>> > from a per-class thread to drain these queues. If we can do this
>>> >>> > frequently enough, the out-of-order situation in this polling thread
>>> >>> > should not be that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu
>>> >>> > per-class queues so that packets from the same class and same core
>>> >>> > will be enqueued to the same place. Then a per-class thread polls the
>>> >>> > queues belonging to its class from all the cpus and aggregates them
>>> >>> > into another per-class queue. This can effectively reduce contention
>>> >>> > but inevitably introduces a potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion about working in this direction?
>>> >>
>>> >> In general, there are no meta design discussions in Linux development.
>>> >> Several developers have tried to do lockless qdiscs and similar things
>>> >> in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward;
>>> > we have a per-cpu per-class queue defined like this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > };
>>> >
>>> > The "drain" queue is used to aggregate the per-cpu queues belonging to
>>> > the same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet, so if there is task
>>> > migration, or multiple threads sending packets through the same flow, we
>>> > can theoretically have packets enqueued to different queues and
>>> > aggregated into the "drain" queue out of order.
>>> >
>>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>>> > currently; if there is already a similar effort ongoing, please
>>> > let me know.
>>>
>>> The question I would have is how this would differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell, XPS would be able to get you your per-cpu
>>> functionality; the benefit of it, though, is that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key points was: "do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea for the TX path in:
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to exploit parallelism at the level of a class. The
> parallelism is restricted by the way flows are partitioned across
> queues.
>
>>
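To make the design described above a bit more concrete, here is a rough
sketch of the enqueue path and the per-class drain thread we have in mind.
It reuses the skb_list / bw_rx_groups names from the earlier mail;
skb_list_splice, bwg_enqueue and bwg_drain_thread are only illustrative
names (not existing kernel APIs), the kthread wiring is simplified, and all
synchronization between producer cpus and the drain thread is omitted:

struct bandwidth_group
{
    struct skb_list queues[MAX_CPU_COUNT]; /* per-cpu queues for this class */
    struct skb_list drain;                 /* aggregated queue the shaper consumes */
};

/* enqueue path: append to the queue of the cpu we are running on */
static void bwg_enqueue(struct sk_buff *skb, int bwgid)
{
    struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
    int cpu = get_cpu();

    skb_list_append(&bwg->queues[cpu], skb);
    put_cpu();
}

/* per-class drain thread: busy-poll all per-cpu queues and splice them
 * into the drain queue; rate limiting and handing packets to the device
 * happen on the drain queue, so packets of one class are serialized
 * again at this point */
static int bwg_drain_thread(void *data)
{
    struct bandwidth_group *bwg = data;
    int cpu;

    while (!kthread_should_stop()) {
        for (cpu = 0; cpu < MAX_CPU_COUNT; cpu++)
            skb_list_splice(&bwg->queues[cpu], &bwg->drain);

        /* shaping + dequeue to the driver would go here */

        cond_resched();
    }
    return 0;
}

Per-class ordering is only re-established on the drain queue, which is why
the polling frequency determines how much reordering can accumulate in the
per-cpu queues in between.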
Eric - do you think that if we do busy polling frequently enough, the out-of-order problem will be effectively mitigated? I'll take a look at your slides as well.
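For reference, my understanding of the XPS + mqprio setup Alex mentioned is
roughly the following - the device name, number of traffic classes, queue
layout and cpu masks are only examples:

# 4 traffic classes on a 16-queue device, shaping left to hw/DCB (or none)
tc qdisc add dev eth0 root mqprio num_tc 4 \
    map 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 queues 4@0 4@4 4@8 4@12 hw 0

# pin each tx queue to one cpu via XPS (hex cpu mask per queue)
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus

That gives per-cpu txq selection and keeps flows from local sockets in
order, but as Alex pointed out it doesn't provide a software per-class rate
limit across those queues, which is what the per-class drain thread is
trying to provide.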