2017-11-13 18:05 GMT-08:00 Michael Ma <make0...@gmail.com>:
> 2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0...@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash, so packets from
>>> >>> > the same flow will keep their order when they enter the qdisc/txq,
>>> >>> > which avoids the out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of the QoS algorithm we plan to have
>>> >>> > multiple per-cpu queues for a single TC class and do busy polling
>>> >>> > from a per-class thread to drain these queues. If we can do this
>>> >>> > frequently enough, the out-of-order situation in this polling thread
>>> >>> > should not be that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu
>>> >>> > per-class queues so that packets from the same class and same core
>>> >>> > will be enqueued to the same place. Then a per-class thread polls the
>>> >>> > queues belonging to its class from all the cpus and aggregates them
>>> >>> > into another per-class queue. This can effectively reduce contention
>>> >>> > but inevitably introduces a potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion about working in this direction?
>>> >>
>>> >> In general, there are no meta design discussions in Linux development.
>>> >> Several developers have tried to do lockless qdiscs and similar things
>>> >> in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward;
>>> > we have a per-cpu per-class queue defined like this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > };
>>> >
>>> > The "drain" queue is used to aggregate the per-cpu queues belonging to
>>> > the same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet, so if there is task
>>> > migration, or multiple threads sending packets through the same flow, we
>>> > can theoretically have packets enqueued to different queues and
>>> > aggregated into the "drain" queue out of order.
>>> >
>>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>>> > currently; if there is already a similar effort ongoing, please
>>> > let me know.
>>>
>>> The question I would have is how this would differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell, XPS would be able to get you your per-cpu
>>> functionality; the benefit of it, though, is that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key points was: "do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea for the TX path in:
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to exploit parallelism at the level of a class. The
> parallelism is restricted by the way flows are partitioned across
> queues.
>
>>
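To make the design described above a bit more concrete, here is a rough
sketch of the enqueue path and the per-class drain thread we have in mind.
It reuses the skb_list / bw_rx_groups names from the earlier mail;
skb_list_splice, bwg_enqueue and bwg_drain_thread are only illustrative
names (not existing kernel APIs), the kthread wiring is simplified, and all
synchronization between producer cpus and the drain thread is omitted:

struct bandwidth_group
{
    struct skb_list queues[MAX_CPU_COUNT]; /* per-cpu queues for this class */
    struct skb_list drain;                 /* aggregated queue the shaper consumes */
};

/* enqueue path: append to the queue of the cpu we are running on */
static void bwg_enqueue(struct sk_buff *skb, int bwgid)
{
    struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
    int cpu = get_cpu();

    skb_list_append(&bwg->queues[cpu], skb);
    put_cpu();
}

/* per-class drain thread: busy-poll all per-cpu queues and splice them
 * into the drain queue; rate limiting and handing packets to the device
 * happen on the drain queue, so packets of one class are serialized
 * again at this point */
static int bwg_drain_thread(void *data)
{
    struct bandwidth_group *bwg = data;
    int cpu;

    while (!kthread_should_stop()) {
        for (cpu = 0; cpu < MAX_CPU_COUNT; cpu++)
            skb_list_splice(&bwg->queues[cpu], &bwg->drain);

        /* shaping + dequeue to the driver would go here */

        cond_resched();
    }
    return 0;
}

Per-class ordering is only re-established on the drain queue, which is why
the polling frequency determines how much reordering can accumulate in the
per-cpu queues in between.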
Eric - do you think that if we do busy polling frequently enough, the out-of-order problem will be effectively mitigated? I'll take a look at your slides as well.
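For reference, my understanding of the XPS + mqprio setup Alex mentioned is
roughly the following - the device name, number of traffic classes, queue
layout and cpu masks are only examples:

# 4 traffic classes on a 16-queue device, shaping left to hw/DCB (or none)
tc qdisc add dev eth0 root mqprio num_tc 4 \
    map 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 queues 4@0 4@4 4@8 4@12 hw 0

# pin each tx queue to one cpu via XPS (hex cpu mask per queue)
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus

That gives per-cpu txq selection and keeps flows from local sockets in
order, but as Alex pointed out it doesn't provide a software per-class rate
limit across those queues, which is what the per-class drain thread is
trying to provide.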