On 2020-07-08 09:44, Cong Wang wrote:
On Fri, Jun 26, 2020 at 3:46 AM Maxim Mikityanskiy <maxi...@mellanox.com> wrote:

HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. Mellanox hardware supports hierarchical rate limiting
that can be leveraged by offloading the functionality of HTB.

True, essentially because it has to enforce a global rate limit with
link sharing.

There is a proposal to add a new lockless shaping qdisc, which
you can find in the netdev list.

Thanks for pointing this out! It's sch_ltb (lockless token bucket), right? I see it's very recent. I'll certainly have to dig deeper to understand all the details, but as far as I understand, LTB still has the bottleneck of a single queue (the "drain queue") processed by a single thread. What makes the difference is that enqueue and dequeue are cheap: all the algorithm processing is taken out of these functions, and they work on per-CPU queues.


Our solution addresses two problems of HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

     # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
         classid 1:10

It's possible to move classification to the clsact egress hook, which is
thread-safe and lock-free:

     # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
         action skbedit priority 1:10

This way, classification still happens in software, but the lock
contention is eliminated, and classification happens before the TX queue
is selected, which allows the driver to translate the class to the
corresponding hardware queue.

Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.
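
For reference, a minimal command sequence for this approach might look as
follows (assuming eth0 and the 1:10 classid from the example above; the
clsact qdisc has to be added once before egress filters can be attached):

     # add the lockless classification hook
     # tc qdisc add dev eth0 clsact

     # classify on egress and store the class in skb->priority
     # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
         action skbedit priority 1:10

     # verify the installed filter
     # tc filter show dev eth0 egress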

2. Contention by handling packets. HTB is not multi-queue: it attaches
to the whole net device, and handling of all packets takes the same lock.
Our solution offloads the logic of HTB to the hardware and registers HTB
as a multi-queue qdisc, similarly to how the mq qdisc does it, i.e. HTB is
attached to the netdev, and each queue has its own qdisc. The control
flow is performed by HTB: it replicates the hierarchy of classes in
hardware by calling callbacks of the driver. Leaf classes are represented
by hardware queues. The data path works as follows: a packet is
classified by clsact, the driver selects the hardware queue according
to its class, and the packet is enqueued into this queue's qdisc.
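
To make the intended control flow concrete, a hypothetical configuration
could look like the following (the user-visible knob for enabling the
offload, shown here as an "offload" flag, and the rates are illustrative
only and may differ from what the series ends up using):

     # root HTB qdisc with offload enabled
     # tc qdisc replace dev eth0 root handle 1: htb offload

     # inner class shared by the leaves
     # tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit

     # leaf classes, each backed by a hardware queue
     # tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1gbit ceil 2gbit
     # tc class add dev eth0 parent 1:1 classid 1:20 htb rate 500mbit ceil 2gbit

Each "tc class add" would be translated by HTB into a driver callback that
creates or updates the corresponding rate-limiting node in hardware, with
leaf classes backed by hardware queues.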

Are you sure the HTB algorithm could still work even after you
kind of make each HTB class separate? I think they must still share
something when they borrow bandwidth from each other. This is why I
doubt you can simply add a ->attach() without touching the core
algorithm.

The core algorithm is offloaded to the hardware: the NIC does all the shaping, so all we need to do on the kernel side is put packets into the correct hardware queues.
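
As a rough way to sanity-check that the shaping happens in the NIC rather than in software, one could compare the per-class and per-queue statistics reported by tc with the counters exposed by the driver, e.g.:

     # tc -s class show dev eth0
     # tc -s qdisc show dev eth0
     # ethtool -S eth0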

I think offloading the algorithm processing could give an extra benefit over the purely software implementation of LTB, but that is something I need to explore (e.g., is it realistic to reach the drain queue bottleneck with LTB; how much CPU usage can be saved with HTB offload).

Thank you for your feedback!

Thanks.

