On 2020-12-11 21:16, Cong Wang wrote:
On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maxi...@mellanox.com> wrote:

HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

I do not think this is the reason; tcf_classify() has been called under RCU only on the ingress side for a rather long time. What contention are you talking about here?

When one attaches filters to HTB itself, tcf_classify is called from htb_classify, which is called from htb_enqueue, which in turn runs with the root spinlock of the qdisc held.


     # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
     classid 1:10

It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:

     # tc filter add dev eth0 egress protocol ip flower dst_port 80
     action skbedit priority 1:10

This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue.

Sure, you can use clsact with HTB, or any combination you like, but you can't assume your HTB only works with clsact, can you?

The goal is to eliminate the root lock from the datapath, and the traditional filters attached to HTB itself are handled under that lock. I believe it's a reasonable limitation: the offloaded mode is a new, opt-in mode of operation, and it may have additional hardware-imposed limitations anyway.



Note that this is already compatible with non-offloaded HTB and doesn't require changes to either the kernel or iproute2.

2. Contention by handling packets. HTB is not multi-queue; it attaches to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, its algorithm is done in hardware. HTB registers
itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
netdev, and each queue has its own qdisc. The control flow is still done
by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
of classes in the NIC. Leaf classes are represented by hardware queues.
The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.
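
To make this concrete, a minimal setup for the offloaded mode could look roughly like this ("offload" is the flag name used in this series and may still change; the device name and rates are just placeholders):

     # tc qdisc replace dev eth0 root handle 1: htb offload   # "offload" as proposed in this series
     # tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit
     # tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1gbit ceil 10gbit
     # tc qdisc add dev eth0 clsact
     # tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 action skbedit priority 1:10

The driver learns about the class hierarchy via ndo_setup_tc, and on transmit it maps skb->priority (set by skbedit) to the hardware queue backing class 1:10, so no HTB code runs in the software fast path.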

I did _not_ read your code; from what you describe here, it sounds like you just want a per-queue rate limit instead of a global one. So why bother with HTB, whose goal is a global rate limit?

I would disagree. HTB's goal is hierarchical rate limits with borrowing. Sure, it can be used just to set a global limit, but its main purpose is creating a hierarchy of classes.

And yes, we are talking about the whole netdevice here, not about per-queue limits (we already support per-queue rate limits by means of tx_maxrate, so we wouldn't need any new code for that). The tree of classes is global for the whole netdevice, with hierarchy and borrowing supported. The additional send queues can be considered hardware objects that represent offloaded leaf traffic classes (and this can be extended to multiple queues per class).

So, we are really offloading HTB functionality here, not just using the HTB interface for something else (something simpler). I hope this makes the intent clearer.
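
For illustration, extending the example above with sibling leaves uses the same class commands as for software HTB, only under the offloaded root; each leaf is guaranteed its rate and borrows unused bandwidth from parent 1:1 up to its ceil (the rates are placeholders):

     # tc class add dev eth0 parent 1:1 classid 1:11 htb rate 4gbit ceil 10gbit
     # tc class add dev eth0 parent 1:1 classid 1:12 htb rate 5gbit ceil 10gbit

That borrowing logic across the whole tree is what the hardware is expected to implement when the hierarchy is offloaded.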

And doesn't TBF already work with mq? I mean you can attach it as
a leaf to each mq so that the tree lock will not be shared either, but you'd
lose the benefits of a global rate limit too.

Yes, I'd lose not only the global rate limit, but also multi-level hierarchical limits, which are all provided by this HTB offload - that's why TBF is not really a replacement for this feature.
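
For comparison, the mq + TBF arrangement mentioned above looks roughly like this: one independent token bucket per TX queue, with no shared budget and no parent to borrow from (device name and numbers are placeholders):

     # tc qdisc add dev eth0 root handle 1: mq
     # tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 1gbit burst 128k latency 1ms
     # tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 1gbit burst 128k latency 1ms
     # ... and so on, one tbf per TX queue

Providing that shared, hierarchical budget in hardware is exactly what this HTB offload is meant to add.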

EDT does basically the same,
but it never claims to completely replace HTB. ;)

Thanks.

