On 21-10-21 at 12:41, Xueming Li wrote:
In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.
This patch introduces the shared Rx queue. A PF and representors in the
same Rx domain and switch domain can share an Rx queue set by specifying
a non-zero share group value in the Rx queue configuration.
All ports that share an Rx queue actually share one hardware descriptor
queue and feed all Rx queues from one descriptor supply, which saves
memory.
Polling any queue that uses the same shared Rx queue receives packets
from all member ports. The source port is identified by mbuf->port.
Multiple groups are supported by group ID. The number of queues per port
in a shared group should be identical. Queue indexes are 1:1 mapped
within a shared group.
An example of two share groups:
Group1, 4 shared Rx queues per member port: PF, repr0, repr1
Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
Poll first port for each group:
core  port  queue
0     0     0
1     0     1
2     0     2
3     0     3
4     2     0
5     2     1
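The table above can be reproduced with a small sketch. This is only a
model of the assignment, not testpmd code: struct share_group, struct
poll_entry and build_poll_plan() are hypothetical names, and the port
numbers (0 and 2) are simply the ones used in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one share group (not a DPDK type). */
struct share_group {
	uint16_t first_port; /* any member port; polling one is sufficient */
	uint16_t nb_queues;  /* shared Rx queues per member port */
};

/* One row of the polling table: which (port, queue) a core polls. */
struct poll_entry {
	unsigned int core;
	uint16_t port;
	uint16_t queue;
};

/*
 * Assign one core per shared queue, polling only the first port of each
 * group. Returns the number of entries written, at most max_entries.
 */
static size_t
build_poll_plan(const struct share_group *groups, size_t nb_groups,
		struct poll_entry *plan, size_t max_entries)
{
	size_t n = 0;

	assert(groups != NULL && plan != NULL);
	for (size_t g = 0; g < nb_groups; g++) {
		for (uint16_t q = 0; q < groups[g].nb_queues; q++) {
			if (n == max_entries)
				return n;
			plan[n].core = (unsigned int)n;
			plan[n].port = groups[g].first_port;
			plan[n].queue = q;
			n++;
		}
	}
	return n;
}
```

With groups {first_port 0, 4 queues} and {first_port 2, 2 queues}, the
plan has six entries, one per core, in the same order as the table.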
A shared Rx queue must be polled from a single thread or core. If both
PF0 and representor0 have joined the same share group, pf0rxq0 cannot be
polled on core1 while rep0rxq0 is polled on core2. In fact, polling one
port within a share group is sufficient, since polling any port in the
group returns packets for every port in the group.
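The single-core rule can be expressed as a validation pass over a
polling configuration. The sketch below is an assumption about how such
a check could look, not the check added to testpmd by this series:
struct polled_rxq and shared_rxq_cores_valid() are hypothetical, and
member queues are matched by Rx domain, share group and queue index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one polled Rx queue (not a DPDK type). */
struct polled_rxq {
	uint16_t port;
	uint16_t queue;
	uint16_t rx_domain;   /* Rx domain of the port */
	uint16_t share_group; /* 0 means the queue is not shared */
	unsigned int core;    /* core that polls this queue */
};

/*
 * Return true when every pair of queues that maps to the same shared
 * Rx queue (same Rx domain, share group and queue index) is polled on
 * the same core.
 */
static bool
shared_rxq_cores_valid(const struct polled_rxq *q, size_t n)
{
	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].share_group == 0)
			continue; /* not shared, no constraint */
		for (size_t j = i + 1; j < n; j++) {
			if (q[j].share_group == q[i].share_group &&
			    q[j].rx_domain == q[i].rx_domain &&
			    q[j].queue == q[i].queue &&
			    q[j].core != q[i].core)
				return false;
		}
	}
	return true;
}
```

Polling pf0rxq0 on core1 and rep0rxq0 on core2 within one share group
makes the function return false; moving both to core1 makes it pass.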
There was some discussion about aggregating member ports of the same
group into a dummy port, and there are several ways to achieve it. Since
it is optional, more feedback and requirements need to be collected from
users before making a decision later.
v1:
- initial version
v2:
- add testpmd patches
v3:
- change common forwarding api to macro for performance, thanks Jerin.
- save global variable accessed in forwarding to flowstream to minimize
cache miss
- combined patches for each forwarding engine
- support multiple groups in testpmd "--share-rxq" parameter
- new api to aggregate shared rxq group
v4:
- spelling fixes
- remove shared-rxq support for all forwarding engines
- add dedicated shared-rxq forwarding engine
v5:
- fix grammars
- remove aggregate api and leave it for later discussion
- add release notes
- add deployment example
v6:
- replace RxQ offload flag with device offload capability flag
- add Rx domain
- RxQ is shared when share group > 0
- update testpmd accordingly
v7:
- fix testpmd share group id allocation
- change rx_domain to 16bits
v8:
- add new patch for testpmd to show device Rx domain ID and capability
- new share_qid in RxQ configuration
v9:
- fix some spelling
v10:
- add device capability name api
v11:
- remove macro from device capability name list
v12:
- rephrase
- in forwarding core check, add global flag and RxQ enabled check
v13:
- update imports of new forwarding engine
- rephrase
Xueming Li (7):
ethdev: introduce shared Rx queue
ethdev: get device capability name as string
app/testpmd: dump device capability and Rx domain info
app/testpmd: new parameter to enable shared Rx queue
app/testpmd: dump port info for shared Rx queue
app/testpmd: force shared Rx queue polled on same core
app/testpmd: add forwarding engine for shared Rx queue
app/test-pmd/config.c | 141 +++++++++++++++++-
app/test-pmd/meson.build | 1 +
app/test-pmd/parameters.c | 13 ++
app/test-pmd/shared_rxq_fwd.c | 115 ++++++++++++++
app/test-pmd/testpmd.c | 26 +++-
app/test-pmd/testpmd.h | 5 +
app/test-pmd/util.c | 3 +
doc/guides/nics/features.rst | 13 ++
doc/guides/nics/features/default.ini | 1 +
.../prog_guide/switch_representation.rst | 11 ++
doc/guides/rel_notes/release_21_11.rst | 6 +
doc/guides/testpmd_app_ug/run_app.rst | 9 ++
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
lib/ethdev/rte_ethdev.c | 33 ++++
lib/ethdev/rte_ethdev.h | 38 +++++
lib/ethdev/version.map | 1 +
16 files changed, 415 insertions(+), 6 deletions(-)
create mode 100644 app/test-pmd/shared_rxq_fwd.c
Hi all,
Sorry to jump in this late, but I think this solves only a consequence
of another "problem": the fact that the mbuf descriptor is coupled with
the buffer. You might want to consider another approach that does not
require an API change.
The problem (partially solved by this patch) is that you'll "touch" many
descriptors (the rte_mbuf itself) if you have many queues, or even a few
queues with quite large rings. Those descriptors will all likely be out
of cache when you access them. However, as we demonstrated with mlx5
(see https://packetmill.io/), you can build a descriptor from scratch
out of the NIC hw ring that points to the underlying buffer in an
indirect way. This descriptor can be taken out of the thread-local
buffer pool. You'll actually keep only as many mbuf descriptors in
flight as your burst size. That probably even outperforms what this
patch can do, as you can use only 32 descriptors per thread for any
number of queues of any size.
What that solution does not solve is the need to poll many different
queues. I think that is orthogonal: with NICs getting smarter, we're
going to have many rules sending traffic to per-application,
per-priority queues anyway, maybe even per-microflow queues. To solve
this we would need some kind of queue bitmask set in hw to indicate
which queues to poll, instead of trying all of them. Maybe this can be
done through a FW update? It's a feature we'll want in the future in
any case.
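To make the bitmask idea concrete, here is a minimal sketch of the
software side. It assumes a purely hypothetical 64-bit "ready" mask in
which bit q is set when queue q has packets pending; no such hardware
or firmware interface is implied to exist, and ready_queues() is an
illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Extract the indexes of the queues flagged in a hypothetical ready
 * mask, lowest index first. Returns the number of indexes written to
 * out, at most max_out.
 */
static size_t
ready_queues(uint64_t mask, uint16_t *out, size_t max_out)
{
	size_t n = 0;

	assert(out != NULL);
	while (mask != 0 && n < max_out) {
		uint16_t q = 0;

		/* Find the lowest set bit (portable count-trailing-zeros). */
		while (((mask >> q) & 1) == 0)
			q++;
		out[n++] = q;
		mask &= mask - 1; /* clear the lowest set bit */
	}
	return n;
}
```

A polling loop would then call the Rx burst function only for the
returned indexes instead of trying every queue in turn.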
The shared RX queue is surely an easy fix for the polling itself, but
one problem with it is that it will lead to scattered batches. We'll get
batches of packets from all ports that will surely take different code
paths for anything beyond forwarding, breaking the benefit of batching
(this can also lead to a performance penalty of up to 50% due to
interleaved bursts, see
https://people.kth.se/~dejanko/documents/publications/ordermatters-nsdi22.pdf).
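To illustrate what a scattered batch means for the application, the
sketch below regroups one mixed burst so that packets from the same
source port become contiguous again. It is only an illustration under
stated assumptions: struct pkt is a hypothetical stand-in for the one
mbuf field involved (mbuf->port), regroup_by_port() is not part of any
patch in this series, and the resulting per-port runs are still shorter
than a full burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_MAX 32

/* Hypothetical stand-in for the mbuf fields used here. */
struct pkt {
	uint16_t port; /* source port, as in mbuf->port */
	uint32_t id;   /* arrival order, for illustration only */
};

/*
 * Stable regroup of one burst: packets of the same port become
 * contiguous, ports appear in order of first arrival, and arrival
 * order is preserved within each port. O(n^2), with n <= BURST_MAX.
 */
static void
regroup_by_port(struct pkt **burst, size_t n)
{
	struct pkt *tmp[BURST_MAX];
	unsigned char taken[BURST_MAX] = {0};
	size_t out = 0;

	assert(n <= BURST_MAX);
	for (size_t i = 0; i < n; i++) {
		if (taken[i])
			continue;
		uint16_t port = burst[i]->port;

		/* Collect every remaining packet from this port. */
		for (size_t j = i; j < n; j++) {
			if (!taken[j] && burst[j]->port == port) {
				tmp[out++] = burst[j];
				taken[j] = 1;
			}
		}
	}
	for (size_t i = 0; i < n; i++)
		burst[i] = tmp[i];
}
```

For a burst arriving from ports 1, 2, 1, 3, 2, the regrouped order is
1, 1, 2, 2, 3: three short runs instead of one uniform batch, and the
regrouping itself is extra per-burst work.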
Cheers,
Tom