On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In the current DPDK framework, each Rx
> > > > > > > > > > > > > > > queue is pre-loaded with mbufs for
> > > > > > > > > > > > > > > incoming packets. When the number of
> > > > > > > > > > > > > > > representors scales out in a switch
> > > > > > > > > > > > > > > domain, the memory consumption becomes
> > > > > > > > > > > > > > > significant. Most importantly, polling
> > > > > > > > > > > > > > > all ports leads to high cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch introduces the shared Rx
> > > > > > > > > > > > > > > queue. Ports with the same configuration
> > > > > > > > > > > > > > > in a switch domain could share an Rx
> > > > > > > > > > > > > > > queue set by specifying a sharing group.
> > > > > > > > > > > > > > > Polling any queue using the same shared
> > > > > > > > > > > > > > > Rx queue receives packets from all
> > > > > > > > > > > > > > > member ports. The source port is
> > > > > > > > > > > > > > > identified by mbuf->port.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Port queue numbers in a shared group
> > > > > > > > > > > > > > > should be identical, and the queue index
> > > > > > > > > > > > > > > is 1:1 mapped in the shared group.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A shared Rx queue is supposed to be
> > > > > > > > > > > > > > > polled on the same thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Multiple groups are supported by group
> > > > > > > > > > > > > > > ID.
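To put the model above in code: polling any member port returns packets
of the whole group, and the application recovers the real source port
from mbuf->port. A minimal sketch of that datapath (member_port_id and
handle_packet() are illustrative, not part of the patch):

	#include <rte_ethdev.h>

	struct rte_mbuf *pkts[32];
	uint16_t i, nb;

	/* Poll one member port; packets of all group members land here. */
	nb = rte_eth_rx_burst(member_port_id, 0, pkts, 32);
	for (i = 0; i < nb; i++) {
		/* mbuf->port carries the real source, not the polled port. */
		uint16_t src_port = pkts[i]->port;

		handle_packet(src_port, pkts[i]); /* application-defined */
	}
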
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If so, can this name be
> > > > > > > > > > > > > > changed specifically to representor?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, both PF and representors in a switch
> > > > > > > > > > > > > domain could take advantage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If it is for a generic case, how will the
> > > > > > > > > > > > > > flow ordering be maintained?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question. The control path is almost the
> > > > > > > > > > > > > same as before: PF and representor ports are
> > > > > > > > > > > > > still needed, and rte flows are not
> > > > > > > > > > > > > impacted. Queues are still needed for each
> > > > > > > > > > > > > member port; descriptors (mbufs) will be
> > > > > > > > > > > > > supplied from the shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
> > > > > > > > > > > > multiple ethdev receive queues land into the
> > > > > > > > > > > > same receive queue. In that case, how is the
> > > > > > > > > > > > flow order maintained for the respective
> > > > > > > > > > > > receive queues?
> > > > > > > > > > >
> > > > > > > > > > > I guess the question is about the testpmd
> > > > > > > > > > > forward stream? The forwarding logic has to be
> > > > > > > > > > > changed slightly in case of shared rxq:
> > > > > > > > > > > basically, for each packet in the rx_burst
> > > > > > > > > > > result, look up the source stream according to
> > > > > > > > > > > mbuf->port and forward to the target fs. Packets
> > > > > > > > > > > from the same source port could be grouped as a
> > > > > > > > > > > small burst to process; this will accelerate
> > > > > > > > > > > performance if traffic comes from a limited
> > > > > > > > > > > number of ports. I'll introduce some common API
> > > > > > > > > > > to do shared rxq forwarding and call it with a
> > > > > > > > > > > packet-handling callback, so it suits all
> > > > > > > > > > > forwarding engines. Will send patches soon.
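A rough sketch of that common helper, for illustration only (the
callback type and lookup_stream() are placeholders, not the final API;
struct fwd_stream is testpmd's per-stream state from testpmd.h):

	typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
				      struct rte_mbuf **pkts);

	/* Split a shared-rxq burst into sub-bursts of consecutive packets
	 * from the same source port and hand each sub-burst to the
	 * per-stream forwarding callback. */
	static void
	shared_rxq_fwd(uint16_t nb_rx, struct rte_mbuf **pkts,
		       packet_fwd_cb fwd)
	{
		uint16_t i, start = 0;

		while (start < nb_rx) {
			uint16_t port = pkts[start]->port;

			for (i = start + 1; i < nb_rx; i++)
				if (pkts[i]->port != port)
					break;
			/* lookup_stream(): port -> fwd_stream bookkeeping. */
			fwd(lookup_stream(port), i - start, &pkts[start]);
			start = i;
		}
	}
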
> > > > > > > > > >
> > > > > > > > > > All ports will put the packets into the same
> > > > > > > > > > queue (the shared queue), right? Doesn't this mean
> > > > > > > > > > only a single core will poll? What will happen if
> > > > > > > > > > there are multiple cores polling — won't it cause
> > > > > > > > > > problems?
> > > > > > > > > >
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about the solution;
> > > > > > > > > > can't this work in a way transparent to the
> > > > > > > > > > application?
> > > > > > > > >
> > > > > > > > > Discussed with Jerin: a new API introduced in v3 2/8
> > > > > > > > > aggregates ports in the same group into one new
> > > > > > > > > port. Users could schedule polling on the aggregated
> > > > > > > > > port instead of on all member ports.
> > > > > > > >
> > > > > > > > The v3 still has testpmd changes in the fastpath,
> > > > > > > > right? IMO, for this feature we should not change the
> > > > > > > > fastpath of the testpmd application. Instead, testpmd
> > > > > > > > can use aggregated ports, probably as a separate
> > > > > > > > fwd_engine, to show how to use this feature.
> > > > > > >
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > poll a shared Rxq:
> > > > > > > 1. Poll each member port.
> > > > > > >    All forwarding engines can be reused to work as
> > > > > > >    before. My testpmd patches are efforts towards this
> > > > > > >    direction. Does your PMD support this?
> > > > > >
> > > > > > Unfortunately not. More than that, every application needs
> > > > > > to change to support this model.
> > > > >
> > > > > Both strategies need the user application to resolve the
> > > > > port ID from the mbuf and process accordingly. This one
> > > > > doesn't demand an aggregated port, and there is no polling
> > > > > schedule change.
> > > >
> > > > I was thinking the mbuf will be updated from the
> > > > driver/aggregator port as it comes to the application.
> > > >
> > > > > > > 2. Poll the aggregated port.
> > > > > > >    Besides the forwarding engine, more work is needed to
> > > > > > >    demo it. This is an optional API, not supported by my
> > > > > > >    PMD yet.
> > > > > >
> > > > > > We are thinking of implementing this in the PMD when it
> > > > > > comes to it, i.e. without application changes in fastpath
> > > > > > logic.
> > > > >
> > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > according to its logic. Forwarding engines need to adapt to
> > > > > support shared Rxq; fortunately, in testpmd this can be done
> > > > > with an abstract API.
> > > > >
> > > > > Let's defer part 2 until some PMD really supports it and it
> > > > > has been tested — what do you think?
> > > >
> > > > We are not planning to use this feature, so either way it is
> > > > OK to me. I leave it to the ethdev maintainers to decide
> > > > between 1 vs 2.
> > > >
> > > > I do have a strong opinion against changing the testpmd basic
> > > > forwarding engines for this feature. I would like to keep them
> > > > simple and fastpath-optimized, and would like to add a
> > > > separate forwarding engine as a means to verify this feature.
> > >
> > > +1 to that.
> > > I don't think it is a 'common' feature, so a separate FWD mode
> > > seems like the best choice to me.
> >
> > -1 :)
> > There was an internal requirement from our test team: they need to
> > verify that all features like packet content, rss, vlan, checksum,
> > rte_flow... keep working on top of a shared Rx queue.
>
> Then I suppose you'll need to write a really comprehensive
> fwd-engine to satisfy your test team :)
> Speaking seriously, I still don't understand why you need all
> available fwd-engines to verify this feature.
The shared Rxq is a low-level feature; we need to make sure the
driver's higher-level features keep working on top of it. fwd-engines
like csum check the input packet and enable L3/L4 checksum and tunnel
offloads accordingly; other engines do their own feature verification.
All test automation could be reused seamlessly once these engines are
supported.

> From what I understand, the main purpose of your changes to
> test-pmd is to allow forwarding packets through a different
> fwd_stream (TX through a different HW queue).

Yes. Each mbuf in the burst may come from a different port, and
testpmd's current fwd-engines rely heavily on the source forwarding
stream. That's why the patch divides the burst result mbufs into
sub-bursts and uses the original fwd-engine callback to handle them.
How they are handled is not changed.

> In theory, if implemented in a generic and extendable way, that
> might be a useful add-on to testpmd fwd functionality.
> But the current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point in
> polluting the basic fwd cases with it.

Shared Rxq is an ethdev feature that impacts how packets get handled,
so it's natural to update the forwarding engines to keep them from
breaking. The new macro is introduced to minimize the performance
impact; I'm also wondering whether there is a more elegant solution :)
The current performance penalty is one "if unlikely" per burst.

Think in the reverse direction: if we don't update the fwd-engines
here, they all malfunction when shared rxq is enabled and users can't
verify driver features. Is that what you expect?

> BTW, as a side note, the code below looks bogus to me:
>
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> +		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
> +{
> +	uint16_t i, nb_fs_rx = 1, port;
> +
> +	/* Locate real source fs according to mbuf->port. */
> +	for (i = 0; i < nb_rx; ++i) {
> +		rte_prefetch0(pkts_burst[i + 1]);
>
> You access pkts_burst[] beyond the array boundaries, and you also
> ask the CPU to prefetch some unknown and possibly invalid address.
>
> > Based on the patch, I believe the impact has been minimized.
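For reference, that prefetch can simply be bounded by the burst size; a
guarded variant of the quoted loop (a sketch only, against the same
assumed signature) would be:

	/* Locate real source fs according to mbuf->port. */
	for (i = 0; i < nb_rx; ++i) {
		/* Prefetch the next mbuf only if one actually exists. */
		if (i + 1 < nb_rx)
			rte_prefetch0(pkts_burst[i + 1]);
		/* ... stream lookup and sub-burst handling as before ... */
	}
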
> > > > > > > > > > Overall, is this for optimizing memory for the
> > > > > > > > > > port representors? If so, can't we have a
> > > > > > > > > > port-representor-specific solution? Reducing the
> > > > > > > > > > scope can reduce the complexity it brings.
> > > > > > > > > >
> > > > > > > > > > > > If this offload is only useful for the
> > > > > > > > > > > > representor case, can we make this offload
> > > > > > > > > > > > specific to the representor case by changing
> > > > > > > > > > > > its name and scope?
> > > > > > > > > > >
> > > > > > > > > > > It works for both PF and representors in the
> > > > > > > > > > > same switch domain; for an application like OVS,
> > > > > > > > > > > few changes are needed to apply it.
> > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemi...@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > >     <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
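
For completeness, application-side setup against the API added by this
patch is expected to look roughly like this. A sketch only: nb_rxd,
socket_id and mb_pool are the usual setup variables, and every member
port of the group runs the same code with the same shared_group:

	struct rte_eth_dev_info dev_info;
	struct rte_eth_conf conf = { 0 };
	struct rte_eth_rxconf rxq_conf;

	rte_eth_dev_info_get(port_id, &dev_info);
	/* The offload must be advertised by the device. */
	if ((dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		rte_exit(EXIT_FAILURE, "shared Rx queue not supported\n");

	conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rte_eth_dev_configure(port_id, 1, 1, &conf);

	rxq_conf = dev_info.default_rxconf;
	rxq_conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxq_conf.shared_group = 0; /* same group on every member port */
	rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id,
			       &rxq_conf, mb_pool);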