On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > > >
> > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>; NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > >
> > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>; NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemi...@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In the current DPDK framework, each RX queue is pre-loaded with mbufs for incoming packets. When the number of representors scales out in a switch domain, the memory consumption becomes significant. More importantly, polling all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with the same configuration in a switch domain could share an RX queue set by specifying a sharing group. Polling any queue using the same shared RX queue receives packets from all member ports. The source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > Port queue numbers in a shared group should be identical, and the queue index is 1:1 mapped within the shared group.
> > > > > > > > > >
> > > > > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > > > >
> > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > >
> > > > > > > > > Is this offload specific to the representor? If so, can this name be changed specifically to representor?
> > > > > > > >
> > > > > > > > Yes, PF and representors in a switch domain could take advantage of it.
> > > > > > > >
> > > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > >
> > > > > > > > Not quite sure that I understood your question. The control path is almost the same as before, PF and representor ports are still needed, rte flows are not impacted.
> > > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be supplied from the shared Rx queue in my PMD implementation.
> > > > > > >
> > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > > >
> > > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > > > > > Packets from the same source port could be grouped as a small burst to process; this will accelerate performance if traffic comes from a limited number of ports. I'll introduce some common API to do shared rxq forwarding, called with a packet handling callback, so it suits all forwarding engines. Will send patches soon.
> > > > > >
> > > > >
> > > > > All ports will put the packets into the same queue (shared queue), right? Does this mean only a single core will poll it? What will happen if there are multiple cores polling, won't it cause a problem?
> > > > >
> > > > > And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the application?
> > > >
> > > > Discussed with Jerin, a new API is introduced in v3 2/8 that aggregates ports in the same group into one new port. Users could schedule polling on the aggregated port instead of all member ports.
> > >
> > > The v3 still has testpmd changes in fastpath. Right? IMO, for this feature, we should not change the fastpath of the testpmd application. Instead, testpmd can use aggregated ports probably as a separate fwd_engine to show how to use this feature.
> >
> > Good point to discuss :) There are two strategies to polling a shared Rxq:
> > 1. polling each member port
> >    All forwarding engines can be reused to work as before.
> >    My testpmd patches are efforts towards this direction.
> >    Does your PMD support this?
>
> Not unfortunately. More than that, every application needs to change to support this model.
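
To make that fastpath change concrete, the per-burst demux in a forwarding engine would look roughly like below. This is an untested sketch, not the actual testpmd patch; the pkt_handler_t callback and the shared_rxq_poll() name are made up for illustration only:

    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Hypothetical per-stream handler: processes a small burst whose packets
     * all arrived from the same source port (taken from mbuf->port). */
    typedef void (*pkt_handler_t)(uint16_t src_port, struct rte_mbuf **pkts,
                                  uint16_t nb_pkts);

    /* Poll one queue of the shared Rx queue group and demultiplex the result
     * by mbuf->port, passing consecutive packets from the same source port
     * to the handler as a sub-burst. */
    static inline void
    shared_rxq_poll(uint16_t port_id, uint16_t queue_id, pkt_handler_t handler)
    {
            struct rte_mbuf *pkts[BURST_SIZE];
            uint16_t nb_rx, i, start;

            nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
            if (nb_rx == 0)
                    return;

            start = 0;
            for (i = 1; i < nb_rx; i++) {
                    /* Flush the accumulated sub-burst when the source port changes. */
                    if (pkts[i]->port != pkts[start]->port) {
                            handler(pkts[start]->port, &pkts[start], i - start);
                            start = i;
                    }
            }
            handler(pkts[start]->port, &pkts[start], nb_rx - start);
    }

Grouping consecutive packets by mbuf->port keeps the callback bursts reasonably large when traffic comes from a limited number of ports.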
Both strategies need the user application to resolve the port ID from the mbuf and process accordingly. This one doesn't demand the aggregated port, and there is no polling schedule change.

>
> > 2. polling aggregated port
> >    Besides the forwarding engine, more work is needed to demo it.
> >    This is an optional API, not supported by my PMD yet.
>
> We are thinking of implementing this PMD when it comes to it, i.e. without application change in fastpath logic.

Fastpath has to resolve the port ID anyway and forward according to the forwarding logic. Forwarding engines need to adapt to support shared Rxq. Fortunately, in testpmd, this can be done with an abstract API. Let's defer part 2 until some PMD really supports it and it is tested, what do you think?

>
> > > > >
> > > > > Overall, is this for optimizing memory for the port representors? If so, can't we have a port representor specific solution? Reducing scope can reduce the complexity it brings.
> > > > >
> > > > > > > If this offload is only useful for the representor case, can we make this offload specific to the representor case by changing its name and scope?
> > > > > >
> > > > > > It works for both PF and representors in the same switch domain; for an application like OVS, few changes to apply.
> > > > > >
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemi...@nvidia.com>
> > > > > > > > > > ---
> > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > +
> > > > > > > > > > +Shared Rx queue
> > > > > > > > > > +---------------
> > > > > > > > > > +
> > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > +
> > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > > +
> > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > >
> > > > > > > > > >  Packet type parsing
> > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > >  Queue start/stop     =
> > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > >  Burst mode info      =
> > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > >  MTU update           =
> > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > >
> > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > >  Basic SR-IOV
> > > > > > > > > >  ------------
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > >  };
> > > > > > > > > >
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > >
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >
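
For completeness, this is roughly how an application would enable it with the fields added above. It is only a sketch against this v1 API proposal and may change in later versions; Tx queue setup, most error handling and the group numbering policy are omitted or assumed:

    #include <errno.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Sketch: enable the shared Rx queue offload on one member port and put
     * its queue 0 into shared group 1. The same call is repeated for every
     * port of the group; queue index mapping across member ports is 1:1. */
    static int
    setup_shared_rxq(uint16_t port_id, struct rte_mempool *mb_pool)
    {
            struct rte_eth_dev_info dev_info;
            struct rte_eth_conf port_conf = {0};
            struct rte_eth_rxconf rxq_conf;
            int ret;

            ret = rte_eth_dev_info_get(port_id, &dev_info);
            if (ret != 0)
                    return ret;
            /* Only usable when the PMD reports the capability. */
            if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
                    return -ENOTSUP;

            port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
            ret = rte_eth_dev_configure(port_id, 1, 1, &port_conf);
            if (ret != 0)
                    return ret;

            rxq_conf = dev_info.default_rxconf;
            rxq_conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
            rxq_conf.shared_group = 1; /* same group id on all member ports */

            return rte_eth_rx_queue_setup(port_id, 0, 512,
                                          rte_eth_dev_socket_id(port_id),
                                          &rxq_conf, mb_pool);
    }

Polling queue 0 of any member port afterwards returns packets of the whole group, with the real source port in mbuf->port.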