> -----Original Message-----
> From: dev <dev-boun...@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Wednesday, August 11, 2021 8:59 PM
> To: Ferruh Yigit <ferruh.yi...@intel.com>; Jerin Jacob <jerinjac...@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <tho...@monjalon.net>;
> Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yi...@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemi...@nvidia.com>; Jerin Jacob <jerinjac...@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <tho...@monjalon.net>;
> > Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >> -----Original Message-----
> > >> From: Jerin Jacob <jerinjac...@gmail.com>
> > >> Sent: Wednesday, August 11, 2021 4:03 PM
> > >> To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>;
> > >> NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> > >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >>
> > >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Jerin Jacob <jerinjac...@gmail.com>
> > >>>> Sent: Monday, August 9, 2021 9:51 PM
> > >>>> To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>;
> > >>>> NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> > >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >>>>
> > >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemi...@nvidia.com> wrote:
> > >>>>>
> > >>>>> In the current DPDK framework, each Rx queue is pre-loaded with mbufs
> > >>>>> for incoming packets. When the number of representors scales out in a
> > >>>>> switch domain, the memory consumption becomes significant. More
> > >>>>> importantly, polling all ports leads to high cache miss rates, high
> > >>>>> latency and low throughput.
> > >>>>>
> > >>>>> This patch introduces a shared Rx queue. Ports with the same
> > >>>>> configuration in a switch domain can share an Rx queue set by
> > >>>>> specifying a sharing group. Polling any queue that uses the same
> > >>>>> shared Rx queue receives packets from all member ports. The source
> > >>>>> port is identified by mbuf->port.
> > >>>>>
> > >>>>> The queue number of each port in a shared group should be identical.
> > >>>>> Queue indexes are mapped 1:1 within a shared group.
> > >>>>>
> > >>>>> A shared Rx queue is supposed to be polled on the same thread.
> > >>>>>
> > >>>>> Multiple groups are supported by group ID.
> > >>>>
> > >>>> Is this offload specific to the representor? If so, can this name be
> > >>>> changed to be representor-specific?
> > >>>
> > >>> Yes, both the PF and representors in a switch domain can take advantage of it.
> > >>>
> > >>>> If it is for a generic case, how will the flow ordering be maintained?
> > >>>
> > >>> Not quite sure that I understood your question. The control path is
> > >>> almost the same as before: the PF and representor ports are still
> > >>> needed, and rte_flow is not impacted.
> > >>> Queues are still needed for each member port; descriptors (mbufs) will
> > >>> be supplied from the shared Rx queue in my PMD implementation.
> > >>
> > >> My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > >> offload, multiple ethdev receive queues land into the same receive
> > >> queue. In that case, how is the flow order maintained for the
> > >> respective receive queues?
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding
> > > logic has to be changed slightly in case of a shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source
> > > stream according to mbuf->port and forward it to the target fs.
> > > Packets from the same source port could be grouped as a small burst to
> > > process; this accelerates performance if traffic comes from a limited
> > > number of ports. I'll introduce some common API to do shared rxq
> > > forwarding, called with a packet handling callback, so it suits all
> > > forwarding engines. Will send patches soon.
> > >
> >
> > All ports will put the packets into the same queue (shared queue), right?
> > Does this mean only a single core will poll? What will happen if there are
> > multiple cores polling, won't it cause problems?
>
> This has been mentioned in the commit log: the shared rxq is supposed to be
> polled in a single thread (core) - I think it should be "MUST".
> The result is unexpected if there are multiple cores polling; that's why I
> added a polling schedule check in testpmd.
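To illustrate the forwarding change described above, here is a rough sketch of
how a forwarding engine might demultiplex a shared-rxq burst by mbuf->port.
This is only an illustration, not the actual testpmd patch; shared_rxq_forward()
and forward_pkts() are made-up names, the latter standing in for whatever
per-stream handling callback the engine provides:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static void
    shared_rxq_forward(uint16_t any_member_port, uint16_t queue_id,
                       void (*forward_pkts)(uint16_t src_port,
                                            struct rte_mbuf **pkts,
                                            uint16_t nb_pkts))
    {
            struct rte_mbuf *pkts[BURST_SIZE];
            uint16_t nb_rx, i, start = 0;

            /* One burst returns packets from all member ports of the group. */
            nb_rx = rte_eth_rx_burst(any_member_port, queue_id, pkts, BURST_SIZE);

            /* Group consecutive packets from the same source port into a
             * small sub-burst and hand it to that port's forward stream.
             */
            for (i = 1; i <= nb_rx; i++) {
                    if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
                            forward_pkts(pkts[start]->port, &pkts[start],
                                         i - start);
                            start = i;
                    }
            }
    }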
V2 with the testpmd code uploaded, please check.

> Similar to the rx/tx burst functions, a queue can't be polled on multiple
> threads (cores), and for performance reasons there is no such check in the
> EAL API.
>
> If users want to utilize multiple cores to distribute workloads, it's
> possible to define more groups; queues in different groups can be polled on
> multiple cores.
>
> It's possible to poll every member port in a group, but it is not necessary;
> any port in the group can be polled to get packets for all ports in the group.
>
> If a member port is subject to hot plug/remove, it's possible to create a
> vdev with the same queue number, copy the rxq object and poll the vdev as a
> dedicated proxy for the group.
>
> >
> > And if this requires specific changes in the application, I am not sure
> > about the solution; can't this work in a way that is transparent to the
> > application?
>
> Yes, we considered different options in the design stage. One possible
> solution is to cache received packets in rings; this can be done in the
> ethdev layer, but I'm afraid it brings fewer benefits, and the user still
> has to be aware of multiple-core polling.
> This could be done as a wrapper PMD later, with more effort.
>
> >
> > Overall, is this for optimizing memory for the port representors? If so,
> > can't we have a port-representor-specific solution? Reducing the scope can
> > reduce the complexity it brings.
>
> This feature supports both the PF and representors, and yes, the major issue
> is the memory of representors. Polling all representors also introduces more
> core cache misses and latency. This feature essentially aggregates all ports
> in a group as one port.
> On the other hand, it's useful for rte_flow to create offloading flows using
> a representor as a regular port ID.
>
> Any new solution/suggestion is welcome; my head is buried in PMD code :)
>
> >
> > >> If this offload is only useful for the representor case, can we make it
> > >> specific to the representor case by changing its name and scope?
> > >
> > > It works for both the PF and representors in the same switch domain; for
> > > an application like OVS, few changes need to be applied.
> > >
> > >>
> > >>>
> > >>>>
> > >>>>>
> > >>>>> Signed-off-by: Xueming Li <xuemi...@nvidia.com>
> > >>>>> ---
> > >>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >>>>>  doc/guides/nics/features/default.ini            |  1 +
> > >>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
> > >>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >>>>>  5 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > >>>>> index a96e12d155..2e2a9b1554 100644
> > >>>>> --- a/doc/guides/nics/features.rst
> > >>>>> +++ b/doc/guides/nics/features.rst
> > >>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >>>>>
> > >>>>>  ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >>>>>
> > >>>>>
> > >>>>> +.. _nic_features_shared_rx_queue:
> > >>>>> +
> > >>>>> +Shared Rx queue
> > >>>>> +---------------
> > >>>>> +
> > >>>>> +Supports shared Rx queue for ports in same switch domain.
> > >>>>> +
> > >>>>> +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > >>>>> +* **[provides] mbuf**: ``mbuf.port``.
> > >>>>> +
> > >>>>> +
> > >>>>>  .. _nic_features_packet_type_parsing:
> > >>>>>
> > >>>>>  Packet type parsing
> > >>>>> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > >>>>> index 754184ddd4..ebeb4c1851 100644
> > >>>>> --- a/doc/guides/nics/features/default.ini
> > >>>>> +++ b/doc/guides/nics/features/default.ini
> > >>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >>>>>  Queue start/stop     =
> > >>>>>  Runtime Rx queue setup =
> > >>>>>  Runtime Tx queue setup =
> > >>>>> +Shared Rx queue      =
> > >>>>>  Burst mode info      =
> > >>>>>  Power mgmt address monitor =
> > >>>>>  MTU update           =
> > >>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> index ff6aa91c80..45bf5a3a10 100644
> > >>>>> --- a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >>>>>  .. [1] `Ethernet switch device driver model (switchdev)
> > >>>>>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > >>>>>
> > >>>>> +- Memory usage of representors is huge when number of representor grows,
> > >>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > >>>>> +  Polling the large number of ports brings more CPU load, cache miss and
> > >>>>> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > >>>>> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > >>>>> +  is present in Rx offloading capability of device info. Setting the
> > >>>>> +  offloading flag in device Rx mode or Rx queue configuration to enable
> > >>>>> +  shared Rx queue. Polling any member port of shared Rx queue can return
> > >>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > >>>>> +
> > >>>>>  Basic SR-IOV
> > >>>>>  ------------
> > >>>>>
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>>> index 9d95cd11e1..1361ff759a 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>>> @@ -127,6 +127,7 @@ static const struct {
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > >>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >>>>>  };
> > >>>>>
> > >>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > >>>>> index d2b27c351f..a578c9db9d 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.h
> > >>>>> +++ b/lib/ethdev/rte_ethdev.h
> > >>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > >>>>> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > >>>>>         /**
> > >>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >>>>>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > >>>>> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > >>>>>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >>>>>  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > >>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > >>>>> +/**
> > >>>>> + * Rx queue is shared among ports in same switch domain to save memory,
> > >>>>> + * avoid polling each port. Any port in group can be used to receive packets.
> > >>>>> + * Real source port number saved in mbuf->port field.
> > >>>>> + */
> > >>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >>>>>
> > >>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >>>>>                              DEV_RX_OFFLOAD_UDP_CKSUM | \
> > >>>>> --
> > >>>>> 2.25.1
> > >>>>>
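For anyone who wants to try it, application-side setup could look roughly like
the sketch below. This is not part of the patch, only an illustration against
the v1 definitions above (RTE_ETH_RX_OFFLOAD_SHARED_RXQ and
rte_eth_rxconf.shared_group); the helper name and parameters are made up and
details may change in later revisions. Each member port of the switch domain
would set the flag and the same group on its Rx queues:

    #include <errno.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    static int
    setup_shared_rxq(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc,
                     struct rte_mempool *mp, uint32_t group)
    {
            struct rte_eth_dev_info dev_info;
            struct rte_eth_rxconf rxconf;
            int ret;

            ret = rte_eth_dev_info_get(port_id, &dev_info);
            if (ret != 0)
                    return ret;
            /* The capability is reported in the Rx offload capabilities. */
            if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
                    return -ENOTSUP;

            rxconf = dev_info.default_rxconf;
            rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
            rxconf.shared_group = group; /* same group on all member ports */

            return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
                                          rte_eth_dev_socket_id(port_id),
                                          &rxconf, mp);
    }

After every member port is configured and started this way, polling the queue
on any one member port returns packets for the whole group, with the real
source port reported in mbuf->port.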