> -----Original Message-----
> From: Jerin Jacob <jerinjac...@gmail.com>
> Sent: Monday, August 30, 2021 5:31 PM
> To: Xueming(Steven) Li <xuemi...@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>;
> NBU-Contact-Thomas Monjalon <tho...@monjalon.net>;
> Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
>
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yi...@intel.com>;
> > > NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybche...@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yi...@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <tho...@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybche...@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yi...@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > <tho...@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybche...@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemi...@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjac...@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yi...@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <tho...@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybche...@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemi...@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In the current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When the number of
> > > > > > > > > > representors scales out in a switch domain, the memory
> > > > > > > > > > consumption becomes significant. Most importantly,
> > > > > > > > > > polling all ports leads to high cache miss, high latency
> > > > > > > > > > and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > > > > configuration in a switch domain could share an RX queue set
> > > > > > > > > > by specifying a sharing group.
> > > > > > > > > > Polling any queue using the same shared RX queue receives
> > > > > > > > > > packets from all member ports. The source port is identified by
> > > > > > > > > > mbuf->port.
> > > > > > > > > > Port queue numbers in a shared group should be identical.
> > > > > > > > > > Queue index is 1:1 mapped in the shared group.
> > > > > > > > > >
> > > > > > > > > > A shared RX queue must be polled on a single thread or core.
> > > > > > > > > >
> > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemi...@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjac...@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > The Rx queue object could be used as a shared Rx queue
> > > > > > > > > > object; it's important to clear all queue control callback
> > > > > > > > > > APIs that use the queue object:
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > >
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > >
> > > > > > > > > Not able to see anyone setting/creating this group ID in the test
> > > > > > > > > application. How is this group created?
> > > > > > > >
> > > > > > > > Nice catch, the initial testpmd version only supports one
> > > > > > > > default group (0).
> > > > > > > > All ports that support shared-rxq are assigned to the same group.
> > > > > > > >
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support groups other than the default.
> > > > > > > >
> > > > > > > > To support more groups simultaneously, we need to consider
> > > > > > > > testpmd forwarding stream core assignment; all streams in the same
> > > > > > > > group need to stay on the same core.
> > > > > > > > It's possible to specify how many ports to increase the group
> > > > > > > > number, but the user must schedule stream affinity carefully -
> > > > > > > > error prone.
> > > > > > > >
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customers; the doubt is whether it is valuable to support a
> > > > > > > > multiple-groups test.
> > > > > > >
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > >
> > > > > > > My question was more about who creates this group and how. Shouldn't
> > > > > > > we need an API to create shared_group? If we do the following, at
> > > > > > > least I can think of how it can be implemented in SW or other HW.
> > > > > > >
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group (which internally
> > > > > > >   fetches from the Rx queues _attached_)
> > > > > > >
> > > > > > > Does the above kind of sequence break your representor use case?
> > > > > >
> > > > > > Seems more like a set of EAL wrappers. The current API tries to minimize
> > > > > > the application effort to adapt to shared-rxq.
> > > > > > - step 1, not sure how important it is to create the group with an API;
> > > > > > in rte_flow, a group is created on demand.
> > > > >
> > > > > Which rte_flow pattern/action is this for?
> > > >
> > > > No rte_flow for this; I just recalled that the group in rte_flow is
> > > > not created along with the flow, not via an API.
> > > > I don't see anything else to create along with the group; I just doubt
> > > > whether it is valuable to introduce a new API set to manage groups.
> > >
> > > See below.
> > >
> > > > > > - step 2, currently, the attaching is done in
> > > > > > rte_eth_rx_queue_setup; specify the offload and group in the rx_conf struct.
> > > > > > - step 3, define a dedicated API to receive packets from the shared rxq?
> > > > > > Looks clear to receive packets from the shared rxq.
> > > > > > Currently, the rxq objects in a share group are the same - the shared rxq -
> > > > > > so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > be used to receive packets from any port in the group, normally the
> > > > > > first port (PF) in the group.
> > > > > > An alternative way is defining a vdev with the same queue number;
> > > > > > copying the rxq objects will make the vdev a proxy of
> > > > > > the shared rxq group - this could be a helper API.
> > > > > >
> > > > > > Anyway the wrapper doesn't break the use case; the step 3 API is
> > > > > > clearer, need to understand how to implement it efficiently.
> > > > >
> > > > > Are you doing this feature based on any HW support, or is it a
> > > > > pure SW thing? If it is SW, it is better to have just a new vdev
> > > > > like drivers/net/bonding/. With this we can help aggregate
> > > > > multiple Rxqs across the multiple ports of the same driver.
> > > >
> > > > Based on HW support.
> > >
> > > In Marvell HW, we do some support; I will outline it here with some queries on this.
> > >
> > > # We need to create some new HW structure for aggregation
> > > # Connect each Rxq to the new HW structure for aggregation
> > > # Use rx_burst from the new HW structure.
> > >
> > > Could you outline your HW support?
> > >
> > > Also, I am not able to understand how this will reduce the memory;
> > > at least in our HW we need to create more memory now to deal with
> > > this, as we need to deal with a new HW structure.
> > >
> > > How does it reduce the memory in your HW?
> > > Also, if memory is the constraint, why NOT reduce the number of queues?
> >
> > Glad to know that Marvell is working on this; what's the status of the
> > driver implementation?
> >
> > My PMD implementation is very similar: a new HW object, a shared memory
> > pool, is created to replace the per-rxq memory pool.
> > A legacy rxq is fed as many allocated mbufs as it has descriptors;
> > now shared rxqs share the same pool - no need to supply mbufs for each rxq,
> > just feed the shared rxq.
> >
> > So the memory saving reflects the mbufs per rxq: even with 1000 representors
> > in a shared rxq group, the mbufs consumed are those of one rxq.
> > In other words, new members in a shared rxq don't allocate new mbufs to
> > feed the rxq, they just share with the existing shared rxq (HW mempool).
> > The memory required to set up each rxq doesn't change too much, agreed.
>
> We can ask the application to configure the same mempool for multiple RQs
> too. Right? If the saving is based on sharing the mempool
> with multiple RQs.
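The sizing argument that follows in the thread can be sanity-checked with a short script. The per-mbuf size (~2.3 KB), descriptor count (512), queue count (4 per device), and representor count (1000) are the example values used in this thread, not anything fixed by the API:

```python
# Back-of-the-envelope check of the mempool sizing discussed in this thread.
# Assumed example values: ~2.3 KB per mbuf, 512 descriptors per rxq,
# 4 rxqs per device, 1000 representor ports.
MBUF_KB = 2.3
DESC_PER_RXQ = 512
RXQS_PER_DEVICE = 4
NUM_PORTS = 1000

# mbufs pre-filled by one device's rxqs
per_device_mb = MBUF_KB * DESC_PER_RXQ * RXQS_PER_DEVICE / 1024
# every port fills its own rxqs independently
per_rxq_pools_gb = per_device_mb * NUM_PORTS / 1024
# one fill shared by the whole group
shared_rxq_mb = per_device_mb

print(f"one device: {per_device_mb:.1f} MB")                        # 4.6 MB
print(f"{NUM_PORTS} ports, per-port fill: {per_rxq_pools_gb:.1f} GB")
print(f"{NUM_PORTS} ports, shared rxq group: {shared_rxq_mb:.1f} MB")
```

The per-port total comes out around 4.5 GB (the thread rounds it to 4.6 G), versus a single device's 4.6 MB when the group shares one fill.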
Yes, using the same mempool is fundamental. The difference is how many mbufs
are allocated from the pool.
Assuming 512 descriptors per rxq and 4 rxqs per device, it's
2.3K (mbuf) * 512 * 4 = 4.6M / device.
To support 1000 representors, we'd need a 4.6G mempool :)
For a shared rxq, only 4.6M (one device's worth) of mbufs are allocated from
the mempool; they are shared by all rxqs in the group.

> > > # Also, I was thinking, one way to avoid the fast path or ABI change
> > > would be like:
> > > # Driver initializes one more eth_dev_ops in the driver as an
> > > aggregator ethdev
> > > # devargs of the new ethdev, or a specific API like
> > > drivers/net/bonding/rte_eth_bond.h, can take as argument the (port, queue)
> > > tuples which need to be aggregated by the new ethdev port
> > > # No change in fastpath or ABI is required in this model.
> >
> > This could be an option to access the shared rxq. What's the difference
> > compared with a new PMD?
>
> No ABI and fast path changes are required.
>
> > What's the difference for the PMD driver in creating the new device?
> > Is it important in your implementation? Does it work with the existing
> > rx_burst API?
>
> Yes. It will work with the existing rx_burst API.
>
> > > > Most users might use the PF in the group as the anchor port for rx burst;
> > > > the current definition should be easy for them to migrate to.
> > > > But some users might prefer grouping some hot
> > > > plug/unplugged representors; EAL could provide wrappers, or users could
> > > > do that themselves if the strategy is not too complex.
> > > Anyway, welcome any suggestions.
> > >
> > > > > > > > > >  /**
> > > > > > > > > >   * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >   * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > >
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
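The semantics the patch describes - member ports of a share group feed one queue, polling any member's queue returns packets from all members, and the source is recorded in mbuf->port - can be illustrated with a minimal SW model. This is only a sketch of the behavior, not the PMD or HW implementation; all names here (`SharedRxq`, `deliver`, `rx_burst`) are hypothetical:

```python
# Illustrative SW model of the shared Rx queue semantics (not DPDK code).
from collections import deque
from dataclasses import dataclass

@dataclass
class Mbuf:
    port: int      # source port id, playing the role of mbuf->port
    data: bytes

class SharedRxq:
    """One receive queue shared by every port in a share group."""
    def __init__(self):
        self.ring = deque()

    def deliver(self, port_id, data):
        # HW would steer by switch domain; here we only tag the source port.
        self.ring.append(Mbuf(port=port_id, data=data))

    def rx_burst(self, max_pkts):
        # Any member port may poll; packets of all members come out here.
        pkts = []
        while self.ring and len(pkts) < max_pkts:
            pkts.append(self.ring.popleft())
        return pkts

# A group with three member ports, e.g. a PF and two representors.
group = SharedRxq()
group.deliver(0, b"from-pf")
group.deliver(5, b"from-rep5")
group.deliver(7, b"from-rep7")

burst = group.rx_burst(32)
print([m.port for m in burst])  # [0, 5, 7] - sources identified per mbuf
```

One poll on the shared queue drains traffic of all member ports, which is the cache and memory argument made in the thread: one descriptor fill and one polling loop instead of one per representor.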