On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > > > In the current DPDK framework, all Rx queues are pre-loaded with
> > > > > > > mbufs for incoming packets. When the number of representors in a
> > > > > > > switch domain scales out, the memory consumption becomes
> > > > > > > significant. Furthermore, polling all ports leads to high cache
> > > > > > > miss rates, high latency and low throughput.
> > > > > > >  
> > > > > > > This patch introduces the shared Rx queue. PF and representors
> > > > > > > with the same configuration in the same switch domain can share
> > > > > > > an Rx queue set by specifying the shared Rx queue offload flag
> > > > > > > and a sharing group.
> > > > > > > 
> > > > > > > All ports sharing an Rx queue actually share one Rx queue, and
> > > > > > > mbufs are pre-loaded into that single queue only, so memory is
> > > > > > > saved.
> > > > > > > 
> > > > > > > Polling any queue that uses the same shared Rx queue receives
> > > > > > > packets from all member ports. The source port is identified by
> > > > > > > mbuf->port.
> > > > > > > 
> > > > > > > Multiple groups are supported via a group ID. The number of
> > > > > > > queues per port in a shared group should be identical, and queue
> > > > > > > indexes are mapped 1:1 within a shared group.
> > > > > > > An example of polling two share groups:
> > > > > > >   core    group   queue
> > > > > > >   0       0       0
> > > > > > >   1       0       1
> > > > > > >   2       0       2
> > > > > > >   3       0       3
> > > > > > >   4       1       0
> > > > > > >   5       1       1
> > > > > > >   6       1       2
> > > > > > >   7       1       3
> > > > > > > 
> > > > > > > A shared Rx queue must be polled on a single thread or core. If
> > > > > > > both PF0 and representor0 have joined the same share group,
> > > > > > > pf0rxq0 cannot be polled on core1 while rep0rxq0 is polled on
> > > > > > > core2. Actually, polling one port within a share group is
> > > > > > > sufficient, since polling any port in the group returns packets
> > > > > > > for every port in the group.
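> > > > > > > 
> > > > > > > A minimal setup sketch based on the description above; the flag
> > > > > > > name RTE_ETH_RX_OFFLOAD_SHARED_RXQ and the shared_group field in
> > > > > > > rte_eth_rxconf are assumed names from this series, and the other
> > > > > > > variables are illustrative:
> > > > > > > 
> > > > > > >   /* dev_info was filled earlier via rte_eth_dev_info_get(). */
> > > > > > >   struct rte_eth_rxconf rxq_conf = dev_info.default_rxconf;
> > > > > > > 
> > > > > > >   /* Request the shared Rx queue offload and join share group 0. */
> > > > > > >   rxq_conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
> > > > > > >   rxq_conf.shared_group = 0;
> > > > > > > 
> > > > > > >   /* PF and representor set up queue 0 with identical parameters
> > > > > > >    * and the same mempool; only one HW queue is loaded with mbufs.
> > > > > > >    */
> > > > > > >   rte_eth_rx_queue_setup(pf_port_id, 0, nb_rxd, socket_id,
> > > > > > >                          &rxq_conf, mbuf_pool);
> > > > > > >   rte_eth_rx_queue_setup(rep_port_id, 0, nb_rxd, socket_id,
> > > > > > >                          &rxq_conf, mbuf_pool);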
> > > > > > 
> > > > > > I apologize for jumping into the review process this late.
> > > > > 
> > > > > Appreciate the bold suggestion, never too late :)
> > > > > 
> > > > > > 
> > > > > > Frankly speaking, I doubt that this is the best design to solve
> > > > > > the problem. Yes, I confirm that the problem exists, but I
> > > > > > think there is a better and simpler way to solve it.
> > > > > > 
> > > > > > The problem with the suggested solution is that it puts all
> > > > > > the headache about consistency on the application and PMDs
> > > > > > without any help from the ethdev layer to guarantee that
> > > > > > consistency. As a result, I believe there will be either
> > > > > > missing/lost consistency checks or huge duplication in
> > > > > > each PMD which supports the feature. Shared RxQs must be
> > > > > > configured equally, including the number of queues, offloads
> > > > > > (taking device-level Rx offloads into account), RSS
> > > > > > settings, etc. So applications must take care of it and
> > > > > > PMDs (or the ethdev layer) must check it.
> > > > > 
> > > > > The name might be confusing; here is my understanding:
> > > > > 1. The NIC shares the buffer-supply HW queue between shared RxQs -
> > > > > for memory.
> > > > > 2. The PMD polls one shared RxQ - for latency and performance.
> > > > > 3. Most per-queue features like offloads and RSS are not impacted;
> > > > > that's why they are not mentioned. Some offloads might not be
> > > > > supported due to PMD or HW limitations and need a check in the PMD
> > > > > case by case.
> > > > > 4. Multiple groups are defined for service-level flexibility. For
> > > > > example, the load of the PF and VIP customers is distributed via
> > > > > dedicated queues and cores, while low-priority customers share one
> > > > > core with one shared queue. Multiple groups enable more combinations.
> > > > > 5. One port could assign its queues to different groups for polling
> > > > > flexibility. For example, the first 4 queues go to group 0 and the
> > > > > next 4 queues to group 1; each group has other member ports with 4
> > > > > queues, so the port with 8 queues can be polled with 8 cores without
> > > > > the non-shared RxQ penalty. In other words, each core polls only one
> > > > > shared RxQ (see the sketch below).
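> > > > > 
> > > > > A rough sketch of that per-core polling model (process_pkt() and the
> > > > > port/queue variables are illustrative, not part of the patch):
> > > > > 
> > > > >   struct rte_mbuf *pkts[32];
> > > > >   uint16_t i, nb;
> > > > > 
> > > > >   /* Poll a single shared RxQ; packets from all member ports of the
> > > > >    * share group arrive here.
> > > > >    */
> > > > >   nb = rte_eth_rx_burst(member_port_id, shared_queue_id, pkts, 32);
> > > > >   for (i = 0; i < nb; i++) {
> > > > >           /* The real source port is mbuf->port, not the polled port. */
> > > > >           process_pkt(pkts[i]->port, pkts[i]);
> > > > >   }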
> > > > > 
> > > > > > 
> > > > > > The advantage of the solution is that any device may
> > > > > > create a group and subsequent devices may join it. The
> > > > > > absence of a primary device is nice. But do we really need it?
> > > > > > Will the design work if some representors are configured
> > > > > > to use a shared RxQ, but some are not? Theoretically it
> > > > > > is possible, but it could require extra non-trivial code
> > > > > > on the fast path.
> > > > > 
> > > > > With multiple groups, any device can be hot-unplugged.
> > > > > 
> > > > > Mixed configuration is supported; the only difference is how
> > > > > mbuf->port is set. Since the group is per queue, mixed configuration
> > > > > is better to support, and I don't see any difficulty here.
> > > > > 
> > > > > A PMD could choose to support only group 0 with the same settings for
> > > > > each RxQ; that fits most scenarios.
> > > > > 
> > > > > > 
> > > > > > Also, looking at the first two patches I don't understand
> > > > > > how an application will find out which devices may share
> > > > > > RxQs. E.g. if we have two different NICs which support
> > > > > > sharing, we can try to set up only one group 0, but we
> > > > > > will finally have two devices (not one) which must be
> > > > > > polled.
> > > > > > 
> > > > > > 1. We need an extra flag in dev_info->dev_capa,
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
> > > > > >    the device supports Rx sharing.
> > > > > 
> > > > > dev_info->rx_queue_offload_capa could be used here, no?
> > > 
> > > It depends. But we definitely need a flag which
> > > says that the rx_domain below makes sense. It could be
> > > either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> > > capability.
> > > 
> > > The question is whether it is really an offload. An offload is
> > > when something can be done by HW/FW and the result is provided
> > > to SW. Maybe it is just nit-picking...
> > > 
> > > Maybe we don't need an offload at all. Just have
> > > RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
> > > as a flag that an RxQ should be shared (zero - default,
> > > no sharing). The ethdev layer may check consistency at
> > > its level to ensure that the device capability is
> > > reported if a non-zero group is specified on queue setup.
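> > > 
> > > Something like this in rte_eth_rx_queue_setup() would be enough
> > > (just a sketch, assuming a share_group field in rte_eth_rxconf):
> > > 
> > >   if (rx_conf != NULL && rx_conf->share_group > 0 &&
> > >       (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> > >           RTE_ETHDEV_LOG(ERR,
> > >                   "port %u queue %u: share group %u requested, but the "
> > >                   "device does not report the RxQ share capability\n",
> > >                   port_id, rx_queue_id, rx_conf->share_group);
> > >           return -EINVAL;
> > >   }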
> > > 
> > > > > 
> > > > > > 
> > > > > > 2. I think we need "rx_domain" in device info
> > > > > >    (which should be treated within the boundaries of
> > > > > >    the switch_domain) if and only if
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > > > >    Otherwise the rx_domain value does not make sense.
> > > > > 
> > > > > I see, this gives flexibility for different HW; I will add it.
> > > > > 
> > > > > > 
> > > > > > (1) and (2) will allow the application to find out which
> > > > > > devices can share Rx.
> > > > > > 
> > > > > > 3. The primary device (the representors' backing device) should
> > > > > >    advertise the shared RxQ offload. Enabling the offload
> > > > > >    tells the device to provide packets to all devices in
> > > > > >    the Rx domain with mbuf->port filled in appropriately.
> > > > > >    It also allows the app to identify the primary device in
> > > > > >    the Rx domain. When the application enables the offload, it
> > > > > >    must ensure that it does not treat the used port_id as the
> > > > > >    input port_id, but always checks mbuf->port for each
> > > > > >    packet.
> > > > > > 
> > > > > > 4. A new Rx mode should be introduced for secondary
> > > > > >    devices. It should not allow configuring RSS, specifying
> > > > > >    any Rx offloads, etc. ethdev must ensure it.
> > > > > >    It is an open question right now whether it should require
> > > > > >    providing the primary port_id. In theory representors
> > > > > >    have it. However, maybe it is nice for consistency
> > > > > >    to ensure that the application knows that it does.
> > > > > >    If shared Rx mode is specified for a device, the application
> > > > > >    does not need to set up RxQs, and attempts to do so
> > > > > >    should be rejected in ethdev.
> > > > > >    For consistency it is better to ensure that the number of
> > > > > >    queues matches.
> > > > > 
> > > > > RSS and Rx offloads should be supported individually; the PMD needs
> > > > > to check and report an error if something is not supported.
> > > 
> > > Thinking a bit more about it, I agree that RSS settings could
> > > be individual. Offloads could be individual as well, but I'm
> > > not sure about all offloads. E.g. Rx scatter, which is related
> > > to the Rx buffer size (shared, since the Rx mempool is shared),
> > > vs. MTU. Maybe it is acceptable. We just must define rules for
> > > what should happen if offloads contradict each other.
> > > It should be highlighted in the description, including the
> > > driver callback, to make clear that PMD maintainers are responsible
> > > for consistency checks.
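> > > 
> > > E.g. something along these lines in the PMD Rx queue setup callback
> > > (the share-group bookkeeping struct and DRV_LOG are purely
> > > illustrative):
> > > 
> > >   /* A queue joining an existing share group must not contradict
> > >    * the configuration of the queues already in the group.
> > >    */
> > >   if (group->nb_members > 0 &&
> > >       (group->offloads != rx_conf->offloads ||
> > >        group->mb_pool != mb_pool ||
> > >        group->nb_desc != nb_rx_desc)) {
> > >           DRV_LOG(ERR, "Rx queue config does not match share group %u",
> > >                   rx_conf->share_group);
> > >           return -EINVAL;
> > >   }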
> > > 
> > > > > 
> > > > > >    It is an interesting question what should happen if
> > > > > >    the primary device is reconfigured and shared Rx is
> > > > > >    disabled on reconfiguration.
> > > > > 
> > > > > I feel it is better to have no primary port/queue assumption in the
> > > > > configuration: all members are treated equally, and each queue can
> > > > > join or quit a share group. That's important to support multiple
> > > > > groups.
> > > 
> > > I agree. The problem with many flexible solutions is the
> > > complexity of supporting them. We'll see how it goes.
> > > 
> > > > > 
> > > > > > 
> > > > > > 5. If so, in theory the implementation of the Rx burst
> > > > > >    in the secondary could simply call Rx burst on the
> > > > > >    primary device.
> > > > > > 
> > > > > > Andrew.
> > > > > 
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > I realized that we are talking about different things. This feature
> > > > introduces two kinds of RxQ sharing:
> > > > 1. Sharing the mempool, to save memory
> > > > 2. Sharing the polling, to save latency
> > > > 
> > > > What you suggested is reusing all the RxQ configuration, IIUC. Maybe we
> > > > should break the flag into three, so the application could learn the
> > > > PMD capability and configure accordingly. What do you think?
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> > > 
> > > Not sure that I understand. Just specify the same mempool
> > > on Rx queue setup. Isn't it sufficient?
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> > > 
> > > It implies pool sharing if I'm not mistaken. Of course,
> > > we can poll many different HW queues in one poll, but it
> > > hardly makes sense to care specially about it.
> > > IMHO RxQ sharing is sharing of the underlying HW Rx queue.
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> > > 
> > > It is hardly a feature. Rather a possible limitation.
> > 
> > Thanks, I'd drop this suggestion then.
> > 
> > Here is the TODO list, let me know if anything is missing:
> > 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
> entire Rx, but just of some queues.

OK

> 
> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info

Seems rte_eth_switch_info is a better place for rx_domain.
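
A rough sketch of how the application could then find shareable ports,
assuming rx_domain ends up in struct rte_eth_switch_info next to domain_id
and the capability bit is RTE_ETH_DEV_CAPA_RXQ_SHARE
(add_to_share_candidates() is just an illustrative application helper):

  struct rte_eth_dev_info info;

  if (rte_eth_dev_info_get(port_id, &info) == 0 &&
      (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0) {
          /* Ports with the same switch domain_id and rx_domain can
           * share Rx queues and be polled together on one core.
           */
          add_to_share_candidates(info.switch_info.domain_id,
                                  info.switch_info.rx_domain,
                                  port_id);
  }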

> > 
> > > 
> > > Andrew.
> > 
> 
