On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > > > In the current DPDK framework, all Rx queues are pre-loaded with
> > > > > > > mbufs for incoming packets. When the number of representors in a
> > > > > > > switch domain scales out, the memory consumption becomes
> > > > > > > significant. Furthermore, polling all ports leads to high cache
> > > > > > > miss rates, high latency and low throughput.
> > > > > > > 
> > > > > > > This patch introduces the shared Rx queue. A PF and representors
> > > > > > > with the same configuration in the same switch domain can share
> > > > > > > an Rx queue set by specifying the shared Rx queue offload flag
> > > > > > > and a sharing group.
> > > > > > > 
> > > > > > > All ports with a shared Rx queue actually share one Rx queue, and
> > > > > > > only that one Rx queue is pre-loaded with mbufs, so memory is
> > > > > > > saved.
> > > > > > > 
> > > > > > > Polling any queue that uses the same shared Rx queue receives
> > > > > > > packets from all member ports. The source port is identified by
> > > > > > > mbuf->port.
> > > > > > > 
> > > > > > > Multiple groups are supported via a group ID. The number of
> > > > > > > queues per port in a shared group should be identical, and queue
> > > > > > > indexes are mapped 1:1 within a shared group. An example of
> > > > > > > polling two share groups:
> > > > > > > 
> > > > > > >   core  group  queue
> > > > > > >   0     0      0
> > > > > > >   1     0      1
> > > > > > >   2     0      2
> > > > > > >   3     0      3
> > > > > > >   4     1      0
> > > > > > >   5     1      1
> > > > > > >   6     1      2
> > > > > > >   7     1      3
> > > > > > > 
> > > > > > > A shared Rx queue must be polled on a single thread or core. If
> > > > > > > both PF0 and representor0 joined the same share group, pf0rxq0
> > > > > > > cannot be polled on core1 while rep0rxq0 is polled on core2.
> > > > > > > Actually, polling one port within a share group is sufficient,
> > > > > > > since polling any port in the group returns packets for every
> > > > > > > port in the group.
> > > > > > 
> > > > > > I apologize that I jump into the review process this late.
> > > > > 
> > > > > Appreciate the bold suggestion, never too late :)
> > > > > 
> > > > > > Frankly speaking, I doubt that it is the best design to solve
> > > > > > the problem. Yes, I confirm that the problem exists, but I
> > > > > > think there is a better and simpler way to solve it.
> > > > > > 
> > > > > > The problem of the suggested solution is that it puts all
> > > > > > the headache about consistency on the application and PMDs
> > > > > > without any help from the ethdev layer to guarantee that
> > > > > > consistency. As a result, I believe there will be either
> > > > > > missing/lost consistency checks or huge duplication in
> > > > > > each PMD which supports the feature. Shared RxQs must be
> > > > > > equally configured, including number of queues, offloads
> > > > > > (taking device-level Rx offloads into account), RSS
> > > > > > settings etc. So applications must care about it and
> > > > > > PMDs (or the ethdev layer) must check it.
> > > > > 
> > > > > The name might be confusing; here is my understanding:
> > > > > 1. The NIC shares the buffer supply HW queue between shared RxQs -
> > > > >    for memory.
> > > > > 2. The PMD polls one shared RxQ - for latency and performance.
> > > > > 3. Most per-queue features like offloads and RSS are not impacted;
> > > > >    that's why they are not mentioned. Some offloads might not be
> > > > >    supported due to PMD or HW limitations; checks need to be added
> > > > >    in each PMD case by case.
> > > > > 4. Multiple groups are defined for service-level flexibility. For
> > > > >    example, the PF's and VIP customers' load is distributed via
> > > > >    dedicated queues and cores, while low-priority customers share
> > > > >    one core with one shared queue. Multiple groups enable more
> > > > >    combinations.
> > > > > 5. One port could assign queues to different groups for polling
> > > > >    flexibility. For example, the first 4 queues go to group 0 and
> > > > >    the next 4 queues to group 1; each group has other member ports
> > > > >    with 4 queues, so the port with 8 queues could be polled by 8
> > > > >    cores without the non-shared RxQ penalty. In other words, each
> > > > >    core polls only one shared RxQ.
> > > > > 
> > > > > > The advantage of the solution is that any device may
> > > > > > create a group and subsequent devices join. Absence of
> > > > > > a primary device is nice. But do we really need it?
> > > > > > Will the design work if some representors are configured
> > > > > > to use shared RxQ, but some do not? Theoretically it
> > > > > > is possible, but it could require extra non-trivial code
> > > > > > on the fast path.
> > > > > 
> > > > > With multiple groups, any device could be hot-unplugged.
> > > > > 
> > > > > Mixed configuration is supported; the only difference is how to set
> > > > > mbuf->port. Since the group is per queue, mixed configuration is
> > > > > better to be supported, and I didn't see any difficulty here.
> > > > > 
> > > > > A PMD could choose to support only group 0 with the same settings
> > > > > for each RxQ; that fits most scenarios.
> > > > > 
> > > > > > Also, looking at the first two patches I don't understand
> > > > > > how an application will find out which devices may share
> > > > > > RxQs. E.g. if we have two different NICs which support
> > > > > > sharing, we can try to set up only one group 0, but
> > > > > > finally will have two devices (not one) which must be
> > > > > > polled.
> > > > > > 
> > > > > > 1. We need an extra flag in dev_info->dev_capa,
> > > > > > RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
> > > > > > the device supports Rx sharing.
> > > > > 
> > > > > dev_info->rx_queue_offload_capa could be used here, no?
> > > 
> > > It depends. But we definitely need a flag which
> > > says that the rx_domain below makes sense. It could be
> > > either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> > > capability.
> > > 
> > > The question is whether it is really an offload. An offload is
> > > when something could be done by HW/FW and the result is provided
> > > to SW. Maybe it is just nitpicking...
> > > 
> > > Maybe we don't need an offload at all. Just have
> > > RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
> > > as a flag that an RxQ should be shared (zero - default,
> > > no sharing). The ethdev layer may check consistency on
> > > its layer to ensure that the device capability is
> > > reported if a non-zero group is specified on queue setup.
> > > 
> > > > > > 2. I think we need "rx_domain" in device info
> > > > > > (which should be treated in boundaries of the
> > > > > > switch_domain) if and only if
> > > > > > RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > > > > Otherwise the rx_domain value does not make sense.
> > > > > 
> > > > > I see, this will give flexibility for different HW, will add it.
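
Just to check we are on the same page, below is a rough sketch of the
application-side flow I have in mind with your proposal. The names
RTE_ETH_DEV_CAPA_RXQ_SHARE and rx_domain are the ones from this thread;
the share_group field in rte_eth_rxconf, the rx_domain placement and
the helper names are only illustrative assumptions, nothing final:

#include <errno.h>
#include <stdbool.h>
#include <rte_ethdev.h>

/* Two ports are candidates to share an RxQ only if both report the
 * capability and the same rx_domain within the same switch domain. */
static bool
rxq_can_share(const struct rte_eth_dev_info *a,
	      const struct rte_eth_dev_info *b)
{
	return (a->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0 &&
	       (b->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0 &&
	       a->switch_info.domain_id == b->switch_info.domain_id &&
	       a->rx_domain == b->rx_domain; /* assumed field location */
}

/* Set up queue 'qid' of 'port_id' as a member of 'share_group'
 * (0 would mean no sharing, i.e. the default behaviour). */
static int
shared_rxq_setup(uint16_t port_id, uint16_t qid, uint16_t nb_desc,
		 struct rte_mempool *mp, uint16_t share_group)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	rxconf = dev_info.default_rxconf;
	rxconf.share_group = share_group;	/* assumed new field */

	return rte_eth_rx_queue_setup(port_id, qid, nb_desc,
				      rte_eth_dev_socket_id(port_id),
				      &rxconf, mp);
}

The point being that ethdev has enough information here to reject a
non-zero group on a device that does not report the capability, instead
of duplicating that check in every PMD.
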
> > > > > > (1) and (2) will allow the application to find out which
> > > > > > devices can share Rx.
> > > > > > 
> > > > > > 3. The primary device (the representors' backing device) should
> > > > > > advertise the shared RxQ offload. Enabling the offload
> > > > > > tells the device to provide packets to all devices in
> > > > > > the Rx domain with mbuf->port filled in appropriately.
> > > > > > It also allows the app to identify the primary device in the
> > > > > > Rx domain. When the application enables the offload, it
> > > > > > must ensure that it does not treat the used port_id as an
> > > > > > input port_id, but always checks mbuf->port for each
> > > > > > packet.
> > > > > > 
> > > > > > 4. A new Rx mode should be introduced for secondary
> > > > > > devices. It should not allow configuring RSS, specifying
> > > > > > any Rx offloads etc. ethdev must ensure it.
> > > > > > It is an open question right now whether it should require
> > > > > > providing the primary port_id. In theory representors
> > > > > > have it. However, maybe it is nice for consistency
> > > > > > to ensure that the application knows what it does.
> > > > > > If shared Rx mode is specified for a device, the application
> > > > > > does not need to set up RxQs, and attempts to do it
> > > > > > should be discarded in ethdev.
> > > > > > For consistency it is better to ensure that the number of
> > > > > > queues matches.
> > > > > 
> > > > > RSS and Rx offloads should be supported individually; the PMD
> > > > > needs to check and reject what is not supported.
> > > 
> > > Thinking a bit more about it, I agree that RSS settings could
> > > be individual. Offloads could be individual as well, but I'm
> > > not sure about all offloads. E.g. Rx scatter, which is related
> > > to the Rx buffer size (which is shared since the Rx mempool is
> > > shared) vs MTU. Maybe it is acceptable. We just must define rules
> > > for what should happen if offloads contradict each other.
> > > It should be highlighted in the description, including the
> > > driver callback, to ensure that PMD maintainers are responsible
> > > for consistency checks.
> > > 
> > > > > > It is an interesting question what should happen if the
> > > > > > primary device is reconfigured and shared Rx is
> > > > > > disabled on reconfiguration.
> > > > > 
> > > > > I feel better with no primary port/queue assumption in the
> > > > > configuration. All members are equally treated, and each queue
> > > > > can join or quit a share group; that's important to support
> > > > > multiple groups.
> > > 
> > > I agree. The problem of many flexible solutions is the
> > > complexity to support them. We'll see how it goes.
> > > 
> > > > > > 5. If so, in theory the implementation of the Rx burst
> > > > > > in the secondary could simply call Rx burst on the
> > > > > > primary device.
> > > > > > 
> > > > > > Andrew.
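
For reference, the fast path I expect looks roughly like below,
whichever member port of the group the application actually polls; the
per-port counter array is purely illustrative:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll one shared RxQ: packets of all member ports of the group show up
 * here, and the real source port is taken from mbuf->port rather than
 * from the port_id that is polled. */
static void
poll_shared_rxq(uint16_t any_member_port, uint16_t queue_id,
		uint64_t *pkts_per_port /* illustrative stats array */)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx, i;

	nb_rx = rte_eth_rx_burst(any_member_port, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb_rx; i++) {
		pkts_per_port[pkts[i]->port]++;	/* demux by source port */
		rte_pktmbuf_free(pkts[i]);	/* real processing goes here */
	}
}
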
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > I realized that we are talking about different things. This feature
> > > > introduces two kinds of RxQ sharing:
> > > > 1. Share the mempool, to save memory.
> > > > 2. Share the polling, to save latency.
> > > > 
> > > > What you suggested is to reuse the whole RxQ configuration, IIUC.
> > > > Maybe we should break the flag into 3, so the application could
> > > > learn the PMD capability and configure accordingly. What do you
> > > > think?
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> > > 
> > > Not sure that I understand. Just specify the same mempool
> > > on Rx queue setup. Isn't it sufficient?
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> > > 
> > > It implies pool sharing if I'm not mistaken. Of course,
> > > we can poll many different HW queues in one poll, but it
> > > hardly makes sense to care specially about it.
> > > IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG  // implies POOL and POLL
> > > 
> > > It is hardly a feature. Rather a possible limitation.
> > 
> > Thanks, then I'll drop this suggestion.
> > 
> > Here is the TODO list, let me know if anything is missing:
> > 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
> entire Rx, but just some queues.
OK

> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info

Seems rte_eth_switch_info is a better place for rx_domain.

> > > 
> > > Andrew.
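
To make sure we agree on item 2, here is roughly the kind of check I
plan to add at the ethdev layer. Field names are still open; share_group
in rte_eth_rxconf and rx_domain in rte_eth_switch_info are only working
assumptions, not final API:

#include <errno.h>
#include <rte_ethdev.h>

/* Sketch of the validation rte_eth_rx_queue_setup() could do before
 * calling into the PMD, assuming the proposed additions:
 *   - dev_capa bit RTE_ETH_DEV_CAPA_RXQ_SHARE
 *   - uint16_t share_group in struct rte_eth_rxconf (0 = no sharing)
 *   - uint16_t rx_domain in struct rte_eth_switch_info */
static int
eth_dev_validate_rxq_share(const struct rte_eth_dev_info *dev_info,
			   const struct rte_eth_rxconf *rx_conf)
{
	if (rx_conf == NULL || rx_conf->share_group == 0)
		return 0;	/* no sharing requested, nothing to check */

	/* A non-zero share group is only valid on a device that reports
	 * the capability; reject it once in ethdev, not in each PMD. */
	if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return -EINVAL;

	return 0;
}

Further cross-port consistency (e.g. equal queue numbers within a
group) could be tracked in the same place if we decide ethdev should
enforce it.
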