On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
>>> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>>>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>>>> Hi Xueming,
>>>>>
>>>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>>>> In the current DPDK framework, all Rx queues are pre-loaded with mbufs
>>>>>> for incoming packets. When the number of representors scales out in a
>>>>>> switch domain, the memory consumption becomes significant. Furthermore,
>>>>>> polling all ports leads to high cache miss rates, high latency and low
>>>>>> throughput.
>>>>>>  
>>>>>> This patch introduces the shared Rx queue. A PF and representors with
>>>>>> the same configuration in the same switch domain can share an Rx queue
>>>>>> set by specifying the shared Rx queue offload flag and a sharing group.
>>>>>>
>>>>>> All ports sharing an Rx queue actually share one HW Rx queue, so mbufs
>>>>>> are pre-loaded into only one queue and memory is saved.
>>>>>>
>>>>>> Polling any queue that uses the same shared Rx queue receives packets
>>>>>> from all member ports. The source port is identified by mbuf->port.
>>>>>>
>>>>>> Multiple groups are supported via a group ID. The number of queues per
>>>>>> port in a shared group should be identical, and queue indexes are mapped
>>>>>> 1:1 within the group. An example of polling two share groups:
>>>>>>   core   group   queue
>>>>>>   0      0       0
>>>>>>   1      0       1
>>>>>>   2      0       2
>>>>>>   3      0       3
>>>>>>   4      1       0
>>>>>>   5      1       1
>>>>>>   6      1       2
>>>>>>   7      1       3
>>>>>>
>>>>>> A shared Rx queue must be polled on a single thread or core. If both PF0
>>>>>> and representor0 join the same share group, pf0rxq0 cannot be polled on
>>>>>> core1 while rep0rxq0 is polled on core2. In fact, polling one port within
>>>>>> the share group is sufficient, since polling any port in the group
>>>>>> returns packets for every port in the group.
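>>>>>>
>>>>>> As an illustration (not part of the patch), a minimal polling loop on
>>>>>> one core could look as follows. Only rte_eth_rx_burst() and mbuf->port
>>>>>> are existing API; pf_port_id, queue_id and the handler are placeholders:
>>>>>>
>>>>>>   #include <rte_common.h>
>>>>>>   #include <rte_ethdev.h>
>>>>>>   #include <rte_mbuf.h>
>>>>>>
>>>>>>   /* Placeholder handler: a real app would dispatch per source port. */
>>>>>>   static void
>>>>>>   handle_packet(uint16_t src_port, struct rte_mbuf *m)
>>>>>>   {
>>>>>>           RTE_SET_USED(src_port);
>>>>>>           rte_pktmbuf_free(m);
>>>>>>   }
>>>>>>
>>>>>>   /* Poll one queue of any member port in the share group; packets from
>>>>>>    * all member ports come back tagged with their source in mbuf->port. */
>>>>>>   static void
>>>>>>   poll_shared_rxq(uint16_t pf_port_id, uint16_t queue_id)
>>>>>>   {
>>>>>>           struct rte_mbuf *pkts[32];
>>>>>>           uint16_t nb, i;
>>>>>>
>>>>>>           nb = rte_eth_rx_burst(pf_port_id, queue_id, pkts, 32);
>>>>>>           for (i = 0; i < nb; i++)
>>>>>>                   handle_packet(pkts[i]->port, pkts[i]);
>>>>>>   }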
>>>>>
>>>>> I apologize for jumping into the review process this late.
>>>>
>>>> Appreciate the bold suggestion, never too late :)
>>>>
>>>>>
>>>>> Frankly speaking, I doubt that this is the best design to solve
>>>>> the problem. Yes, I confirm that the problem exists, but I
>>>>> think there is a better and simpler way to solve it.
>>>>>
>>>>> The problem with the suggested solution is that it puts all
>>>>> the headache about consistency onto the application and PMDs
>>>>> without any help from the ethdev layer to guarantee that
>>>>> consistency. As a result, I believe there will be either
>>>>> missing/lost consistency checks or huge duplication in
>>>>> each PMD which supports the feature. Shared RxQs must be
>>>>> configured equally, including the number of queues, offloads
>>>>> (taking device-level Rx offloads into account), RSS
>>>>> settings etc. So applications must take care of it and
>>>>> PMDs (or the ethdev layer) must check it.
>>>>
>>>> The name might be confusing; here is my understanding:
>>>> 1. The NIC shares the buffer-supply HW queue between shared RxQs - for memory.
>>>> 2. The PMD polls one shared RxQ - for latency and performance.
>>>> 3. Most per-queue features like offloads and RSS are not impacted. That's
>>>> why they are not mentioned. Some offloads might not be supported due to
>>>> PMD or HW limitations; checks need to be added in each PMD case by case.
>>>> 4. Multiple groups are defined for service-level flexibility. For
>>>> example, the PF's and VIP customers' load is distributed via queues and
>>>> dedicated cores, while low-priority customers share one core with one
>>>> shared queue. Multiple groups enable more combinations.
>>>> 5. One port could assign queues to different groups for polling
>>>> flexibility. For example, the first 4 queues go to group 0 and the next 4
>>>> queues to group 1; each group has other member ports with 4 queues, so the
>>>> port with 8 queues can be polled with 8 cores without the non-shared RxQ
>>>> penalty - in other words, each core polls only one shared RxQ (see the
>>>> sketch after this list).
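>>>>
>>>> A rough setup-time sketch of point 5, only to illustrate the idea: the
>>>> per-queue group field (called shared_group here) is whatever this series
>>>> ends up defining, so this will not build against an unmodified ethdev.
>>>>
>>>>   #include <rte_ethdev.h>
>>>>
>>>>   /* Illustration only: put the first 4 queues of a port into share group 0
>>>>    * and the next 4 into group 1, so 8 cores can each poll one shared RxQ. */
>>>>   static int
>>>>   setup_two_share_groups(uint16_t port_id, struct rte_mempool *mp)
>>>>   {
>>>>           struct rte_eth_dev_info info;
>>>>           struct rte_eth_rxconf rxconf;
>>>>           uint16_t q;
>>>>           int ret;
>>>>
>>>>           ret = rte_eth_dev_info_get(port_id, &info);
>>>>           if (ret != 0)
>>>>                   return ret;
>>>>           rxconf = info.default_rxconf;
>>>>           for (q = 0; q < 8; q++) {
>>>>                   rxconf.shared_group = q < 4 ? 0 : 1; /* hypothetical field */
>>>>                   ret = rte_eth_rx_queue_setup(port_id, q, 512,
>>>>                                   rte_eth_dev_socket_id(port_id),
>>>>                                   &rxconf, mp);
>>>>                   if (ret != 0)
>>>>                           return ret;
>>>>           }
>>>>           return 0;
>>>>   }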
>>>>
>>>>>
>>>>> The advantage of the solution is that any device may
>>>>> create a group and subsequent devices may join it. The absence
>>>>> of a primary device is nice. But do we really need it?
>>>>> Will the design work if some representors are configured
>>>>> to use a shared RxQ, but some are not? Theoretically it
>>>>> is possible, but it could require extra non-trivial code
>>>>> on the fast path.
>>>>
>>>> With multiple groups, any device could be hot-unplugged.
>>>>
>>>> A mixed configuration is supported; the only difference is how mbuf->port
>>>> is set. Since the group is per queue, a mixed configuration is better
>>>> supported, and I don't see any difficulty here.
>>>>
>>>> A PMD could choose to support only group 0 with the same settings for each
>>>> RxQ; that fits most scenarios.
>>>>
>>>>>
>>>>> Also, looking at the first two patches I don't understand
>>>>> how the application will find out which devices may share
>>>>> RxQs. E.g. if we have two different NICs which support
>>>>> sharing, we can try to set up only one group 0, but
>>>>> we will finally have two devices (not one) which must be
>>>>> polled.
>>>>>
>>>>> 1. We need extra flag in dev_info->dev_capa
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>>>>>    the device supports Rx sharing.
>>>>
>>>> dev_info->rx_queue_offload_capa could be used here, no?
>>
>> It depends. But we definitely need a flag which
>> says that the rx_domain below makes sense. It could be
>> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
>> capability.
>>
>> The question is whether it is really an offload. An offload is
>> when something can be done by HW/FW and the result is provided
>> to SW. Maybe it is just nitpicking...
>>
>> Maybe we don't need an offload at all. Just have
>> RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
>> as a flag that an RxQ should be shared (zero - default,
>> no sharing). The ethdev layer may check consistency on
>> its side to ensure that the device capability is
>> reported if a non-zero group is specified on queue setup.
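>>
>> Such a check could be a small helper in the ethdev layer. Sketch only:
>> RTE_ETH_DEV_CAPA_RXQ_SHARE is the flag proposed above and the share_group
>> argument stands in for the per-queue group field.
>>
>>   #include <errno.h>
>>   #include <rte_ethdev.h>
>>
>>   /* Sketch: reject a non-zero share group on queue setup if the device
>>    * does not advertise the proposed RTE_ETH_DEV_CAPA_RXQ_SHARE capability. */
>>   static int
>>   validate_rxq_share(uint16_t port_id, uint32_t share_group)
>>   {
>>           struct rte_eth_dev_info dev_info;
>>           int ret;
>>
>>           if (share_group == 0)
>>                   return 0; /* zero means no sharing, nothing to check */
>>           ret = rte_eth_dev_info_get(port_id, &dev_info);
>>           if (ret != 0)
>>                   return ret;
>>           if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
>>                   return -ENOTSUP; /* capability not reported */
>>           return 0;
>>   }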
>>
>>>>
>>>>>
>>>>> 2. I think we need "rx_domain" in device info
>>>>>    (which should be treated within the boundaries of
>>>>>    the switch_domain) if and only if
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>>>    Otherwise the rx_domain value does not make sense.
>>>>
>>>> I see, this will give flexibility across different HW; I will add it.
>>>>
>>>>>
>>>>> (1) and (2) will allow the application to find out which
>>>>> devices can share Rx.
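>>>>>
>>>>> For example, the application could walk the ports and group the
>>>>> candidates by (switch domain, rx_domain). Both rx_domain in dev_info
>>>>> and the capability flag are the proposal above, not existing API:
>>>>>
>>>>>   #include <stdio.h>
>>>>>   #include <rte_ethdev.h>
>>>>>
>>>>>   /* Sketch: list ports that could share Rx queues. Ports reporting the
>>>>>    * proposed capability and equal switch domain / rx_domain qualify. */
>>>>>   static void
>>>>>   dump_rx_share_candidates(void)
>>>>>   {
>>>>>           struct rte_eth_dev_info info;
>>>>>           uint16_t port_id;
>>>>>
>>>>>           RTE_ETH_FOREACH_DEV(port_id) {
>>>>>                   if (rte_eth_dev_info_get(port_id, &info) != 0)
>>>>>                           continue;
>>>>>                   if ((info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
>>>>>                           continue;
>>>>>                   printf("port %u: switch domain %u, rx domain %u\n",
>>>>>                          port_id, info.switch_info.domain_id,
>>>>>                          info.rx_domain); /* proposed dev_info field */
>>>>>           }
>>>>>   }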
>>>>>
>>>>> 3. The primary device (the representors' backing device) should
>>>>>    advertise the shared RxQ offload. Enabling the offload
>>>>>    tells the device to provide packets to all devices in
>>>>>    the Rx domain with mbuf->port filled in appropriately.
>>>>>    It also allows the app to identify the primary device in the
>>>>>    Rx domain. When the application enables the offload, it
>>>>>    must ensure that it does not treat the used port_id as an
>>>>>    input port_id, but always checks mbuf->port for each
>>>>>    packet.
>>>>>
>>>>> 4. A new Rx mode should be introduced for secondary
>>>>>    devices. It should not allow configuring RSS, specifying
>>>>>    any Rx offloads etc., and ethdev must ensure it.
>>>>>    It is an open question right now whether it should require
>>>>>    providing the primary port_id. In theory representors
>>>>>    have it. However, maybe it is nice for consistency
>>>>>    to ensure that the application knows that it does.
>>>>>    If shared Rx mode is specified for a device, the application
>>>>>    does not need to set up RxQs, and attempts to do so
>>>>>    should be rejected in ethdev.
>>>>>    For consistency it is better to ensure that the numbers of
>>>>>    queues match.
>>>>
>>>> RSS and Rx offloads should be supported individually per queue; the PMD
>>>> needs to check and reject what it cannot support.
>>
>> Thinking a bit more about it, I agree that RSS settings could
>> be individual. Offloads could be individual as well, but I'm
>> not sure about all offloads. E.g. Rx scatter, which is related
>> to the Rx buffer size (shared since the Rx mempool is shared)
>> vs MTU. Maybe it is acceptable. We just must define rules for
>> what should happen if offloads contradict each other.
>> It should be highlighted in the description, including the
>> driver callback, to make clear that PMD maintainers are responsible
>> for consistency checks.
>>
>>>>
>>>>>    It is an interesting question what should happen if
>>>>>    the primary device is reconfigured and shared Rx is
>>>>>    disabled on reconfiguration.
>>>>
>>>> I would prefer no primary port/queue assumption in the configuration: all
>>>> members are treated equally and each queue can join or leave a share
>>>> group. That's important for supporting multiple groups.
>>
>> I agree. The problem with many flexible solutions is the
>> complexity of supporting them. We'll see how it goes.
>>
>>>>
>>>>>
>>>>> 5. If so, in theory the implementation of the Rx burst
>>>>>    in the secondary could simply call Rx burst on the
>>>>>    primary device.
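>>>>>
>>>>>    Roughly like this inside the secondary PMD; the struct and its
>>>>>    back-references to the primary queue are made up for illustration:
>>>>>
>>>>>      #include <rte_ethdev.h>
>>>>>      #include <rte_mbuf.h>
>>>>>
>>>>>      /* Hypothetical PMD-internal Rx queue of a secondary device,
>>>>>       * keeping a reference to the primary queue it shares. */
>>>>>      struct secondary_rxq {
>>>>>              uint16_t primary_port_id;
>>>>>              uint16_t primary_queue_id;
>>>>>      };
>>>>>
>>>>>      /* Sketch: the secondary's rx_burst simply drains the primary's
>>>>>       * queue; mbuf->port already identifies each packet's member port. */
>>>>>      static uint16_t
>>>>>      secondary_rx_burst(void *rx_queue, struct rte_mbuf **pkts,
>>>>>                      uint16_t nb_pkts)
>>>>>      {
>>>>>              struct secondary_rxq *rxq = rx_queue;
>>>>>
>>>>>              return rte_eth_rx_burst(rxq->primary_port_id,
>>>>>                              rxq->primary_queue_id, pkts, nb_pkts);
>>>>>      }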
>>>>>
>>>>> Andrew.
>>>>
>>>
>>> Hi Andrew,
>>>
>>> I realized that we are talking about different things; this feature
>>> introduces two kinds of RxQ sharing:
>>> 1. Sharing the mempool to save memory
>>> 2. Sharing the polling to save latency
>>>
>>> What you suggested is reusing the whole RxQ configuration, IIUC. Maybe we
>>> should break the flag into three, so the application could learn the PMD
>>> capability and configure accordingly. What do you think?
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
>>
>> Not sure that I understand. Just specify the same mempool
>> on Rx queue setup. Isn't it sufficient?
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
>>
>> It implies pool sharing if I'm not mistaken. Of course,
>> we can poll many different HW queues in one poll cycle, but it
>> hardly makes sense to care specially about that.
>> IMHO RxQ sharing is sharing of the underlying HW Rx queue.
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
>>
>> It is hardly a feature. Rather a possible limitation.
> 
> Thanks, then I'd drop this suggestion.
> 
> Here is the TODO list, let me know if anything is missing:
> 1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE

RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
entire Rx, but just of some queues.

> 2. RxQ share group check in ethdev
> 3. add rx_domain into device info
> 
>>
>> Andrew.
> 
