> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.anan...@yandex.ru>
> Sent: Tuesday, May 24, 2022 9:26 AM
> To: Feifei Wang <feifei.wa...@arm.com>
> Cc: nd <n...@arm.com>; dev@dpdk.org; Ruifeng Wang <ruifeng.w...@arm.com>;
> Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
>
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers, resulting in 32*8=256B of
> >>> stores to the lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of loads from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache, i.e. the receive side will free the buffers from
> >>> the transmit side directly into its software ring. This will avoid
> >>> the 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1) The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have a 1:2 (Rx queue: Tx queue) mapping, which is still easy to
> >>> configure. However, if this is generalized to 'N' ports, the
> >>> configuration can be long. Moreover, the PMD would have to scan a
> >>> list of transmit queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >> - We effectively link Rx and Tx queues - when this feature is
> >> enabled, the user can't stop a Tx queue without stopping the linked
> >> Rx queue first.
> >> Right now the user is free to start/stop any queues at will.
> >> If this feature allows linking queues from different ports,
> >> then even ports become dependent, and the user will have to take
> >> extra care when managing such ports.
> >
> > [Feifei] When direct rearm is enabled, there are two paths for a
> > thread to choose from. If there are enough Tx freed buffers, Rx can
> > take buffers from Tx.
> > Otherwise, Rx will take buffers from the mempool as usual. Thus,
> > users do not need to pay much attention to managing ports.
>
> What I am talking about: right now different ports, or different queues
> of the same port, can be treated as independent entities:
> in general the user is free to start/stop (and even reconfigure in some
> cases) one entity without needing to stop another entity.
> I.e. the user can stop and re-configure a Tx queue while continuing to
> receive packets from an Rx queue.
> With direct re-arm enabled, I think that wouldn't be possible any more:
> before stopping/reconfiguring a Tx queue, the user would have to make
> sure that the corresponding Rx queue wasn't being used by the datapath.
>
> >
> >> - very limited usage scenario - it will have a positive effect only
> >> when we have a fixed forwarding mapping: all (or nearly all) packets
> >> from the Rx queue are forwarded into the same Tx queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as a NIC with one port.
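[Editor's note] The two-path refill Feifei describes (take freed buffers
directly from the Tx side when a full burst is available, otherwise fall
back to the mempool) can be sketched as follows. This is a hypothetical,
self-contained illustration only; `fake_tx_ring`, `fake_rx_ring` and
`direct_rearm` are made-up names, not the DPDK or patch API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the direct re-arm idea: the Rx side refills its
 * software ring straight from the Tx side's freed-buffer list, skipping
 * the lcore cache entirely. Names are illustrative, not DPDK API. */

#define REARM_BURST 32

struct fake_tx_ring {
    void *freed[REARM_BURST];   /* mbufs the Tx side has finished with */
    unsigned int nb_freed;
};

struct fake_rx_ring {
    void *sw_ring[REARM_BURST]; /* Rx slots waiting to be refilled */
    unsigned int nb_filled;
};

/* Refill Rx directly from Tx; return the number of buffers moved.
 * When fewer than a full burst is available, a real implementation
 * would fall back to the normal mempool path (not shown here). */
static unsigned int
direct_rearm(struct fake_rx_ring *rxq, struct fake_tx_ring *txq)
{
    if (txq->nb_freed < REARM_BURST)
        return 0; /* caller falls back to the mempool path */

    memcpy(rxq->sw_ring, txq->freed, sizeof(txq->freed));
    rxq->nb_filled = REARM_BURST;
    txq->nb_freed = 0;
    return REARM_BURST;
}
```

The point of the sketch is the single bulk copy from Tx ring to Rx ring:
the mempool's lcore cache is never touched on this path.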
>
> Yes, there are NICs with one port, but there is no guarantee there
> won't be several such NICs within the system.
>
> > Furthermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve the best performance; this means we need to
> > give up some flexibility decisively. For example, 'FAST_FREE mode'
> > deletes most of the buffer checks (refcnt > 1, external buffer,
> > chained buffer), chooses the shortest path, and thereby achieves a
> > significant performance improvement.
>
> >> Wonder, did you have a chance to consider a mempool-cache ZC API,
> >> similar to the one we have for the ring?
> >> It would allow us, on the Tx free path, to avoid copying mbufs to a
> >> temporary array on the stack.
> >> Instead we can put them straight from the Tx SW ring into the
> >> mempool cache.
> >> That should save an extra store/load per mbuf and might help to
> >> achieve some performance gain without by-passing the mempool.
> >> It probably wouldn't be as fast as what you are proposing, but might
> >> be fast enough to consider as an alternative.
> >> Again, it would be a generic one, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important
> > thing is whether we can bypass the mempool decisively to pursue the
> > significant performance gains.
>
> I understand the intention, and I personally think this is a wrong and
> dangerous attitude.
> We have the mempool abstraction in place for a very good reason.
> So we need to try to improve mempool performance (and API, if
> necessary) in the first place, not avoid it and break our own rules and
> recommendations.
>
> > For ZC, there may be a problem with it in i40e. The reason for
> > putting Tx buffers into a temporary array is that i40e_tx_entry
> > includes a buffer pointer and an index.
> > Thus we cannot put a Tx SW ring entry into the mempool directly; we
> > need to first extract the mbuf pointer.
> > Finally, even though we use ZC, we still can't avoid using a
> > temporary stack to extract the Tx buffer pointers.
>
> When talking about a ZC API for the mempool cache, I meant something
> like:
>
> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>         uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
>         uint32_t nb_elem);
>
> i.e. _start_ will return to the user a pointer inside the mp-cache
> where to put free elems and the max number of slots that can be safely
> filled.
> _finish_ will update mc->len.
> As an example:
>
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>
> /* free up to n elems */
> for (i = 0; i != n; i++) {
>
>     /* get next free mbuf from somewhere */
>     mb = extract_and_prefree_mbuf(...);
>
>     /* no more free mbufs for now */
>     if (mb == NULL)
>         break;
>
>     p[i] = mb;
> }
>
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
>
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
>
> Another alternative is obviously to split i40e_tx_entry into two
> structs (one for the mbuf, the second for its metadata) and have a
> separate array for each of them.
> Though with that approach we need to make sure no perf drops will be
> introduced, plus probably more code changes will be required.
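[Editor's note] The struct-split alternative Konstantin mentions last can
be sketched like this. All names (`tx_entry_meta`, `tx_sw_ring`, the
field names) are illustrative assumptions, not the real i40e code; the
point is only that keeping the mbuf pointers in their own contiguous
array lets the free path hand that array to the mempool (or a ZC cache
slot) without an extraction loop:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical split of a Tx SW ring entry: mbuf pointers in one
 * contiguous array, per-descriptor metadata in a parallel array. */

struct tx_entry_meta {
    uint16_t next_id;   /* index of next descriptor in the chain */
    uint16_t last_id;   /* index of last descriptor in the packet */
};

struct tx_sw_ring {
    void **mbufs;                 /* contiguous mbuf pointers */
    struct tx_entry_meta *meta;   /* parallel metadata array */
    unsigned int nb_desc;
};

/* Allocate the two parallel arrays; return 0 on success, -1 on error. */
static int
tx_sw_ring_init(struct tx_sw_ring *r, unsigned int nb_desc)
{
    r->mbufs = calloc(nb_desc, sizeof(*r->mbufs));
    r->meta = calloc(nb_desc, sizeof(*r->meta));
    if (r->mbufs == NULL || r->meta == NULL)
        return -1;
    r->nb_desc = nb_desc;
    return 0;
}
```

With this layout, `r->mbufs + head` is already an array of plain object
pointers, so the free path needs no temporary stack copy; the cost, as
Konstantin notes, is touching two arrays instead of one on the hot path,
which would need benchmarking.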
[Feifei] I just uploaded an RFC patch to the community:
http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/
I do not use the ZC API; referring to the i40e avx512 path, I directly
put the mempool cache outside the API. You can take a look when you are
free. Thanks very much.
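[Editor's note] To make Konstantin's ZC fragment above concrete, here is
a self-contained mock of the proposed start/finish contract. The real
`rte_mempool_cache` is replaced by a toy struct, and the `flags`
parameter is dropped; this is an assumed model of the proposal, not the
API that was eventually merged into DPDK:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for rte_mempool_cache: a flat array plus a length. */
#define CACHE_SIZE 512

struct toy_cache {
    void *objs[CACHE_SIZE];
    uint32_t len;   /* number of objects currently cached */
};

/* _start_: return a pointer inside the cache where the caller may write
 * up to *nb_elem free objects; clamp *nb_elem to the remaining room. */
static void **
toy_cache_put_zc_start(struct toy_cache *c, uint32_t *nb_elem)
{
    uint32_t room = CACHE_SIZE - c->len;

    if (*nb_elem > room)
        *nb_elem = room;
    return &c->objs[c->len];
}

/* _finish_: commit nb_elem objects the caller actually wrote. */
static void
toy_cache_put_zc_finish(struct toy_cache *c, uint32_t nb_elem)
{
    c->len += nb_elem;
}
```

The Tx free path would write mbuf pointers straight into the slot
returned by `_start_` and commit with `_finish_`, so the temporary
on-stack array disappears; only the per-entry pointer extraction that
Feifei raises for i40e_tx_entry would remain.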