> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.anan...@yandex.ru>
> Sent: Tuesday, May 24, 2022 9:26 AM
> To: Feifei Wang <feifei.wa...@arm.com>
> Cc: nd <n...@arm.com>; dev@dpdk.org; Ruifeng Wang
> <ruifeng.w...@arm.com>; Honnappa Nagarahalli
> <honnappa.nagaraha...@arm.com>
> Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
> 
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of loads from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into its software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
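> >>>
> >>> As a minimal illustrative sketch (hypothetical names, not the actual
> >>> patch code), the re-arm step boils down to moving mbuf pointers straight
> >>> from the mapped transmit SW ring into the receive SW ring:
> >>>
> >>> #include <stdint.h>
> >>> #include <rte_mbuf.h>
> >>>
> >>> /* Hypothetical helper: refill 'nb' Rx SW ring slots with mbufs that the
> >>>  * mapped Tx queue has just finished transmitting, bypassing the lcore
> >>>  * mempool cache entirely (HW descriptor re-arming is omitted here). */
> >>> static inline uint16_t
> >>> rxq_direct_rearm(struct rte_mbuf **rx_sw_ring, struct rte_mbuf **tx_sw_ring,
> >>>                 uint16_t tx_next_dd, uint16_t nb)
> >>> {
> >>>         uint16_t i;
> >>>
> >>>         for (i = 0; i < nb; i++)
> >>>                 rx_sw_ring[i] = tx_sw_ring[tx_next_dd + i];
> >>>
> >>>         return nb;
> >>> }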
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1) The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> become lengthy. Moreover, the PMD would have to scan a list of
> >>> transmit queues to pull the buffers from.
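> >>>
> >>> As an illustration only (the API name and signature below are
> >>> hypothetical), the 1:1 case could be configured with a single mapping
> >>> call per queue pair:
> >>>
> >>> #include <stdint.h>
> >>>
> >>> /* Hypothetical mapping API: tell Rx queue (rx_port, rx_qid) to pull
> >>>  * buffers from Tx queue (tx_port, tx_qid). */
> >>> int rte_eth_direct_rxrearm_map(uint16_t rx_port, uint16_t rx_qid,
> >>>                 uint16_t tx_port, uint16_t tx_qid);
> >>>
> >>> /* 1:1 mapping on a single-port NIC: Rx queue q re-arms from Tx queue q */
> >>> static void
> >>> setup_direct_rearm_1to1(uint16_t port_id, uint16_t nb_queues)
> >>> {
> >>>         uint16_t q;
> >>>
> >>>         for (q = 0; q < nb_queues; q++)
> >>>                 rte_eth_direct_rxrearm_map(port_id, q, port_id, q);
> >>> }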
> >
> >> Just to reiterate some generic concerns about this proposal:
> >>   - We effectively link RX and TX queues - when this feature is enabled,
> >>     the user can't stop a TX queue without stopping the linked RX queue first.
> >>     Right now the user is free to start/stop any queue at will.
> >>     If this feature allows linking queues from different ports,
> >>     then even the ports become dependent and the user has to take extra
> >>     care when managing such ports.
> >
> > [Feifei] When direct rearm is enabled, there are two paths for the
> > thread to choose from. If there are enough freed Tx buffers, Rx takes
> > buffers from Tx; otherwise, Rx takes buffers from the mempool as usual.
> > Thus, users do not need to pay much attention to managing ports.
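> >
> > Roughly, the refill decision looks like this (a simplified sketch with
> > hypothetical names, not the actual patch code):
> >
> > #include <rte_mbuf.h>
> > #include <rte_mempool.h>
> >
> > /* Fill 'n' Rx slots either from the mapped Tx queue's freed mbufs
> >  * (direct re-arm) or, if not enough are available, from the mempool
> >  * as usual. */
> > static inline uint16_t
> > rx_refill(struct rte_mbuf **rx_slots, struct rte_mbuf **tx_freed,
> >                 uint16_t nb_tx_freed, struct rte_mempool *mp, uint16_t n)
> > {
> >         uint16_t i;
> >
> >         if (nb_tx_freed >= n) {
> >                 for (i = 0; i < n; i++)          /* direct re-arm path */
> >                         rx_slots[i] = tx_freed[i];
> >                 return n;
> >         }
> >
> >         /* fallback: normal allocation through the mempool/lcore cache */
> >         if (rte_mempool_get_bulk(mp, (void **)rx_slots, n) != 0)
> >                 return 0;
> >         return n;
> > }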
> 
> What I am talking about: right now different ports, or different queues of
> the same port, can be treated as independent entities:
> in general the user is free to start/stop (and even reconfigure in some
> cases) one entity without the need to stop the other.
> I.e. the user can stop and re-configure a TX queue while continuing to
> receive packets on an RX queue.
> With direct re-arm enabled, I think that wouldn't be possible any more:
> before stopping/reconfiguring a TX queue the user would have to make sure
> that the corresponding RX queue is not used by the datapath.
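>
> For example (sketch only, with placeholder port/queue ids), today the two
> stop calls below are independent, while with direct re-arm the linked Rx
> queue has to be quiesced first:
>
> #include <rte_ethdev.h>
>
> /* Stop order required once an Rx queue re-arms from a Tx queue. Stopping
>  * the Rx queue stands in for "make sure the datapath no longer uses it";
>  * in a real application the polling lcore must be synchronized as well. */
> static int
> stop_linked_queue_pair(uint16_t port, uint16_t rx_qid, uint16_t tx_qid)
> {
>         int ret;
>
>         /* 1. the Rx queue that re-arms from the Tx queue goes first */
>         ret = rte_eth_dev_rx_queue_stop(port, rx_qid);
>         if (ret != 0)
>                 return ret;
>
>         /* 2. only now is it safe to stop/reconfigure the Tx queue */
>         return rte_eth_dev_tx_queue_stop(port, tx_qid);
> }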
> 
> >
> >> - very limited usage scenario - it will have a positive effect only
> >>    when we have a fixed forwarding mapping: all (or nearly all) packets
> >>    from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, it still covers a wide
> > range of applications, such as NICs with a single port.
> 
> Yes, there are NICs with one port, but there is no guarantee there won't be
> several such NICs within the system.
> 
> > Furthermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve the best performance, which means we need to give
> > up some flexibility decisively. Take 'FAST_FREE' mode as an example: it
> > removes most of the buffer checks (refcnt > 1, external buffer, chained
> > buffer), chooses the shortest path, and thus achieves a significant
> > performance improvement.
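> >
> > For reference, that trade-off is something the application opts into
> > explicitly (illustrative snippet):
> >
> > #include <rte_ethdev.h>
> >
> > /* FAST_FREE is an explicit opt-in: the application guarantees that mbufs
> >  * come from a single mempool and have refcnt == 1, and in return the
> >  * driver takes the shortest possible free path. */
> > static void
> > enable_fast_free(struct rte_eth_conf *conf)
> > {
> >         conf->txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
> > }
> >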
> >> Wonder if you had a chance to consider a mempool-cache ZC API, similar
> >> to the one we have for the ring?
> >> It would allow us, on the TX free path, to avoid copying mbufs to a
> >> temporary array on the stack.
> >> Instead we can put them straight from the TX SW ring into the mempool cache.
> >> That should save an extra store/load per mbuf and might help to achieve
> >> some performance gain without by-passing the mempool.
> >> It probably wouldn't be as fast as what you are proposing, but it might be
> >> fast enough to consider as an alternative.
> >> Again, it would be a generic solution, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is worth a try. However, the most important question
> > is whether we can bypass the mempool decisively to pursue the
> > significant performance gains.
> 
> I understand the intention, but I personally think this is a wrong and
> dangerous attitude.
> We have the mempool abstraction in place for a very good reason.
> So we need to try to improve mempool performance (and its API, if necessary)
> in the first place, not to avoid it and break our own rules and
> recommendations.
> 
> 
> > For ZC, there may be a problem with it in i40e. The reason Tx buffers
> > are put into a temporary array is that i40e_tx_entry includes both the
> > buffer pointer and an index.
> > Thus we cannot put Tx SW ring entries into the mempool directly; we need
> > to extract the mbuf pointers first. So even though we use ZC, we still
> > can't avoid using a temporary stack array to extract the Tx buffer pointers.
> 
> When talking about a ZC API for the mempool cache I meant something like:
>
> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>                                   uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
>                                  uint32_t nb_elem);
>
> i.e. _start_ will return to the user a pointer inside the mp-cache where to
> put free elems and the max number of slots that can be safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
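>
> For illustration, such a pair could be built on top of the existing
> struct rte_mempool_cache fields (a sketch of the idea only; a real
> implementation would also have to deal with the flush threshold):
>
> #include <rte_mempool.h>
>
> static inline void **
> mempool_cache_put_zc_start(struct rte_mempool_cache *mc, uint32_t *nb_elem,
>                 uint32_t flags)
> {
>         uint32_t slots = mc->size - mc->len;    /* free slots before a flush */
>
>         RTE_SET_USED(flags);
>         if (*nb_elem > slots)
>                 *nb_elem = slots;
>
>         /* caller writes object pointers straight into the cache array */
>         return &mc->objs[mc->len];
> }
>
> static inline void
> mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t nb_elem)
> {
>         mc->len += nb_elem;     /* commit only what was actually written */
> }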
> 
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for the mbuf, the second for its metadata) and have a separate array for each
> of them.
> Though with that approach we need to make sure no perf drops are
> introduced, and probably more code changes will be required.
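>
> Something along these lines (sketch only; today's i40e_tx_entry keeps the
> mbuf pointer and the descriptor indices together):
>
> #include <stdint.h>
> #include <rte_mbuf.h>
>
> /* Hypothetical split of i40e_tx_entry: a plain array of mbuf pointers can
>  * be handed to the mempool (or a ZC cache API) without an extraction copy,
>  * while the descriptor indices live in a parallel metadata array. */
> struct i40e_tx_entry_mbuf {
>         struct rte_mbuf *mbuf;
> };
>
> struct i40e_tx_entry_meta {
>         uint16_t next_id;       /* index of the next descriptor in the chain */
>         uint16_t last_id;       /* index of the last descriptor of the packet */
> };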

[Feifei] I just uploaded an RFC patch to the community:
http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/
The patch does not use the ZC API; instead, following the i40e avx512 path, it
accesses the mempool cache directly, outside the mempool API.
You can take a look when you are free. Thanks very much.
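
For context, this follows what the existing i40e avx512 Tx free path already
does: it obtains the per-lcore cache and writes the freed mbuf pointers into
cache->objs directly (simplified sketch below; flush-threshold and NULL-cache
handling omitted):

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Simplified from the i40e avx512 Tx free path: bypass rte_mempool_put_bulk()
 * and place freed mbuf pointers straight into the per-lcore default cache. */
static inline void
tx_free_to_cache(struct rte_mempool *mp, struct rte_mbuf **txep, uint32_t n)
{
        struct rte_mempool_cache *cache =
                rte_mempool_default_cache(mp, rte_lcore_id());
        void **cache_objs = &cache->objs[cache->len];
        uint32_t i;

        for (i = 0; i < n; i++)
                cache_objs[i] = txep[i];

        cache->len += n;
}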