<snip>

> > >>>
> > >>>>
> > >>>>
> > >>>> 16/05/2022 07:10, Feifei Wang writes:
> > >>>>>
> > >>>>>>> Currently, the transmit side frees the buffers into the lcore cache and the receive side allocates buffers from the lcore cache.
> > >>>>>>> The transmit side typically frees 32 buffers, resulting in 32*8=256B of stores to the lcore cache. The receive side allocates 32 buffers and stores them in the receive side software ring, resulting in 32*8=256B of stores and 256B of loads from the lcore cache.
> > >>>>>>>
> > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating from the lcore cache, i.e. the receive side will free the buffers from the transmit side directly into its software ring. This will avoid the 256B of loads and stores introduced by the lcore cache. It also frees up the cache lines used by the lcore cache.
> > >>>>>>>
> > >>>>>>> However, this solution poses several constraints:
> > >>>>>>>
> > >>>>>>> 1) The receive queue needs to know which transmit queue it should take the buffers from. The application logic decides which transmit port to use to send out the packets. In many use cases the NIC might have a single port ([1], [2], [3]), in which case a given transmit queue is always mapped to a single receive queue (1:1 Rx queue : Tx queue). This is easy to configure.
> > >>>>>>>
> > >>>>>>> If the NIC has 2 ports (there are several references), then we will have a 1:2 (Rx queue : Tx queue) mapping, which is still easy to configure. However, if this is generalized to 'N' ports, the configuration can be long. Moreover, the PMD would have to scan a list of transmit queues to pull the buffers from.
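
For illustration only, here is a minimal sketch of the rearm path described above. All structure and field names are made up and do not correspond to the actual patch: if the mapped Tx queue has enough recyclable mbufs, Rx takes them straight from the Tx software ring; otherwise it falls back to the usual mempool allocation.

/* Hypothetical data layout for the 1:1 Rx:Tx mapping described above. */
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

struct sketch_txq {
    struct rte_mbuf **sw_ring;   /* Tx software ring */
    uint16_t next_to_recycle;    /* first already-transmitted entry */
    uint16_t nb_recycle;         /* mbufs ready to be reused */
    uint16_t mask;               /* ring size - 1 */
};

struct sketch_rxq {
    struct rte_mbuf **sw_ring;   /* Rx software ring */
    uint16_t rearm_start;
    uint16_t mask;
    struct rte_mempool *mp;
    struct sketch_txq *txq;      /* mapped Tx queue, may be NULL */
};

/* Refill 'n' Rx entries; assumes 'n' does not cross the ring wrap and
 * that recycled Tx mbufs have refcnt == 1 and come from rxq->mp. */
static int
sketch_rx_rearm(struct sketch_rxq *rxq, uint16_t n)
{
    struct sketch_txq *txq = rxq->txq;
    uint16_t i;

    if (txq != NULL && txq->nb_recycle >= n) {
        /* Direct rearm: Tx sw_ring -> Rx sw_ring, mempool bypassed. */
        for (i = 0; i < n; i++)
            rxq->sw_ring[(rxq->rearm_start + i) & rxq->mask] =
                txq->sw_ring[(txq->next_to_recycle + i) & txq->mask];
        txq->next_to_recycle = (txq->next_to_recycle + n) & txq->mask;
        txq->nb_recycle -= n;
    } else if (rte_mempool_get_bulk(rxq->mp,
            (void **)&rxq->sw_ring[rxq->rearm_start], n) != 0) {
        return -1;  /* usual fallback path failed */
    }

    rxq->rearm_start = (rxq->rearm_start + n) & rxq->mask;
    /* Writing the buffer addresses into the HW descriptors is omitted. */
    return 0;
}
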
> > >>>>>
> > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > >>>>>>     - We effectively link RX and TX queues - when this feature is enabled, the user can't stop a TX queue without stopping the linked RX queue first. Right now the user is free to start/stop any queues at will.
> > >>>>>>       If this feature allows linking queues from different ports, then even ports will become dependent and the user will have to take extra care when managing such ports.
> > >>>>>
> > >>>>> [Feifei] When direct rearm is enabled, there are two paths for the thread to choose from. If there are enough freed Tx buffers, Rx can take buffers from Tx.
> > >>>>> Otherwise, Rx will take buffers from the mempool as usual. Thus, users do not need to pay much attention to managing ports.
> > >>>>
> > >>>> What I am talking about: right now different ports or different queues of the same port can be treated as independent entities:
> > >>>> in general the user is free to start/stop (and in some cases even reconfigure) one entity without any need to stop another entity.
> > >>>> I.e. the user can stop and re-configure a TX queue while keeping receiving packets from an RX queue.
> > >>>> With direct re-arm enabled, I think it wouldn't be possible any more:
> > >>>> before stopping/reconfiguring a TX queue the user would have to make sure that the corresponding RX queue wouldn't be used by the datapath.
> > >>> I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue.
> > >>
> > >> Surely it is the user's responsibility not to call tx_burst() for a stopped/released queue.
> > >> The problem is that while TX for that queue is stopped, RX for the related queue can still continue.
> > >> So rx_burst() will try to read/modify TX queue data that might be already freed, or simultaneously modified by the control path.
> > > Understood, agree on the issue
> > >
> > >>
> > >> Again, it all can be mitigated by carefully re-designing and modifying the control and data path inside the user app - by doing extra checks and synchronizations, etc.
> > >> But from a practical point of view - I presume most users would simply avoid using this feature due to all the potential problems it might cause.
> > > That is subjective, it all depends on the performance improvements users see in their application.
> > > IMO, the performance improvement seen with this patch is worth a few changes.
> >
> > Yes, it is subjective to some extent, though my feeling is that it might end up being a sort of synthetic improvement used only by some show-case benchmarks.
> 
> I believe that one specific important use case has already been mentioned, so I don't think this is a benchmark-only feature.
+1

> 
> >  From my perspective, it would be much more plausible, if we can
> > introduce some sort of generic improvement, that doesn't impose all
> > these extra constraints and implications.
> > Like one, discussed below in that thread with ZC mempool approach.
> >
> 
> Considering this feature from a high-level perspective, I agree with Konstantin's concerns, so I'll also support his views.
We did hack the ZC mempool approach [1]; the level of improvement is pretty small compared with this patch.

[1] 
http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/

> 
> If this patch is supposed to be a generic feature, please add support for it in all NIC PMDs, not just one. (Regardless of whether the feature is defined as 1:1 mapping or N:M mapping.) It is purely software, so it should be available for all PMDs, not just your favorite hardware!
> Consider the "fast mbuf free" feature, which is pure software; why is that feature not implemented in all PMDs?
Agree, it is good to have it supported in all the drivers. We do not have favorite hardware; we just picked a PMD which we are more familiar with. We do plan to implement it in other prominent PMDs.

> 
> A secondary point I'm making here is that this specific feature will lead to an enormous amount of copy-paste code, instead of a generic library function easily available for all PMDs.
Are you talking about the i40e driver code specifically? If yes, agreed, we should avoid copy-paste and we will look to reduce that.
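
One possible shape for such a generic helper, purely as a sketch (none of these names exist in DPDK): each PMD provides a few small callbacks, while the recycling loop lives in common ethdev code, so the Tx side writes reusable mbufs directly into the Rx software ring without an intermediate array.

#include <stdint.h>
#include <rte_mbuf.h>

/* Hypothetical per-PMD callbacks (not an existing DPDK API):
 * - refill_start returns a window of Rx sw_ring slots to fill,
 * - collect writes transmitted, reusable mbufs into those slots,
 * - refill_end publishes the refilled descriptors to the HW. */
typedef struct rte_mbuf **(*sketch_rx_refill_start_t)(void *rxq, uint16_t *n);
typedef uint16_t (*sketch_tx_collect_free_t)(void *txq,
        struct rte_mbuf **slots, uint16_t n);
typedef void (*sketch_rx_refill_end_t)(void *rxq, uint16_t n);

/* Common, driver-independent part of the feature. */
static inline uint16_t
sketch_direct_rearm(void *rxq, void *txq,
        sketch_rx_refill_start_t refill_start,
        sketch_tx_collect_free_t collect,
        sketch_rx_refill_end_t refill_end, uint16_t n)
{
    struct rte_mbuf **slots = refill_start(rxq, &n);

    n = collect(txq, slots, n);
    refill_end(rxq, n);
    return n;
}
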


> 
> >
> > >>
> > >>> Like Feifei says, the RX side still has the normal packet allocation path available.
> > >>> Also, this sounds like a corner case to me; we can handle this through checks in the queue_stop API.
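
For illustration, such a check in a PMD's tx_queue_stop handler could look roughly like this (the xxx_* names and the Rx queue fields are hypothetical):

#include <errno.h>
#include <rte_ethdev.h>
#include <ethdev_driver.h>

struct xxx_rx_queue {                /* hypothetical PMD Rx queue */
    uint8_t direct_rearm_enabled;
    uint16_t mapped_tx_queue_id;     /* Tx queue it recycles buffers from */
};

static int
xxx_dev_tx_queue_stop(struct rte_eth_dev *dev, uint16_t tx_queue_id)
{
    uint16_t i;

    for (i = 0; i < dev->data->nb_rx_queues; i++) {
        struct xxx_rx_queue *rxq = dev->data->rx_queues[i];

        if (rxq != NULL && rxq->direct_rearm_enabled &&
            rxq->mapped_tx_queue_id == tx_queue_id &&
            dev->data->rx_queue_state[i] == RTE_ETH_QUEUE_STATE_STARTED)
            return -EBUSY;  /* linked Rx queue must be stopped first */
    }

    /* ... normal Tx queue stop procedure ... */
    dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
    return 0;
}
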
> > >>
> > >> Depends.
> > >> If it would be allowed to link queues only from the same port, then yes, extra checks for queue-stop might be enough, as right now DPDK doesn't allow the user to change the number of queues without dev_stop() first.
> > >> Though if it would be allowed to link queues from different ports, then the situation would be much worse.
> > >> Right now ports are totally independent entities (except some special cases like link-bonding, etc.).
> > >> While one port can keep doing RX/TX, the second one can be stopped, re-configured, even detached, and a newly attached device might re-use the same port number.
> > > I see this as a similar restriction to the one discussed above.
> >
> > Yes, they are similar in principle, though I think that the case with queues from different ports would make things much more complex.
> >
> >  > Do you see any issues if we enforce this with checks?
> >
> > Hard to tell straight away, a lot will depend on how smart such an implementation would be.
> > Usually DPDK tends to avoid heavy synchronizations within its data-path functions.
> 
> Certainly! Implementing more and more of such features in the PMDs will lead
> to longer and longer data plane code paths in the PMDs. It is the "salami
> method", where each small piece makes no performance difference, but they all
> add up, and eventually the sum of them does impact the performance of the
> general use case negatively.
It would be good to have a test running in UNH that shows the performance trend.

> 
> >
> > >>
> > >>
> > >>>>
> > >>>>>
> > >>>>>> - very limited usage scenario - it will have a positive effect only
> > >>>>>>      when we have a fixed forwarding mapping: all (or nearly all) packets
> > >>>>>>      from the RX queue are forwarded into the same TX queue.
> > >>>>>
> > >>>>> [Feifei] Although the usage scenario is limited, this usage scenario has a wide range of applications, such as a NIC with one port.
> > >>>>
> > >>>> Yes, there are NICs with one port, but there is no guarantee there wouldn't be several such NICs within the system.
> > >>> What I see in my interactions is that a single NIC/DPU is underutilized for a 2-socket system. Some are adding more sockets to the system to better utilize the DPU. The NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU per server scenario.
> > >>
> > >>
> > >> Interesting... from my experience it is vice versa:
> > >> in many cases 200Gb/s is not that much these days to saturate a modern 2-socket x86 server.
> > >> Though I suppose a lot depends on the particular HW and the actual workload.
> > >>
> > >>>
> > >>>>
> > >>>>> Furthermore, I think this is a tradeoff between performance and flexibility.
> > >>>>> Our goal is to achieve the best performance; this means we need to give up some flexibility decisively. For example, 'FAST_FREE mode' deletes most of the buffer checks (refcnt > 1, external buffer, chained buffer), chooses the shortest path, and then achieves a significant performance improvement.
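
As a rough illustration of that trade-off (simplified, not the actual i40e code): with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE the driver may assume refcnt == 1, no external or chained buffers, and a single mempool, so freed Tx mbufs can be returned in one bulk operation instead of going through per-mbuf checks.

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static void
sketch_tx_free_bufs(struct rte_mbuf **txep, uint16_t n, int fast_free)
{
    uint16_t i;

    if (fast_free) {
        /* Shortest path: one bulk put, no per-mbuf checks. */
        rte_mempool_put_bulk(txep[0]->pool, (void **)txep, n);
        return;
    }

    /* Generic path: per-mbuf checks (refcnt, external buffer, chain). */
    for (i = 0; i < n; i++) {
        struct rte_mbuf *m = rte_pktmbuf_prefree_seg(txep[i]);

        if (m != NULL)
            rte_mempool_put(m->pool, m);
    }
}
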
> > >>>>>> I wonder, did you have a chance to consider a mempool-cache ZC API, similar to the one we have for the ring?
> > >>>>>> It would allow us, on the TX free path, to avoid copying mbufs to a temporary array on the stack.
> > >>>>>> Instead we can put them straight from the TX SW ring into the mempool cache.
> > >>>>>> That should save an extra store/load per mbuf and might help to achieve some performance gain without by-passing the mempool.
> > >>>>>> It probably wouldn't be as fast as what you are proposing, but might be fast enough to consider as an alternative.
> > >>>>>> Again, it would be a generic one, so we can avoid all these implications and limitations.
> > >>>>>
> > >>>>> [Feifei] I think this is a good try. However, the most important question is whether we can bypass the mempool decisively to pursue the significant performance gains.
> > >>>>
> > >>>> I understand the intention, and I personally think this is a wrong and dangerous attitude.
> > >>>> We have the mempool abstraction in place for a very good reason.
> > >>>> So we need to try to improve mempool performance (and API if necessary) in the first place, not to avoid it and break our own rules and recommendations.
> > >>> The abstraction can be thought of at a higher level, i.e. the driver manages the buffer allocation/free and this is hidden from the application. The application does not need to be aware of how these changes are implemented.
> > >>>
> > >>>>
> > >>>>
> > >>>>> For ZC, there may be a problem with it in i40e. The reason we put Tx buffers into a temporary array is that i40e_tx_entry includes both the buffer pointer and index fields.
> > >>>>> Thus we cannot put a Tx SW ring entry into the mempool directly; we first need to extract the mbuf pointer. So even though we use ZC, we still can't avoid using a temporary stack array to extract the Tx buffer pointers.
> > >>>>
> > >>>> When talking about a ZC API for the mempool cache I meant something like:
> > >>>>
> > >>>> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc, uint32_t *nb_elem, uint32_t flags);
> > >>>> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t nb_elem);
> > >>>>
> > >>>> i.e. _start_ will return to the user a pointer inside the mp-cache where to put free elems, plus the max number of slots that can be safely filled.
> > >>>> _finish_ will update mc->len.
> > >>>> As an example:
> > >>>>
> > >>>> /* expect to free N mbufs */
> > >>>> uint32_t n = N;
> > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > >>>>
> > >>>> /* free up to n elems */
> > >>>> for (i = 0; i != n; i++) {
> > >>>>
> > >>>>      /* get next free mbuf from somewhere */
> > >>>>      mb = extract_and_prefree_mbuf(...);
> > >>>>
> > >>>>      /* no more free mbufs for now */
> > >>>>      if (mb == NULL)
> > >>>>         break;
> > >>>>
> > >>>>      p[i] = mb;
> > >>>> }
> > >>>>
> > >>>> /* finalize ZC put, with _i_ freed elems */
> > >>>> mempool_cache_put_zc_finish(mc, i);
> > >>>>
> > >>>> That way, I think we can overcome the issue with i40e_tx_entry
> > >>>> you mentioned above. Plus it might be useful in other similar places.
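
A rough sketch of how a driver's Tx free path could use this to drop the temporary stack array (the two mempool_cache_put_zc_* functions are the hypothetical ones proposed above; the tx-entry layout is a simplified stand-in for i40e_tx_entry):

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical API from the proposal above. */
void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
        uint32_t *nb_elem, uint32_t flags);
void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
        uint32_t nb_elem);

/* Simplified stand-in: mbuf pointer plus metadata that prevents the
 * entry itself from being handed to the mempool. */
struct sketch_tx_entry {
    struct rte_mbuf *mbuf;
    uint16_t next_id;
    uint16_t last_id;
};

static void
sketch_tx_free_bufs_zc(struct rte_mempool_cache *mc,
        struct sketch_tx_entry *txep, uint32_t nb_free)
{
    uint32_t i;
    uint32_t n = nb_free;

    /* Reserve slots directly inside the mempool cache. */
    void **slot = mempool_cache_put_zc_start(mc, &n, 0);
    if (slot == NULL)
        return;

    /* Extract the mbuf pointers straight into the cache slots,
     * without any temporary array on the stack. */
    for (i = 0; i < n; i++)
        slot[i] = txep[i].mbuf;

    mempool_cache_put_zc_finish(mc, i);
    /* Any remaining (nb_free - i) mbufs would be handled separately. */
}
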
> > >>>>
> > >>>> Another alternative is obviously to split i40e_tx_entry into two structs (one for the mbuf, the second for its metadata) and have a separate array for each of them.
> > >>>> Though with that approach we need to make sure no perf drops will be introduced, plus probably more code changes will be required.
> > >>> Commit 5171b4ee6b6 already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.
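
For reference, the split being discussed amounts to something like the following (illustrative names only; the actual AVX512 code in commit 5171b4ee6b6 may differ):

#include <stdint.h>
#include <rte_mbuf.h>

/* Hot array: one tightly packed mbuf pointer per Tx descriptor, so a
 * contiguous chunk can be handed to the mempool (or a ZC cache slot)
 * directly, e.g. as (void **)&sw_ring[first]. */
struct sketch_tx_sw_entry {
    struct rte_mbuf *mbuf;
};

/* Parallel array holding the remaining per-descriptor metadata. */
struct sketch_tx_sw_entry_meta {
    uint16_t next_id;
    uint16_t last_id;
};
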
> > >
> >
