> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 29 June 2022 23.58
> 
> <snip>
> 
> > > >>>>
> > > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > > >>>>>
> > > >>>>>>> Currently, the transmit side frees the buffers into the lcore
> > > >>>>>>> cache and the receive side allocates buffers from the lcore cache.
> > > >>>>>>> The transmit side typically frees 32 buffers, resulting in
> > > >>>>>>> 32*8=256B of stores to the lcore cache. The receive side allocates
> > > >>>>>>> 32 buffers and stores them in the receive side software ring,
> > > >>>>>>> resulting in 32*8=256B of stores and 256B of loads from the
> > > >>>>>>> lcore cache.
> > > >>>>>>>
> > > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> > > >>>>>>> from the lcore cache, i.e. the receive side will free the buffers
> > > >>>>>>> from the transmit side directly into its software ring. This will
> > > >>>>>>> avoid the 256B of loads and stores introduced by the lcore cache.
> > > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > > >>>>>>>
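[For illustration, a minimal sketch of the direct-rearm path described above.
All structure, field and function names here are hypothetical, not the actual
patch code; ring wrap-around handling is omitted for brevity.]

    #include <stdint.h>

    struct mbuf;                    /* stand-in for struct rte_mbuf */

    struct txq {
            struct mbuf **sw_ring;  /* mbufs already transmitted by HW */
            uint16_t free_idx;      /* first completed, not yet freed */
            uint16_t nb_free;       /* number of completed mbufs */
    };

    struct rxq {
            struct mbuf **sw_ring;  /* descriptors awaiting fresh mbufs */
            uint16_t rearm_idx;     /* first descriptor to refill */
    };

    /* Refill n Rx descriptors straight from the mapped Tx queue's SW ring,
     * bypassing the lcore mempool cache. Returns 0 if the caller should
     * fall back to the usual mempool-based rearm path. */
    static inline uint16_t
    rx_direct_rearm(struct rxq *rxq, struct txq *txq, uint16_t n)
    {
            uint16_t i;

            if (txq->nb_free < n)
                    return 0;       /* not enough freed Tx buffers */

            for (i = 0; i < n; i++)
                    /* One load + one store per mbuf pointer; no traffic
                     * through the lcore mempool cache. */
                    rxq->sw_ring[rxq->rearm_idx + i] =
                            txq->sw_ring[txq->free_idx + i];

            txq->free_idx += n;
            txq->nb_free -= n;
            rxq->rearm_idx += n;
            return n;
    }
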
> > > >>>>>>> However, this solution poses several constraints:
> > > >>>>>>>
> > > >>>>>>> 1) The receive queue needs to know which transmit queue it should
> > > >>>>>>> take the buffers from. The application logic decides which
> > > >>>>>>> transmit port to use to send out the packets. In many use
> > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > > >>>>>>> which case a given transmit queue is always mapped to a single
> > > >>>>>>> receive queue (1:1 Rx queue : Tx queue). This is easy to configure.
> > > >>>>>>>
> > > >>>>>>> If the NIC has 2 ports (there are several references), then we
> > > >>>>>>> will have a 1:2 (RX queue : TX queue) mapping, which is still easy
> > > >>>>>>> to configure. However, if this is generalized to 'N' ports, the
> > > >>>>>>> configuration can be long. Moreover, the PMD would have to
> > > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > > >>>>>
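[For illustration, a 1:1 mapping as described above could be expressed at
configuration time with one call per queue pair. The function name, signature
and surrounding code below are hypothetical, not the exact API of the patch
series:]

    /* Somewhere in the application's port/queue setup (illustrative): */
    static int
    setup_direct_rearm(void)
    {
            /* Let Rx queue 0 of port 0 refill its descriptors directly
             * from Tx queue 0 of the same port (1:1 Rx:Tx mapping). */
            int ret = rte_eth_direct_rxrearm_map(0 /* rx port */,
                                                 0 /* rx queue */,
                                                 0 /* tx port */,
                                                 0 /* tx queue */);

            /* On error, simply keep using the normal mempool-based path. */
            return ret;
    }
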
> > > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > > >>>>>>     - We effectively link RX and TX queues - when this feature
> > > >>>>>>       is enabled, the user can't stop a TX queue without stopping
> > > >>>>>>       the linked RX queue first.
> > > >>>>>>       Right now the user is free to start/stop any queues at will.
> > > >>>>>>       If the feature allows linking queues from different ports,
> > > >>>>>>       then even ports become dependent, and the user will have to
> > > >>>>>>       pay extra care when managing such ports.
> > > >>>>>
> > > >>>>> [Feifei] When direct rearm is enabled, there are two paths for the
> > > >>>>> thread to choose from. If there are enough freed Tx buffers, Rx can
> > > >>>>> take buffers from Tx. Otherwise, Rx will take buffers from the
> > > >>>>> mempool as usual. Thus, users do not need to pay much attention to
> > > >>>>> managing ports.
> > > >>>>
> > > >>>> What I am talking about: right now different ports, or different
> > > >>>> queues of the same port, can be treated as independent entities:
> > > >>>> in general the user is free to start/stop (and in some cases even
> > > >>>> reconfigure) one entity without needing to stop the other entity.
> > > >>>> I.e. the user can stop and re-configure a TX queue while continuing
> > > >>>> to receive packets from the RX queue.
> > > >>>> With direct re-arm enabled, I think it wouldn't be possible any more:
> > > >>>> before stopping/reconfiguring the TX queue, the user would have to
> > > >>>> make sure that the corresponding RX queue wouldn't be used by the
> > > >>>> datapath.
> > > >>> I am trying to understand the problem better. For the TX queue to be
> > > >>> stopped, the user must have blocked the data plane from accessing the
> > > >>> TX queue.
> > > >>
> > > >> Surely it is the user's responsibility not to call tx_burst() for a
> > > >> stopped/released queue.
> > > >> The problem is that while TX for that queue is stopped, RX for the
> > > >> related queue can still continue.
> > > >> So rx_burst() will try to read/modify TX queue data that might be
> > > >> already freed, or simultaneously modified by the control path.
> > > > Understood, agree on the issue
> > > >
> > > >>
> > > >> Again, it all can be mitigated by carefully re-designing and modifying
> > > >> the control and data-path inside the user app - by doing extra checks
> > > >> and synchronizations, etc.
> > > >> But from a practical point of view - I presume most users would simply
> > > >> avoid using this feature due to all the potential problems it might
> > > >> cause.
> > > > That is subjective; it all depends on the performance improvements
> > > > users see in their application.
> > > > IMO, the performance improvement seen with this patch is worth a few
> > > > changes.
> > >
> > > Yes, it is subjective to some extent, though my feeling is that it
> > > might end up being a sort of synthetic improvement used only by some
> > > show-case benchmarks.
> >
> > I believe that one specific important use case has already been mentioned,
> > so I don't think this is a benchmark-only feature.
> +1
> 
> >
> > > From my perspective, it would be much more plausible if we could
> > > introduce some sort of generic improvement that doesn't impose all
> > > these extra constraints and implications.
> > > Like the one discussed below in this thread with the ZC mempool approach.
> > >
> >
> > Considering this feature from a high-level perspective, I agree with
> > Konstantin's concerns, so I'll also support his views.
> We did hack up the ZC mempool approach [1]; the level of improvement is
> pretty small compared with this patch.
> 
> [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/
> 
> >
> > If this patch is supposed to be a generic feature, please add support for
> > it in all NIC PMDs, not just one. (Regardless of whether the feature is
> > defined as 1:1 mapping or N:M mapping.) It is purely software, so it should
> > be available for all PMDs, not just your favorite hardware! Consider the
> > "fast mbuf free" feature, which is pure software; why is that feature not
> > implemented in all PMDs?
> Agree, it is good to have it supported in all the drivers. We do not have a
> favorite hardware; we just picked a PMD which we are more familiar with. We
> do plan to implement it in other prominent PMDs.
> 
> >
> > A secondary point I'm making here is that this specific feature will lead
> > to an enormous amount of copy-paste code, instead of a generic library
> > function easily available for all PMDs.
> Are you talking about the i40e driver code specifically? If yes, agreed, we
> should avoid copy-paste and we will look to reduce that.

Yes, I am talking about the code that needs to be copied into all prominent 
PMDs. Perhaps you can move the majority of it into a common directory, if not 
into a generic library, so the modification per PMD becomes smaller. (I see 
the same copy-paste issue with the "fast mbuf free" feature, if it is to be 
supported by other PMDs than i40e.)

Please note that I do not expect you to implement this feature in other PMDs 
than the ones you need. I was trying to make the point that implementing a 
software feature in a PMD requires copy-pasting it to other PMDs, which can 
require a big effort; while implementing it in a library and calling the 
library from the PMDs requires a smaller effort per PMD. I intentionally 
phrased it somewhat provokingly, and was lucky not to offend anyone. :-)
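
[Purely as an illustration of the library-vs-copy-paste point above: if the
rearm logic lived in a shared helper, each PMD's rearm function could shrink
to a thin wrapper like the sketch below. Every name here is hypothetical;
nothing like this exists in DPDK today.]

    #include <stdint.h>

    struct rearm_ctx;                       /* opaque, owned by the library */

    /* Hypothetical generic helper, implemented once in a common library;
     * returns the number of descriptors it rearmed directly from Tx. */
    uint16_t common_direct_rearm(struct rearm_ctx *ctx, uint16_t nb);

    struct pmd_rx_queue {
            struct rearm_ctx *rearm_ctx;
            uint16_t rearm_thresh;
    };

    /* The PMD's existing mempool-based rearm path. */
    void pmd_rxq_rearm_from_mempool(struct pmd_rx_queue *rxq);

    static inline void
    pmd_rxq_rearm(struct pmd_rx_queue *rxq)
    {
            if (common_direct_rearm(rxq->rearm_ctx, rxq->rearm_thresh) > 0)
                    return;                         /* rearmed from Tx ring */
            pmd_rxq_rearm_from_mempool(rxq);        /* normal fallback */
    }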

> 
> 
> >
> > >
> > > >>
> > > >>> Like Feifei says, the RX side still has the normal packet allocation
> > > >>> path available.
> > > >>> Also, this sounds like a corner case to me; we can handle this through
> > > >>> checks in the queue_stop API.
> > > >>
> > > >> Depends.
> > > >> If it is allowed to link queues only from the same port, then yes,
> > > >> extra checks for queue-stop might be enough,
> > > >> as right now DPDK doesn't allow the user to change the number of
> > > >> queues without dev_stop() first.
> > > >> Though if it is allowed to link queues from different ports, then the
> > > >> situation will be much worse.
> > > >> Right now ports are totally independent entities (except some special
> > > >> cases like link-bonding, etc.).
> > > >> While one port keeps doing RX/TX, the second one can be stopped,
> > > >> re-configured, even detached, and a newly attached device might re-use
> > > >> the same port number.
> > > > I see this as a similar restriction to the one discussed above.
> > >
> > > Yes, they are similar in principle, though I think that the case with
> > > queues from different ports would make things much more complex.
> > >
> > >  > Do you see any issues if we enforce this with checks?
> > >
> > > Hard to tell straight away; a lot will depend on how smart such an
> > > implementation would be.
> > > Usually DPDK tends not to use heavy synchronizations within its
> > > data-path functions.
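
[A minimal sketch of what such a check in the Tx queue-stop path could look
like, for the same-port case discussed above; the struct fields and function
below are hypothetical, not existing ethdev/PMD code:]

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    struct pmd_rx_queue { int started; };

    struct pmd_tx_queue {
            /* Rx queue configured to rearm from this Tx queue, or NULL. */
            struct pmd_rx_queue *direct_rearm_rxq;
    };

    struct pmd_dev { struct pmd_tx_queue **tx_queues; };

    static int
    pmd_tx_queue_stop(struct pmd_dev *dev, uint16_t tx_queue_id)
    {
            struct pmd_tx_queue *txq = dev->tx_queues[tx_queue_id];

            /* Refuse to stop the Tx queue while a linked Rx queue may still
             * be pulling buffers from its SW ring on the datapath. */
            if (txq->direct_rearm_rxq != NULL &&
                txq->direct_rearm_rxq->started)
                    return -EBUSY;

            /* ...normal queue-stop processing would follow here... */
            return 0;
    }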
> >
> > Certainly! Implementing more and more of such features in the PMDs will
> > lead to longer and longer data plane code paths in the PMDs. It is the
> > "salami method", where each small piece makes no performance difference,
> > but they all add up, and eventually the sum of them does impact the
> > performance of the general use case negatively.
> It would be good to have a test running in UNH that shows the performance
> trend.
+1

> 
> >
> > >
> > > >>
> > > >>
> > > >>>>
> > > >>>>>
> > > >>>>>> - very limited usage scenario - it will have a positive effect
> > > >>>>>>      only when we have a fixed forwarding mapping: all (or nearly
> > > >>>>>>      all) packets from the RX queue are forwarded into the same
> > > >>>>>>      TX queue.
> > > >>>>>
> > > >>>>> [Feifei] Although the usage scenario is limited, this usage scenario
> > > >>>>> has a wide range of applications, such as a NIC with one port.
> > > >>>>
> > > >>>> Yes, there are NICs with one port, but there is no guarantee there
> > > >>>> wouldn't be several such NICs within the system.
> > > >>> What I see in my interactions is that a single NIC/DPU is
> > > >>> under-utilized in a 2-socket system. Some are adding more sockets to
> > > >>> the system to better utilize the DPU. The NIC bandwidth continues to
> > > >>> grow significantly. I do not think there will be a multi-DPU per
> > > >>> server scenario.
> > > >>
> > > >>
> > > >> Interesting... from my experience it is vice versa:
> > > >> in many cases 200Gb/s is not that much these days to saturate a modern
> > > >> 2-socket x86 server.
> > > >> Though I suppose a lot depends on the particular HW and actual
> > > >> workload.
> > > >>
> > > >>>
> > > >>>>
> > > >>>>> Furthermore, I think this is a tradeoff between performance and
> > > >>>>> flexibility.
> > > >>>>> Our goal is to achieve the best performance; this means we need to
> > > >>>>> give up some flexibility decisively. For example, 'FAST_FREE mode'
> > > >>>>> deletes most of the buffer checks (refcnt > 1, external buffer,
> > > >>>>> chain buffer), chooses the shortest path, and thereby achieves a
> > > >>>>> significant performance improvement.
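
[As a simplified illustration of the FAST_FREE shortcut mentioned above, not
the literal driver code: with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE the per-mbuf
checks can be skipped and the buffers returned to a single mempool in bulk.]

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Simplified Tx free path; 'bufs' holds mbufs whose descriptors the HW
     * has completed. */
    static inline void
    txq_free_bufs(struct rte_mbuf **bufs, uint16_t n, uint64_t offloads)
    {
            uint16_t i;

            if (offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
                    /* FAST_FREE contract: refcnt == 1, no external buffer,
                     * no chained segments, all mbufs from one mempool, so
                     * everything can go back in a single bulk put. */
                    rte_mempool_put_bulk(bufs[0]->pool, (void **)bufs, n);
                    return;
            }

            /* Generic path: per-segment checks (refcnt, ext buf, chain). */
            for (i = 0; i < n; i++) {
                    struct rte_mbuf *m = rte_pktmbuf_prefree_seg(bufs[i]);

                    if (m != NULL)
                            rte_mempool_put(m->pool, m);
            }
    }
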
> > > >>>>>> I wonder, did you have a chance to consider a mempool-cache ZC
> > > >>>>>> API, similar to the one we have for the ring?
> > > >>>>>> It would allow us, on the TX free path, to avoid copying mbufs to
> > > >>>>>> a temporary array on the stack.
> > > >>>>>> Instead we can put them straight from the TX SW ring into the
> > > >>>>>> mempool cache.
> > > >>>>>> That should save an extra store/load per mbuf and might help to
> > > >>>>>> achieve some performance gain without by-passing the mempool.
> > > >>>>>> It probably wouldn't be as fast as what you are proposing, but it
> > > >>>>>> might be fast enough to consider as an alternative.
> > > >>>>>> Again, it would be a generic one, so we can avoid all these
> > > >>>>>> implications and limitations.
> > > >>>>>
> > > >>>>> [Feifei] I think this is a good try. However, the most important
> > > >>>>> thing is whether we can bypass the mempool decisively to pursue
> > > >>>>> significant performance gains.
> > > >>>>
> > > >>>> I understand the intention, and I personally think this is a wrong
> > > >>>> and dangerous attitude.
> > > >>>> We have the mempool abstraction in place for a very good reason.
> > > >>>> So we need to try to improve mempool performance (and the API, if
> > > >>>> necessary) in the first place, not to avoid it and break our own
> > > >>>> rules and recommendations.
> > > >>> The abstraction can be thought of at a higher level, i.e. the driver
> > > >>> manages the buffer allocation/free and is hidden from the application.
> > > >>> The application does not need to be aware of how these changes are
> > > >>> implemented.
> > > >>>>
> > > >>>>
> > > >>>>> For ZC, there may be a problem with it in i40e. The reason that we
> > > >>>>> put Tx buffers into a temporary array is that i40e_tx_entry includes
> > > >>>>> both a buffer pointer and an index.
> > > >>>>> Thus we cannot put the Tx SW ring entries into the mempool directly;
> > > >>>>> we need to first extract the mbuf pointers. So, even though we use
> > > >>>>> ZC, we still can't avoid using a temporary stack array to extract
> > > >>>>> the Tx buffer pointers.
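
[For reference, the Tx SW ring entry being discussed looks roughly like the
sketch below (simplified from drivers/net/i40e/i40e_rxtx.h); because the mbuf
pointer is interleaved with per-descriptor metadata, &sw_ring[i] is not an
array of plain mbuf pointers that could be handed to the mempool as-is.]

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Simplified i40e Tx software ring entry. */
    struct i40e_tx_entry {
            struct rte_mbuf *mbuf;  /* buffer attached to the descriptor */
            uint16_t next_id;       /* index of the next descriptor */
            uint16_t last_id;       /* index of the packet's last descriptor */
    };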
> > > >>>>
> > > >>>> When talking about a ZC API for the mempool cache, I meant something
> > > >>>> like:
> > > >>>>
> > > >>>> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> > > >>>>                                   uint32_t *nb_elem, uint32_t flags);
> > > >>>> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
> > > >>>>                                  uint32_t nb_elem);
> > > >>>>
> > > >>>> I.e. _start_ will return to the user a pointer inside the mp-cache
> > > >>>> where to put the free elems, and the max number of slots that can be
> > > >>>> safely filled.
> > > >>>> _finish_ will update mc->len.
> > > >>>> As an example:
> > > >>>>
> > > >>>> /* expect to free N mbufs */
> > > >>>> uint32_t n = N;
> > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > >>>>
> > > >>>> /* free up to n elems */
> > > >>>> for (i = 0; i != n; i++) {
> > > >>>>
> > > >>>>      /* get next free mbuf from somewhere */
> > > >>>>      mb = extract_and_prefree_mbuf(...);
> > > >>>>
> > > >>>>      /* no more free mbufs for now */
> > > >>>>      if (mb == NULL)
> > > >>>>         break;
> > > >>>>
> > > >>>>      p[i] = mb;
> > > >>>> }
> > > >>>>
> > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > >>>>
> > > >>>> That way, I think we can overcome the issue with i40e_tx_entry you
> > > >>>> mentioned above. Plus it might be useful in other similar places.
> > > >>>>
> > > >>>> Another alternative is obviously to split i40e_tx_entry into two
> > > >>>> structs (one for the mbuf, the second for its metadata) and have a
> > > >>>> separate array for each of them.
> > > >>>> Though with that approach we need to make sure no perf drops will be
> > > >>>> introduced, plus probably more code changes will be required.
> > > >>> Commit '5171b4ee6b6' already does this (in a different way), but just
> > > >>> for AVX512. Unfortunately, it does not record any performance
> > > >>> improvements. We could port this to Arm NEON and look at the
> > > >>> performance.
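
[A sketch of the split discussed above, similar in spirit to keeping the mbuf
pointers in their own contiguous array; the names are illustrative, not the
actual i40e or AVX512 code:]

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Mbuf pointers and descriptor metadata kept in two parallel arrays, so
     * the mbuf-pointer array can be handed to bulk/ZC mempool puts directly. */
    struct tx_entry_mbuf {
            struct rte_mbuf *mbuf;
    };

    struct tx_entry_meta {
            uint16_t next_id;
            uint16_t last_id;
    };

    struct tx_queue_split {
            struct tx_entry_mbuf *sw_ring;       /* contiguous mbuf pointers */
            struct tx_entry_meta *sw_ring_meta;  /* parallel metadata array */
            uint16_t nb_tx_desc;
    };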
> > > >
> > >
