> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 29 June 2022 22.44
>
> <snip>
>
> >
> > 04/06/2022 13:51, Andrew Rybchenko writes:
> > > On 6/4/22 15:19, Morten Brørup wrote:
> > >>> From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> > >>> Sent: Saturday, 4 June 2022 13.10
> > >>>
> > >>> On Sat, Jun 4, 2022 at 3:30 PM Andrew Rybchenko
> > >>> <andrew.rybche...@oktetlabs.ru> wrote:
> > >>>>
> > >>>> On 6/4/22 12:33, Jerin Jacob wrote:
> > >>>>> On Sat, Jun 4, 2022 at 2:39 PM Morten Brørup
> > >>> <m...@smartsharesystems.com> wrote:
> > >>>>>>
> > >>>>>> I would like the DPDK community to change its view on compile
> > >>>>>> time options. Here is why:
> > >>>>>>
> > >>>>>> Application specific performance micro-optimizations like
> > >>>>>> “fast mbuf free” and “mbuf direct re-arm” are being added to
> > >>>>>> DPDK and presented as features.
> > >>>>>>
> > >>>>>> They are not features, but optimizations, and I don’t
> > >>>>>> understand the need for them to be available at run-time!
> > >>>>>>
> > >>>>>> Instead of adding a bunch of exotic exceptions to the fast
> > >>>>>> path of the PMDs, they should be compile time options. This
> > >>>>>> will improve performance by avoiding branches in the fast
> > >>>>>> path, both for the applications using them, and for generic
> > >>>>>> applications (where the exotic code is omitted).
> > >>>>>
> > >>>>> Agree. I think, keeping the best of both worlds would be
> > >>>>>
> > >>>>> - Enable the feature/optimization at runtime
> > >>>>> - Have a compile-time option to disable the
> > >>>>>   feature/optimization as an override.
> > >>>>
> > >>>> It is hard to find the right balance, but in general compile
> > >>>> time options are a nightmare for maintenance. The number of
> > >>>> required builds will grow exponentially.
> > >>
> > >> Test combinations are exponential for N features, regardless of
> > >> whether the N are runtime or compile time options.
> > >
> > > But since I'm talking about build checks I don't care about
> > > exponential growth in run time. Yes, testing should care, but it
> > > is a separate story.
> > >
> > >>
> > >>>> Of course, we can limit the number of checked combinations,
> > >>>> but it will result in a flow of patches to fix the build in
> > >>>> other cases.
> > >>>
> > >>> The build breakage can be fixed if we use (2) vs (1)
> > >>>
> > >>> 1)
> > >>> #ifdef ...
> > >>> My feature
> > >>> #endif
> > >>>
> > >>> 2)
> > >>> static __rte_always_inline int
> > >>> rte_has_xyz_feature(void)
> > >>> {
> > >>> #ifdef RTE_LIBRTE_XYZ_FEATURE
> > >>>         return RTE_LIBRTE_XYZ_FEATURE;
> > >>> #else
> > >>>         return 0;
> > >>> #endif
> > >>> }
> > >>>
> > >>> if (rte_has_xyz_feature()) {
> > >>>         My feature code
> > >>> }
> > >>>
> > >
> > > Jerin, thanks, very good example.
> > >
> > >> I'm not sure all the features can be covered by that, e.g. added
> > >> fields in structures.
> > >
> > > +1
> > >
> > >>
> > >> Also, I would consider such features "opt in" at compile time
> > >> only. As such, they could be allowed to break the ABI/API.
> > >>
> > >>>
> > >>>> Also compile time options tend to make code less readable,
> > >>>> which makes all aspects of development harder.
> > >>>>
> > >>>> Yes, compile time is nice for micro optimizations, but I have
> > >>>> great concerns that it is the right way to go.
> > >>>>
> > >>>>>> Please note that I am only talking about the performance
> > >>>>>> optimizations that are limited to application specific use
> > >>>>>> cases. I think it makes sense to require that performance
> > >>>>>> optimizing an application also requires recompiling the
> > >>>>>> performance critical libraries used by it.
> > >>>>>> abandon some of existing functionality to create a 'short-cut'
> > >>>>>>
> > >>>>>> Allowing compile time options for application specific
> > >>>>>> performance optimizations in DPDK would also open a path for
> > >>>>>> other optimizations, which can only be achieved at compile
> > >>>>>> time, such as “no fragmented packets”, “no attached mbufs”
> > >>>>>> and “single mbuf pool”. And even more exotic optimizations,
> > >>>>>> such as the “indexed mempool cache”, which was rejected due
> > >>>>>> to ABI violations – they could be marked as “risky and
> > >>>>>> untested” or similar, but still be part of the DPDK main
> > >>>>>> repository.
> > >>>>>>
> >
> > Thanks Morten for bringing it up, it is an interesting topic.
> > Though I look at it from a different angle.
> > All the optimizations you mentioned above introduce new limitations:
> > MBUF_FAST_FREE - no indirect mbufs and no multiple mempools,
> > mempool object indexes - mempool size is limited to 4GB,
> > direct rearm - drop the ability to stop/reconfigure the TX queue
> > while the RX queue is still running, etc.
> > Note that all these limitations are not forced by HW.
> > All of them are pure SW limitations that developers forced in (or
> > tried to) to get a little extra performance.
> > That's a concerning tendency.
> >
> > As more and more such 'optimization via limitation' changes come in:
> > - The DPDK feature list will become more and more fragmented.
> > - It would cause more and more confusion for the users.
> > - Unmet expectations - the difference in performance between the
> >   'default' and 'optimized' versions of DPDK will become bigger
> >   and bigger.
I strongly disagree with this bullet! We should not limit the
performance to only what is possible with all features enabled. An
application developer should have the ability to disable
performance-costly features that are not being used.

> > - As Andrew already mentioned, maintaining all these 'sub-flavours'
> >   of DPDK will become more and more difficult.
> The point that we need to remember is that these
> features/optimizations are introduced after seeing performance
> issues in practical use cases. DPDK is not being used in just one
> use case; it is being used in several use cases which have their own
> unique requirements. Is 4GB enough for packet buffers - yes, it is
> enough in certain use cases. Are there NICs with a single port -
> yes, there are. HW is being created because use cases and business
> cases exist. It is obvious that as DPDK gets adopted on more
> platforms that differ largely, the features will increase and it
> will become complex. Complexity should not be used as a criterion to
> reject patches.
>
> There is a different perspective to what you are calling
> 'limitations'. I can argue that multiple mempools, and
> stopping/reconfiguring a TX queue while the RX queue is still
> running, are exotic. Just because those are allowed currently
> (probably accidentally) does not mean they are being used. Are there
> use cases that make use of these features?
>
> The base/existing design for DPDK was done with one particular HW
> architecture in mind where there was an abundance of resources.
> Unfortunately, that HW architecture is fast evolving and DPDK is
> adopted in use cases where that kind of resources are not available.
> For example, efficiency cores are being introduced by every CPU
> vendor now. Soon enough, we will see big-little architecture in
> networking as well. The existing PMD design introduces 512B of
> stores (256B for copying to a stack variable and 256B to store the
> lcore cache) and 256B of loads/stores on the RX side every 32
> packets back to back.
> It doesn't make sense to have that kind of memcopy for
> little/efficiency cores just for the driver code.
>
> >
> > So, probably instead of making such changes easier, we need
> > somehow to persuade developers to think more about optimizations
> > that would be generic and transparent to the user.
> Or maybe we need to think of creating alternate ways of programming.

Exactly what I was hoping to achieve with this discussion.

> > I do realize that it is not always possible due to various reasons
> > (HW limitations, external dependencies, etc.) but that's another
> > story.
> >
> > Let's take for example MBUF_FAST_FREE.
> > In fact, I am not sure that we need it as a TX offload flag at all.
> > The PMD TX path has all the necessary information to decide at
> > run-time whether it can do fast_free() or not:
> > At tx_burst() the PMD can check whether all mbufs satisfy these
> > conditions (same mempool, refcnt==1) and update some fields and/or
> > counters inside the TXQ to reflect it.
> > Then, at tx_free() we can use this info to decide between
> > fast_free() and normal_free().
> > As we read the mbuf fields at tx_burst() anyway, I guess the
> > impact of this extra step would be minimal.
> > Yes, most likely it wouldn't be as fast as with the current TX
> > offload flag, or the conditional compilation approach.
> > But it might still be significantly faster than normal_free(),
> > plus such an approach would be generic and transparent to the user.
> IMO, this depends on the philosophy that we want to adopt. I would
> prefer to make the control plane complex for performance gains on
> the data plane. The performance on the data plane has a multiplying
> effect due to the ratio of the number of cores assigned to the data
> plane vs the control plane.

Yes. And if some performance-costly feature cannot be moved out of the
data plane into the control plane, it should be compile time optional.
And please note that I don't buy the argument that "it will be caught
by branch prediction".
You are not allowed to fill up my branch predictor table with cruft!

>
> I am not against evaluating alternatives, but the alternative
> approaches need to have similar (not the same) performance.
>
> >
> > Konstantin