> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 29 June 2022 22.44
>
> <snip>
>
> >
> > 04/06/2022 13:51, Andrew Rybchenko writes:
> > > On 6/4/22 15:19, Morten Brørup wrote:
> > >>> From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> > >>> Sent: Saturday, 4 June 2022 13.10
> > >>>
> > >>> On Sat, Jun 4, 2022 at 3:30 PM Andrew Rybchenko
> > >>> <andrew.rybche...@oktetlabs.ru> wrote:
> > >>>>
> > >>>> On 6/4/22 12:33, Jerin Jacob wrote:
> > >>>>> On Sat, Jun 4, 2022 at 2:39 PM Morten Brørup
> > >>> <m...@smartsharesystems.com> wrote:
> > >>>>>>
> > >>>>>> I would like the DPDK community to change its view on compile
> > >>>>>> time options. Here is why:
> > >>>>>>
> > >>>>>> Application specific performance micro-optimizations like
> > >>>>>> “fast mbuf free” and “mbuf direct re-arm” are being added to
> > >>>>>> DPDK and presented as features.
> > >>>>>>
> > >>>>>> They are not features, but optimizations, and I don’t
> > >>>>>> understand the need for them to be available at run-time!
> > >>>>>>
> > >>>>>> Instead of adding a bunch of exotic exceptions to the fast
> > >>>>>> path of the PMDs, they should be compile time options. This
> > >>>>>> will improve performance by avoiding branches in the fast
> > >>>>>> path, both for the applications using them, and for generic
> > >>>>>> applications (where the exotic code is omitted).
> > >>>>>
> > >>>>> Agree. I think, keeping the best of both worlds would be
> > >>>>>
> > >>>>> - Enable the feature/optimization at runtime
> > >>>>> - Have a compile-time option to disable the
> > >>>>>   feature/optimization as an override.
> > >>>>
> > >>>> It is hard to find the right balance, but in general compile
> > >>>> time options are a nightmare for maintenance. The number of
> > >>>> required builds will grow exponentially.
> > >>
> > >> Test combinations are exponential for N features, regardless of
> > >> whether the N are runtime or compile time options.
> > >
> > > But since I'm talking about build checks I don't care about
> > > exponential growth in run time. Yes, testing should care, but it
> > > is a separate story.
> > >
> > >>
> > >>>> Of course, we can limit the number of checked combinations,
> > >>>> but it will result in a flow of patches to fix the build in
> > >>>> other cases.
> > >>>
> > >>> The build breakage can be fixed if we use (2) vs (1)
> > >>>
> > >>> 1)
> > >>> #ifdef ...
> > >>> My feature
> > >>> #endif
> > >>>
> > >>> 2)
> > >>> static __rte_always_inline int
> > >>> rte_has_xyz_feature(void)
> > >>> {
> > >>> #ifdef RTE_LIBRTE_XYZ_FEATURE
> > >>>         return RTE_LIBRTE_XYZ_FEATURE;
> > >>> #else
> > >>>         return 0;
> > >>> #endif
> > >>> }
> > >>>
> > >>> if (rte_has_xyz_feature()) {
> > >>>         My feature code
> > >>> }
> > >>>
> > >
> > > Jerin, thanks, very good example.
> > >
> > >> I'm not sure all the features can be covered by that, e.g. added
> > >> fields in structures.
> > >
> > > +1
> > >
> > >>
> > >> Also, I would consider such features "opt in" at compile time
> > >> only. As such, they could be allowed to break the ABI/API.
> > >>
> > >>>
> > >>>> Also compile time options tend to make code less readable,
> > >>>> which makes all aspects of development harder.
> > >>>>
> > >>>> Yes, compile time is nice for micro optimizations, but I have
> > >>>> great concerns that it is the right way to go.
> > >>>>
> > >>>>>> Please note that I am only talking about the performance
> > >>>>>> optimizations that are limited to application specific use
> > >>>>>> cases. I think it makes sense to require that performance
> > >>>>>> optimizing an application also requires recompiling the
> > >>>>>> performance critical libraries used by it.
> > >>>>>> abandon some of existing functionality to create a 'short-cut'
> > >>>>>>
> > >>>>>> Allowing compile time options for application specific
> > >>>>>> performance optimizations in DPDK would also open a path for
> > >>>>>> other optimizations, which can only be achieved at compile
> > >>>>>> time, such as “no fragmented packets”, “no attached mbufs”
> > >>>>>> and “single mbuf pool”. And even more exotic optimizations,
> > >>>>>> such as the “indexed mempool cache”, which was rejected due
> > >>>>>> to ABI violations – they could be marked as “risky and
> > >>>>>> untested” or similar, but still be part of the DPDK main
> > >>>>>> repository.
> > >>>>>>
> >
> > Thanks Morten for bringing it up, it is an interesting topic.
> > Though I look at it from a different angle.
> > All the optimizations you mentioned above introduce new limitations:
> > MBUF_FAST_FREE - no indirect mbufs and no multiple mempools,
> > mempool object indexes - mempool size is limited to 4GB,
> > direct rearm - drop the ability to stop/reconfigure the TX queue
> > while the RX queue is still running, etc.
> > Note that all these limitations are not forced by HW.
> > All of them are pure SW limitations that developers forced in (or
> > tried to) to get a little extra performance.
> > That's a concerning tendency.
> >
> > As more and more such 'optimization via limitation' changes come in:
> > - The DPDK feature list will become more and more fragmented.
> > - It would cause more and more confusion for the users.
> > - Unmet expectations - the difference in performance between the
> >   'default' and 'optimized' versions of DPDK will become bigger
> >   and bigger.
I strongly disagree with this bullet! We should not limit the
performance to only what is possible with all features enabled. An
application developer should have the ability to disable
performance-costly features that are not being used.

> > - As Andrew already mentioned, maintaining all these 'sub-flavours'
> >   of DPDK will become more and more difficult.
> The point that we need to remember is that these
> features/optimizations are introduced after seeing performance
> issues in practical use cases. DPDK is not being used in just one
> use case; it is being used in several use cases which have their own
> unique requirements. Is 4GB enough for packet buffers - yes, it is
> enough in certain use cases. Are there NICs with a single port -
> yes, there are. HW is being created because use cases and business
> cases exist. It is obvious that as DPDK gets adopted on more
> platforms that differ largely, the features will increase and it
> will become complex. Complexity should not be used as a criterion to
> reject patches.
>
> There is a different perspective to what you are calling
> 'limitations'. I can argue that multiple mempools, and
> stopping/reconfiguring a TX queue while the RX queue is still
> running, are exotic. Just because those are allowed currently
> (probably accidentally) does not mean they are being used. Are there
> use cases that make use of these features?
>
> The base/existing design for DPDK was done with one particular HW
> architecture in mind where there was an abundance of resources.
> Unfortunately, that HW architecture is fast evolving and DPDK is
> adopted in use cases where that kind of resources are not available.
> For example, efficiency cores are being introduced by every CPU
> vendor now. Soon enough, we will see big-little architecture in
> networking as well. The existing PMD design introduces 512B of
> stores (256B for copying to a stack variable and 256B to store the
> lcore cache) and 256B of loads/stores on the RX side every 32
> packets back to back.
> It doesn't make sense to have that kind of memcopy for
> little/efficiency cores just for the driver code.
>
> >
> > So, probably instead of making such changes easier, we need
> > somehow to persuade developers to think more about optimizations
> > that would be generic and transparent to the user.
> Or maybe we need to think of creating alternate ways of programming.

Exactly what I was hoping to achieve with this discussion.

> > I do realize that it is not always possible due to various reasons
> > (HW limitations, external dependencies, etc.) but that's another
> > story.
> >
> > Let's take for example MBUF_FAST_FREE.
> > In fact, I am not sure that we need it as a TX offload flag at all.
> > The PMD TX path has all the necessary information to decide at
> > run-time whether it can do fast_free() or not:
> > At tx_burst() the PMD can check whether all mbufs satisfy these
> > conditions (same mempool, refcnt==1) and update some fields and/or
> > counters inside the TXQ to reflect it.
> > Then, at tx_free() we can use this info to decide between
> > fast_free() and normal_free().
> > As we read the mbuf fields at tx_burst() anyway, I guess the
> > impact of this extra step would be minimal.
> > Yes, most likely it wouldn't be as fast as with the current TX
> > offload flag, or the conditional compilation approach.
> > But it might still be significantly faster than normal_free(),
> > plus such an approach would be generic and transparent to the user.
> IMO, this depends on the philosophy that we want to adopt. I would
> prefer to make the control plane complex for performance gains on
> the data plane. The performance on the data plane has a multiplying
> effect due to the ratio of the number of cores assigned to the data
> plane vs the control plane.

Yes. And if some performance-costly feature cannot be moved out of the
data plane into the control plane, it should be compile time optional.
And please note that I don't buy the argument that "it will be caught
by branch prediction".
You are not allowed to fill up my branch predictor table with cruft!

>
> I am not against evaluating alternatives, but the alternative
> approaches need to have similar (not the same) performance.
>
> >
> > Konstantin