> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 4 September 2023 14.07
>
> On 2023-09-01 20:52, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Friday, 1 September 2023 18.58
> >>
> >> On 2023-09-01 14:26, Thomas Monjalon wrote:
> >>> 27/08/2023 10:34, Morten Brørup:
> >>>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>>> +CC Mattias, PRNG lib maintainer
> >>>>
> >>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>> Sent: Friday, 25 August 2023 11.24
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>>>> +CC mempool maintainers
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>>>
> >>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>>>> Bruce,
> >>>>>>>>
> >>>>>>>> With this patch [1], it is noted that the ring producer and consumer data should not be on adjacent cache lines, for performance reasons.
> >>>>>>>>
> >>>>>>>> [1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> >>>>>>>>
> >>>>>>>> (It's obvious that they cannot share the same cache line, because they are accessed by two different threads.)
> >>>>>>>>
> >>>>>>>> Intuitively, I would think that having them on different cache lines would suffice. Why does having an empty cache line between them make a difference?
> >>>>>>>>
> >>>>>>>> And does it need to be an empty cache line? Or does it suffice having the second structure start two cache lines after the start of the first structure (e.g. if the size of the first structure is two cache lines)?
> >>>>>>>>
> >>>>>>>> I'm asking because the same principle might apply to other code too.
> >>>>>>>>
> >>>>>>> Hi Morten,
> >>>>>>>
> >>>>>>> this was something we discovered when working on the distributor library. If we have cachelines per core where there is heavy access, having some cachelines as a gap between the content cachelines can help performance. We believe this helps due to avoiding issues with the HW prefetchers (e.g. adjacent cacheline prefetcher) bringing in the second cacheline speculatively when an operation is done on the first line.
> >>>>>>
> >>>>>> I guessed that it had something to do with speculative prefetching, but wasn't sure. Good to get confirmation, and that it has a measurable effect somewhere. Very interesting!
> >>>>>>
> >>>>>> NB: More comments in the ring lib about stuff like this would be nice.
> >>>>>>
> >>>>>> So, for the mempool lib, what do you think about applying the same technique to the rte_mempool_debug_stats structure (which is an array indexed per lcore)... Two adjacent lcores heavily accessing their local mempool caches seems likely to me. But how heavy does the access need to be for this technique to be relevant?
> >>>>>>
> >>>>> No idea how heavy the accesses need to be for this to have a noticeable effect. For things like debug stats, I wonder how worthwhile making such a change would be, but then again, any change would have very low impact too in that case.
> >>>>
> >>>> I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.
> >>>>
> >>>> So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:
> >>>
> >>> +1 to make more visibility in doc and adding a macro, good idea!
> >>>
> >>
> >> A worry I have is that for CPUs with large (in this context) N, you will end up with a lot of padding to avoid next-N-lines false sharing. That would be padding after, and in the general (non-array) case also before, the actual per-lcore data. A slight nuisance is also that those prefetched lines of padding will never contain anything useful, and thus fetching them will always be a waste.
> >
> > Out of curiosity, what is the largest N anyone here on the list is aware of?
> >
> >>
> >> Padding/alignment may not be the only way to avoid HW-prefetcher-induced false sharing for per-lcore data structures.
> >>
> >> What we are discussing here is organizing the statically allocated per-lcore structs of a particular module in an array with the appropriate padding/alignment. In this model, all data related to a particular module is close (memory address/page-wise), but not so close as to cause false sharing.
> >>
> >> /* rte_a.c */
> >>
> >> struct rte_a_state
> >> {
> >>         int x;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_a_state a_states[RTE_MAX_LCORE];
> >>
> >> /* rte_b.c */
> >>
> >> struct rte_b_state
> >> {
> >>         char y;
> >>         char z;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_b_state b_states[RTE_MAX_LCORE];
> >>
> >> What you would end up with at runtime, when the linker has done its job, is something that essentially looks like this (in memory):
> >>
> >> struct {
> >>         struct rte_a_state a_states[RTE_MAX_LCORE];
> >>         struct rte_b_state b_states[RTE_MAX_LCORE];
> >> };
> >>
> >> You could consider turning it around, and keeping data (i.e., module structs) related to a particular lcore, for all modules, close. In other words, keeping per-lcore arrays of variable-sized elements.
> >>
> >> So, something that will end up looking like this (in memory, not in the source code):
> >>
> >> struct rte_lcore_state
> >> {
> >>         struct rte_a_state a_state;
> >>         struct rte_b_state b_state;
> >>         RTE_CACHE_GUARD;
> >> };
> >>
> >> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
> >>
> >> In such a scenario, the per-lcore struct type for a module need not (and should not) be cache-line-aligned (but may still have some alignment requirements). Data will be more tightly packed, and the "next lines" prefetched may actually be useful (although I'm guessing in practice they usually will not be).
> >>
> >> There may be several ways to implement that scheme. The above is to illustrate how things would look in memory, not necessarily at the level of the source code.
> >>
> >> One way could be to fit the per-module-per-lcore struct in a chunk of memory allocated in a per-lcore heap. In such a case, the DPDK heap would need an extension, maybe with semantics similar to that of NUMA-node-specific allocations.
> >>
> >> Another way would be to use thread-local storage (TLS, __thread), although it's unclear to me how well TLS works with larger data structures.
> >>
> >> A third way may be to somehow achieve something that looks like the above example, using macros, without breaking module encapsulation or generally being too intrusive or otherwise cumbersome.
> >>
> >> Not sure this is worth the trouble (compared to just more padding), but I thought it was an idea worth sharing.
> >
> > I think what Mattias suggests is relevant, and it would be great if a generic solution could be found for DPDK.
> >
> > For reference, we initially used RTE_PER_LCORE(module_variable), i.e. thread-local storage, extensively in our application modules. But it has two disadvantages:
> > 1. TLS does not use hugepages. (The same applies to global and local variables, BTW.)
> > 2.
> > We need to set up global pointers to these TLS variables, so they can be accessed from the main lcore (e.g. for statistics). This means that every module needs some sort of module_per_lcore_init, called by the thread after its creation, to set module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> >
>
> Good points. I never thought about the initialization issue.
>
> How about memory consumption and TLS? If you have many non-EAL threads in the DPDK process, would the system allocate TLS memory for DPDK lcore-specific data structures? Assuming a scenario where __thread was used instead of the standard DPDK pattern.
Room for all the __thread variables is allocated and initialized for every thread started.

Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:

#define RTE_DEFINE_PER_LCORE(type, name) \
        __thread __typeof__(type) per_lcore_##name

> > Eventually, we gave up and migrated to the DPDK standard design pattern of instantiating a global module_variable[RTE_MAX_LCORE], and letting each thread use its own entry in that array.
> >
> > And as Mattias suggests, this adds a lot of useless padding, because each module's variables now need to start on their own cache line.
> >
> > So a generic solution with packed per-thread data would be a great long-term solution.
> >
> > Short term, I can live with the simple cache guard. It is very easy to implement and use.
> >
>
> The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it needs to be explicitly added everywhere (just like __rte_cache_aligned), and error prone, and somewhat brittle (in the face of a changed <N>).

Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)

> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to somehow invite something more efficient, robust and easier to use.)

I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don't forget to also ACK this patch. ;-)

There should be a wish list, where concepts like your suggested improvement could be easily noted.