> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 4 September 2023 14.07
>
> On 2023-09-01 20:52, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Friday, 1 September 2023 18.58
> >>
> >> On 2023-09-01 14:26, Thomas Monjalon wrote:
> >>> 27/08/2023 10:34, Morten Brørup:
> >>>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>>> +CC Mattias, PRNG lib maintainer
> >>>>
> >>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>> Sent: Friday, 25 August 2023 11.24
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>>>> +CC mempool maintainers
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>>>
> >>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>>>> Bruce,
> >>>>>>>>
> >>>>>>>> With this patch [1], it is noted that the ring producer and consumer data should not be on adjacent cache lines, for performance reasons.
> >>>>>>>>
> >>>>>>>> [1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> >>>>>>>>
> >>>>>>>> (It's obvious that they cannot share the same cache line, because they are accessed by two different threads.)
> >>>>>>>>
> >>>>>>>> Intuitively, I would think that having them on different cache lines would suffice. Why does having an empty cache line between them make a difference?
> >>>>>>>>
> >>>>>>>> And does it need to be an empty cache line? Or does it suffice having the second structure start two cache lines after the start of the first structure (e.g. if the size of the first structure is two cache lines)?
> >>>>>>>>
> >>>>>>>> I'm asking because the same principle might apply to other code too.
> >>>>>>>>
> >>>>>>> Hi Morten,
> >>>>>>>
> >>>>>>> this was something we discovered when working on the distributor library. If we have cachelines per core where there is heavy access, having some cachelines as a gap between the content cachelines can help performance. We believe this helps due to avoiding issues with the HW prefetchers (e.g. adjacent cacheline prefetcher) bringing in the second cacheline speculatively when an operation is done on the first line.
> >>>>>>
> >>>>>> I guessed that it had something to do with speculative prefetching, but wasn't sure. Good to get confirmation, and that it has a measurable effect somewhere. Very interesting!
> >>>>>>
> >>>>>> NB: More comments in the ring lib about stuff like this would be nice.
> >>>>>>
> >>>>>> So, for the mempool lib, what do you think about applying the same technique to the rte_mempool_debug_stats structure (which is an array indexed per lcore)... Two adjacent lcores heavily accessing their local mempool caches seems likely to me. But how heavy does the access need to be for this technique to be relevant?
> >>>>>>
> >>>>> No idea how heavy the accesses need to be for this to have a noticeable effect. For things like debug stats, I wonder how worthwhile making such a change would be, but then again, any change would have very low impact too in that case.
> >>>>
> >>>> I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.
> >>>>
> >>>> So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:
> >>>
> >>> +1 to make more visibility in doc and adding a macro, good idea!
> >>>
> >>
> >> A worry I have is that for CPUs with large (in this context) N, you will end up with a lot of padding to avoid next-N-lines false sharing. That would be padding after, and in the general (non-array) case also before, the actual per-lcore data. A slight nuisance is also that those prefetched lines of padding will never contain anything useful, and thus fetching them will always be a waste.
> >
> > Out of curiosity, what is the largest N anyone here on the list is aware of?
> >
> >>
> >> Padding/alignment may not be the only way to avoid HW-prefetcher-induced false sharing for per-lcore data structures.
> >>
> >> What we are discussing here is organizing the statically allocated per-lcore structs of a particular module in an array with the appropriate padding/alignment. In this model, all data related to a particular module is close (memory address/page-wise), but not so close as to cause false sharing.
> >>
> >> /* rte_a.c */
> >>
> >> struct rte_a_state
> >> {
> >>         int x;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_a_state a_states[RTE_MAX_LCORE];
> >>
> >> /* rte_b.c */
> >>
> >> struct rte_b_state
> >> {
> >>         char y;
> >>         char z;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_b_state b_states[RTE_MAX_LCORE];
> >>
> >> What you would end up with at runtime, when the linker has done its job, is something that essentially looks like this (in memory):
> >>
> >> struct {
> >>         struct rte_a_state a_states[RTE_MAX_LCORE];
> >>         struct rte_b_state b_states[RTE_MAX_LCORE];
> >> };
> >>
> >> You could consider turning it around, and keeping data (i.e., module structs) related to a particular lcore, for all modules, close. In other words, keeping per-lcore arrays of variable-sized elements.
> >>
> >> So, something that will end up looking like this (in memory, not in the source code):
> >>
> >> struct rte_lcore_state
> >> {
> >>         struct rte_a_state a_state;
> >>         struct rte_b_state b_state;
> >>         RTE_CACHE_GUARD;
> >> };
> >>
> >> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
> >>
> >> In such a scenario, the per-lcore struct type for a module need not (and should not) be cache-line-aligned (but may still have some alignment requirements). Data will be more tightly packed, and the "next lines" prefetched may actually be useful (although I'm guessing in practice they usually will not be).
> >>
> >> There may be several ways to implement that scheme. The above is to illustrate how things would look in memory, not necessarily at the level of the source code.
> >>
> >> One way could be to fit the per-module-per-lcore struct in a chunk of memory allocated in a per-lcore heap. In such a case, the DPDK heap would need an extension, maybe with semantics similar to that of NUMA-node-specific allocations.
> >>
> >> Another way would be to use thread-local storage (TLS, __thread), although it's unclear to me how well TLS works with larger data structures.
> >>
> >> A third way may be to somehow achieve something that looks like the above example, using macros, without breaking module encapsulation or generally being too intrusive or otherwise cumbersome.
> >>
> >> Not sure this is worth the trouble (compared to just more padding), but I thought it was an idea worth sharing.
> >
> > I think what Mattias suggests is relevant, and it would be great if a generic solution could be found for DPDK.
> >
> > For reference, we initially used RTE_PER_LCORE(module_variable), i.e. thread-local storage, extensively in our application modules. But it has two disadvantages:
> > 1. TLS does not use hugepages. (The same applies to global and local variables, BTW.)
> > 2.
> > We need to set up global pointers to these TLS variables, so they can be accessed from the main lcore (e.g. for statistics). This means that every module needs some sort of module_per_lcore_init, called by the thread after its creation, to set module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> >
>
> Good points. I never thought about the initialization issue.
>
> How about memory consumption and TLS? If you have many non-EAL threads in the DPDK process, would the system allocate TLS memory for DPDK lcore-specific data structures? Assuming a scenario where __thread was used instead of the standard DPDK pattern.
Room for all the __thread variables is allocated and initialized for every thread started.

Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:

#define RTE_DEFINE_PER_LCORE(type, name) \
        __thread __typeof__(type) per_lcore_##name

> > Eventually, we gave up and migrated to the DPDK standard design pattern of instantiating a global module_variable[RTE_MAX_LCORE], and letting each thread use its own entry in that array.
> >
> > And as Mattias suggests, this adds a lot of useless padding, because each module's variables now need to start on their own cache line.
> >
> > So a generic solution with packed per-thread data would be a great long-term solution.
> >
> > Short term, I can live with the simple cache guard. It is very easy to implement and use.
> >
>
> The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it needs to be explicitly added everywhere (just like __rte_cache_aligned), and error prone, and somewhat brittle (in the face of a changed <N>).

Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)

> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to somehow invite something more efficient, robust and easier to use.)

I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don't forget to also ACK this patch. ;-)

There should be a wish list, where concepts like your suggested improvement could be easily noted.