On 2023-09-01 14:26, Thomas Monjalon wrote:
27/08/2023 10:34, Morten Brørup:
+CC Honnappa and Konstantin, Ring lib maintainers
+CC Mattias, PRNG lib maintainer

From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 11.24

On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
+CC mempool maintainers

From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 10.23

On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
Bruce,

With this patch [1], it is noted that the ring producer and consumer data
should not be on adjacent cache lines, for performance reasons.

[1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0

(It's obvious that they cannot share the same cache line, because they are
accessed by two different threads.)

Intuitively, I would think that having them on different cache lines would
suffice. Why does having an empty cache line between them make a difference?

And does it need to be an empty cache line? Or does it suffice having the
second structure start at two cache lines after the start of the first
structure (e.g. if the size of the first structure is two cache lines)?

I'm asking because the same principle might apply to other code too.

Hi Morten,

this was something we discovered when working on the distributor library.
If we have cachelines per core where there is heavy access, having some
cachelines as a gap between the content cachelines can help performance. We
believe this helps due to avoiding issues with the HW prefetchers (e.g.
adjacent cacheline prefetcher) bringing in the second cacheline speculatively
when an operation is done on the first line.
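
For illustration, a minimal sketch of the layout being described (simplified
field names, assuming 64-byte cache lines; this is not the actual rte_ring
definition) could look like this:

#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

struct ring_like {
        /* producer metadata, written by the producer thread */
        struct {
                volatile uint32_t head;
                volatile uint32_t tail;
        } prod __attribute__((aligned(CACHE_LINE_SIZE)));

        /* empty guard cache line: if the adjacent-cacheline prefetcher
         * speculatively pulls in the line after prod, it gets this unused
         * line instead of the consumer's line */
        char guard[CACHE_LINE_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)));

        /* consumer metadata, written by the consumer thread */
        struct {
                volatile uint32_t head;
                volatile uint32_t tail;
        } cons __attribute__((aligned(CACHE_LINE_SIZE)));
};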

I guessed that it had something to do with speculative prefetching, but
wasn't sure. Good to get confirmation, and that it has a measurable effect
somewhere. Very interesting!

NB: More comments in the ring lib about stuff like this would be nice.

So, for the mempool lib, what do you think about applying the same technique
to the rte_mempool_debug_stats structure (which is an array indexed per
lcore)... Two adjacent lcores heavily accessing their local mempool caches
seems likely to me. But how heavy does the access need to be for this
technique to be relevant?


No idea how heavy the accesses need to be for this to have a noticeable
effect. For things like debug stats, I wonder how worthwhile making such a
change would be, but then again, any change would have very low impact too
in that case.

I just tried adding padding to some of the hot structures in our own
application, and observed a significant performance improvement for those.

So I think this technique should have higher visibility in DPDK by adding a
new cache macro to rte_common.h:

+1 for more visibility in the docs and for adding a macro, good idea!
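
The macro definition itself is not quoted above; purely for illustration (not
necessarily the definition proposed in this thread, and assuming 64-byte
cache lines), such a guard macro could be sketched along these lines:

/* Illustrative sketch only. Expands to one padding cache line inside a
 * struct; __COUNTER__ (a GCC/Clang extension) gives each guard a unique
 * member name so the macro can be used more than once per struct. Depending
 * on how many lines the HW prefetcher pulls in, more than one guard line
 * may be wanted. */
#define CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[64] __attribute__((aligned(64)))
#define CACHE_GUARD_HELPER1(unique) CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD CACHE_GUARD_HELPER1(__COUNTER__)

With a definition along those lines, each "RTE_CACHE_GUARD;" in the examples
further down expands to a uniquely named, cache-aligned padding member.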

A worry I have is that for CPUs with large (in this context) N, you will end up with a lot of padding to avoid next-N-lines false sharing. That would be padding after, and in the general (non-array) case also before, the actual per-lcore data. A slight nuisance is also that those prefetched lines of padding will never contain anything useful, and thus fetching them will always be a waste.

Padding/alignment may not be the only way to avoid HW-prefetcher-induced false sharing for per-lcore data structures.

What we are discussing here is organizing the statically allocated per-lcore structs of a particular module in an array with the appropriate padding/alignment. In this model, all data related to a particular module is close (memory address/page-wise), but not so close as to cause false sharing.

/* rte_a.c */

struct rte_a_state
{
        int x;
        RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct rte_a_state a_states[RTE_MAX_LCORE];

/* rte_b.c */

struct rte_b_state
{
        char y;
        char z;
        RTE_CACHE_GUARD;
} __rte_cache_aligned;


static struct rte_b_state b_states[RTE_MAX_LCORE];

What you would end up with in runtime when the linker has done its job is something that essentially looks like this (in memory):

struct {
        struct rte_a_state a_states[RTE_MAX_LCORE];
        struct rte_b_state b_states[RTE_MAX_LCORE];
};

You could consider turning it around, and keeping data (i.e., module structs) related to a particular lcore, for all modules, close. In other words, keeping a per-lcore array of variable-sized elements.

So, something that will end up looking like this (in memory, not in the source code):

struct rte_lcore_state
{
        struct rte_a_state a_state;
        struct rte_b_state b_state;
        RTE_CACHE_GUARD;
};

struct rte_lcore_state lcore_states[RTE_MAX_LCORE];

In such a scenario, the per-lcore struct type for a module need not (and should not) be cache-line-aligned (but may still have some alignment requirements). Data will be more tightly packed, and the "next lines" prefetched may actually be useful (although I'm guessing that in practice they usually will not be).

There may be several ways to implement that scheme. The above is to illustrate how things would look in memory, not necessarily on the level of the source code.

One way could be to fit the per-module-per-lcore struct in a chunk of memory allocated in a per-lcore heap. In such a case, the DPDK heap would need extension, maybe with semantics similar to that of NUMA-node specific allocations.
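
For comparison, today an allocation can already be directed to a particular NUMA node; the idea would be an analogous call taking an lcore id instead of a socket id (hypothetical, no such API exists at the time of writing). A rough sketch, reusing the rte_a_state struct from the example above:

#include <rte_malloc.h>
#include <rte_lcore.h>
#include <rte_memory.h>

/* Today: NUMA-node-directed allocation with rte_malloc_socket(). A
 * per-lcore heap would offer a similar call taking an lcore id, so that
 * allocations made for the same lcore end up packed together and the
 * "next lines" prefetched belong to that lcore's own data. */
static struct rte_a_state *
a_state_alloc(void)
{
        return rte_malloc_socket("rte_a_state", sizeof(struct rte_a_state),
                                 RTE_CACHE_LINE_SIZE, rte_socket_id());
}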

Another way would be to use thread-local storage (TLS, __thread), although it's unclear to me how well TLS works with larger data structures.
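
A minimal sketch of the TLS variant, again reusing the rte_a_state example and assuming the state only needs to be touched by its owning thread:

/* Sketch of the TLS approach: each thread gets its own instance, placed in
 * that thread's TLS block together with the TLS variables of other modules,
 * which gives roughly the per-lcore consolidation described above.
 * Different threads' instances live in separate TLS blocks, so no explicit
 * cache guards between lcores are needed. */
struct rte_a_state {
        int x;
};

static __thread struct rte_a_state a_state;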

A third way may be to somehow achieve something that looks like the above example, using macros, without breaking module encapsulation or generally be too intrusive or otherwise cumbersome.

Not sure this is worth the trouble (compared to just more padding), but I thought it was an idea worth sharing.
