RE: [RFC 1/5] eal: add static per-lcore memory allocation facility

Morten Brørup Mon, 19 Feb 2024 03:10:50 -0800

> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 19 February 2024 08.49
> 
> On 2024-02-09 14:04, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Friday, 9 February 2024 12.46
> >>
> >> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
> >>>> Sent: Thursday, 8 February 2024 19.17
> >>>>
> >>>> Introduce DPDK per-lcore id variables, or lcore variables for
> short.
> >>>>
> >>>> An lcore variable has one value for every current and future lcore
> >>>> id-equipped thread.
> >>>>
> >>>> The primary <rte_lcore_var.h> use case is for statically
> allocating
> >>>> small chunks of often-used data, which is related logically, but
> >> where
> >>>> there are performance benefits to reap from having updates being
> >> local
> >>>> to an lcore.
> >>>>
> >>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
> C11
> >>>> _Thread_local), but decoupling the values' life time with that of
> >> the
> >>>> threads.
> >>>>
> >>>> Lcore variables are also similar in terms of functionality
> provided
> >> by
> >>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >>>> build-time machinery. DPCPU uses linker scripts, which effectively
> >>>> prevents the reuse of its, otherwise seemingly viable, approach.
> >>>>
> >>>> The currently-prevailing way to solve the same problem as lcore
> >>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
> >> sized
> >>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >>>> lcore variables over this approach is that data related to the
> same
> >>>> lcore now is close (spatially, in memory), rather than data used
> by
> >>>> the same module, which in turn avoid excessive use of padding,
> >>>> polluting caches with unused data.
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>> ---
> >>>
> >>> This looks very promising. :-)
> >>>
> >>> Here's a bunch of comments, questions and suggestions.
> >>>
> >>>
> >>> * Question: Performance.
> >>> What is the cost of accessing an lcore variable vs a variable in
> TLS?
> >>> I suppose the relative cost diminishes if the variable is a larger
> >> struct, compared to a simple uint64_t.
> >>>
> >>
> >> In case all the relevant data is available in a cache close to the
> >> core,
> >> both options carry quite low overhead.
> >>
> >> Accessing a lcore variable will always require a TLS lookup, in the
> >> form
> >> of retrieving the lcore_id of the current thread. In that sense,
> there
> >> will likely be a number of extra instructions required to do the
> lcore
> >> variable address lookup (i.e., doing the load from rte_lcore_var
> table
> >> based on the lcore_id you just looked up, and adding the variable's
> >> offset).
> >>
> >> A TLS lookup will incur an extra overhead of less than a clock
> cycle,
> >> compared to accessing a non-TLS static variable, in case static
> linking
> >> is used. For shared objects, TLS is much more expensive (something
> >> often
> >> visible in dynamically linked DPDK app flame graphs, in the form of
> the
> >> __tls_addr symbol). Then you need to add ~3 cc/access. This on a
> micro
> >> benchmark running on a x86_64 Raptor Lake P-core.
> >>
> >> (To visialize the difference between shared object and not, one can
> use
> >> Compiler Explorer and -fPIC versus -fPIE.)
> >>
> >> Things get more complicated if you access the same variable in the
> same
> >> section code, since then it can be left on the stack/in a register
> by
> >> the compiler, especially if LTO is used. In other words, if you do
> >> rte_lcore_id() several times in a row, only the first one will cost
> you
> >> anything. This happens fairly often in DPDK, with rte_lcore_id().
> >>
> >> Finally, if you do something like
> >>
> >> diff --git a/lib/eal/common/rte_random.c
> b/lib/eal/common/rte_random.c
> >> index af9fffd81b..a65c30d27e 100644
> >> --- a/lib/eal/common/rte_random.c
> >> +++ b/lib/eal/common/rte_random.c
> >> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state
> *state)
> >>    static __rte_always_inline
> >>    struct rte_rand_state *__rte_rand_get_state(void)
> >>    {
> >> -       unsigned int idx;
> >> -
> >> -       idx = rte_lcore_id();
> >> -
> >> -       if (unlikely(idx == LCORE_ID_ANY))
> >> -               return &unregistered_rand_state;
> >> -
> >> -       return RTE_LCORE_VAR_PTR(rand_state);
> >> +       return &unregistered_rand_state;
> >>    }
> >>
> >>    uint64_t
> >>
> >> ...and re-run the rand_perf_autotest, at least I see no difference
> at
> >> all (in a statically linked build). Both results in rte_rand() using
> >> ~11
> >> cc/call. What that suggests is that TLS overhead is very small, and
> >> that
> >> any extra instructions required by lcore variables doesn't add much,
> if
> >> anything at all, at least in this particular case.
> >
> > Excellent. Thank you for a thorough and detailed answer, Mattias.
> >
> >>
> >>> Some of my suggestions below might also affect performance.
> >>>
> >>>
> >>> * Advantage: Provides direct access to worker thread variables.
> >>> With the current alternative (thread-local storage), the main
> thread
> >> cannot access the TLS variables of the worker threads,
> >>> unless worker threads publish global access pointers.
> >>> Lcore variables of any lcore thread can be direcctly accessed by
> any
> >> thread, which simplifies code.
> >>>
> >>>
> >>> * Advantage: Roadmap towards hugemem.
> >>> It would be nice if the lcore variable memory was allocated in
> >> hugemem, to reduce TLB misses.
> >>> The current alternative (thread-local storage) is also not using
> >> hugement, so not a degradation.
> >>>
> >>
> >> I agree, but the thing is it's hard to figure out how much memory is
> >> required for these kind of variables, given how DPDK is built and
> >> linked. In an OS kernel, you can just take all the symbols, put them
> in
> >> a special section, and size that section. Such a thing can't easily
> be
> >> done with DPDK, since shared object builds are supported, plus that
> >> this
> >> facility should be available not only to DPDK modules, but also the
> >> application, so relying on linker scripts isn't really feasible (not
> >> probably not even feasible for DPDK itself).
> >>
> >> In that scenario, you want to size up the per-lcore buffer to be so
> >> large, you don't have to worry about overruns. That will waste
> memory.
> >> If you use huge page memory, paging can't help you to avoid
> >> pre-allocating actual physical memory.
> >
> > Good point.
> > I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
> but I hadn't considered how paging helps us use less physical memory
> than that.
> >
> >>
> >> That said, even large (by static per-lcore data standards) buffers
> are
> >> potentially small enough not to grow the amount of memory used by a
> >> DPDK
> >> process too much. You need to provision for RTE_MAX_LCORE of them
> >> though.
> >>
> >> The value of lcore variables should be small, and thus incur few TLB
> >> misses, so you may not gain much from huge pages. In my world, it's
> >> more
> >> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
> >> rather than the easier "fitting often-used per-lcore data into a
> >> working
> >> set size reasonably expected to be covered by hardware TLB/caches".
> >
> > Yes, I suppose that lcore variables are intended to be small, and
> large per-lcore structures should keep following the current design
> patterns for allocation and access.
> >
> 
> It seems to me that support for per-lcore heaps should be the solution
> for supporting use cases requiring many, larger and/or dynamic objects
> on a per-lcore basis.
> 
> Ideally, you would design both that mechanism and lcore variables
> together, but then if you couple enough amount of improvements together
> you will never get anywhere. An instance of where perfect is the enemy
> of good, perhaps.


So true. :-)

> 
> > Perhaps this guideline is worth mentioning in the documentation.
> >
> 
> What is missing, more specifically? The size limitation and the static
> nature of lcore variables is described, and what current design
> patterns
> they expected to (partly) replace is also covered.

Your documentation is fine, and nothing specific is missing here.
I was thinking out loud that the high level DPDK documentation should describe 
common design patterns.

> 
> >>
> >>> Lcore variables are available very early at startup, so I guess the
> >> RTE memory allocator is not yet available.
> >>> Hugemem could be allocated using O/S allocation, so there is a
> >> possible road towards using hugemem.
> >>>
> >>
> >> With the current design, that true. I'm not sure it's a strict
> >> requirement though, but it does makes things simpler.
> >>
> >>> Either way, using hugement would require one more indirection (the
> >> pointer to the allocated hugemem).
> >>> I don't know which has better performance, using hugemem or
> avoiding
> >> the additional pointer dereferencing.
> >>>
> >>>
> >>> * Suggestion: Consider adding an entry for unregistered non-EAL
> >> threads.
> >>> Please consider making room for one more entry, shared by all
> >> unregistered non-EAL threads, i.e.
> >>> making the array size RTE_MAX_LCORE + 1 and indexing by
> >> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> >>>
> >>> It would be convenient for the use cases where a variable shared by
> >> the unregistered non-EAL threads don't need special treatment.
> >>>
> >>
> >> I thought about this, but it would require a conditional in the
> lookup
> >> macro, as you show. More importantly, it would make the whole
> >> <rte_lcore_var.h> thing less elegant and harder to understand. It's
> bad
> >> enough that "per-lcore" is actually "per-lcore id" (or the
> equivalent
> >> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's
> <what
> >> I said before> + 1" is not an improvement.
> >
> > We could promote "one more entry for unregistered non-EAL threads"
> design pattern (for relevant use cases only!) by extending EAL with one
> more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE
> when _tread_id is set to -1:
> >
> > +++ eal_common_thread.c:
> >    RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> > + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;

Ups... wrong reference! I meant to refer to _lcore_id, not _thread_id. 
Correction:

We could promote "one more entry for unregistered non-EAL threads" design 
pattern (for relevant use cases only!) by extending EAL with one more TLS 
variable, maintained like _lcore_id, but set to RTE_MAX_LCORE when _lcore_id is 
set to LCORE_ID_ANY:

+++ eal_common_thread.c:
  RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
+ RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;

> >
> > and
> >
> > +++ rte_lcore.h:
> > static inline unsigned
> > rte_lcore_id(void)
> > {
> >     return RTE_PER_LCORE(_lcore_id);
> > }
> > + static inline unsigned
> > + rte_lcore_idx(void)
> > + {
> > +   return RTE_PER_LCORE(_lcore_idx);
> > + }
> >
> > That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
> rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
> >
> 
> Wouldn't that effectively give a shared lcore id to all unregistered
> threads?

Yes, just like the rte_lcore_id() is LCORE_ID_ANY (i.e. UINT32_MAX) for all 
unregistered threads; but it will be usable for array indexing, behaving as a 
shadow variable of RTE_PER_LCORE(_lcore_id) for optimizing away the 
"rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE" when indexing.

> 
> We definitely shouldn't further complicate anything related to the DPDK
> threading model, in my opinion.
> 
> If a module needs one or more variable instances that aren't per lcore,
> use regular static allocation instead. I would favor clarity over
> convenience here, at least until we know better (see below as well).
> 
> >>
> >> But useful? Sure.
> >>
> >> I think you may still need other data for dealing with unregistered
> >> threads, for example a mutex or spin lock to deal with concurrency
> >> issues that arises with shared data.
> >
> > Adding the extra entry is only for the benefit of use cases where
> special handling is not required. It will make the code for those use
> cases much cleaner. I think it is useful.
> >
> 
> It will make it shorter, but not less clean, I would argue.
> 
> > Use cases requiring special handling should still do the special
> handling they do today.
> >
> 
> For DPDK modules using lcore variables and which treat unregistered
> threads as "full citizens", I expect special handling of unregistered
> threads to be the norm. Take rte_random.h as an example. Current API
> does not guarantee MT safety for concurrent calls of unregistered
> threads. It probably should, and it should probably be by means of a
> mutex (not spinlock).
> 
> The reason I'm not running off to make a rte_random.c patch is that's
> it's unclear to me what is the role of unregistered threads in DPDK.
> I'm
> reasonably comfortable with a model where there are many threads that
> basically don't interact with the DPDK APIs (except maybe some very
> narrow exposure, like the preemption-safe ring variant). One example of
> such a design would be big slow control plane which uses multi-
> threading
> and the Linux process scheduler for work scheduling, hosted in the same
> process as a DPDK data plane app.
> 
> What I find more strange is a scenario where there are unregistered
> threads which interacts with a wide variety of DPDK APIs, does so
> at-high-rates/with-high-performance-requirements and are expected to be
> preemption-safe. So they are basically EAL threads without a lcore id.

Yes, this is happening in the wild.
E.g. our application has a mode where it uses fewer EAL threads, and processes 
more in non-EAL threads. So to say, the same work is processed either by an EAL 
thread or a non-EAL thread, depending on the application's mode.
The extra array entry would be useful for such use cases.

> 
> Support for that latter scenario has also been voiced, in previous
> discussions, from what I recall.
> 
> I think it's hard to answer the question of a "unregistered thread
> spare" for lcore variables without first knowing what the future should
> look like for unregistered threads in DPDK, in terms of being able to
> call into DPDK APIs, preemption-safety guarantees, etc.
> 
> It seems that until you have a clearer picture of how generally to
> treat
> unregistered threads, you are best off with just a per-lcore id
> instance
> of lcore variables.

I get your point. It also reduces the risk of bugs caused by incorrect use of 
the additional entry.

I am arguing for a different angle: Providing the extra entry will help 
uncovering relevant use cases.

> 
> >>
> >> There may also be cases were you are best off by simply disallowing
> >> unregistered threads from calling into that API.
> >>
> >>> Obviously, this might affect performance.
> >>> If the performance cost is not negligble, the addtional entry (and
> >> indexing branch) could be disabled at build time.
> >>>
> >>>
> >>> * Suggestion: Do not fix the alignment at 16 byte.
> >>> Pass an alignment parameter to rte_lcore_var_alloc() and use
> >> alignof() when calling it:
> >>>
> >>> +#include <stdalign.h>
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC(name)                        \
> >>> + name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)
>       \
> >>> + name = rte_lcore_var_alloc(size, alignment)
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)     \
> >>> + name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> >>> +
> >>> + +++ /cconfig/rte_config.h
> >>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >>>
> >>>
> >>
> >> That seems like a very good idea. I'll look into it.
> >>
> >>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
> but
> >> behaves differently.
> >>>
> >>>> +/**
> >>>> + * Iterate over each lcore id's value for a lcore variable.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_FOREACH(var, name)                                
> >>>> \
> >>>> +        for (unsigned int lcore_id =                                    
> >>>> \
> >>>> +                     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);
>       \
> >>>> +             lcore_id < RTE_MAX_LCORE;
>       \
> >>>> +             lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id,
> name))
> >>>> +
> >>>
> >>> The macro name RTE_LCORE_VAR_FOREACH() resembles
> >> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> >>> You might want to give it a name that differs more.
> >>>
> >>
> >> True.
> >>
> >> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> >> confusion,
> >> for sure.
> >>
> >> Being consistent with <rte_lcore.h> is not so easy, since it's not
> even
> >> consistent with itself. For example, rte_lcore_count() returns the
> >> number of lcores (EAL threads) *plus the number of registered non-
> EAL
> >> threads*, and RTE_LCORE_FOREACH() give a different count. :)
> >
> > Naming is hard. I don't have a good name, and can only offer
> inspiration...
> >
> > <rte_lcore.h> has RTE_LCORE_FOREACH() and its
> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> >
> > Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
> variant.
> >
> >>
> >>> If it wasn't for API breakage, I would suggest renaming
> >> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >>>
> >>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
> >> and adding _PTR to the macro name.
> >>
> >> The "var" name comes from how <sys/queue.h> names things. I think I
> had
> >> it as "ptr" initially. I'll change it back.
> >
> > Thanks.
> >
> >>
> >> Thanks a lot Morten.

RE: [RFC 1/5] eal: add static per-lcore memory allocation facility

Reply via email to