> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
> Sent: Thursday, 8 February 2024 19.17
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> ---

This looks very promising. :-)

Here's a bunch of comments, questions and suggestions.


* Question: Performance.
What is the cost of accessing an lcore variable vs a variable in TLS?
I suppose the relative cost diminishes if the variable is a larger struct, 
compared to a simple uint64_t.

Some of my suggestions below might also affect performance.


* Advantage: Provides direct access to worker thread variables.
With the current alternative (thread-local storage), the main thread cannot 
access the TLS variables of the worker threads,
unless worker threads publish global access pointers.
Lcore variables of any lcore thread can be direcctly accessed by any thread, 
which simplifies code.


* Advantage: Roadmap towards hugemem.
It would be nice if the lcore variable memory was allocated in hugemem, to 
reduce TLB misses.
The current alternative (thread-local storage) is also not using hugement, so 
not a degradation.

Lcore variables are available very early at startup, so I guess the RTE memory 
allocator is not yet available.
Hugemem could be allocated using O/S allocation, so there is a possible road 
towards using hugemem.

Either way, using hugement would require one more indirection (the pointer to 
the allocated hugemem).
I don't know which has better performance, using hugemem or avoiding the 
additional pointer dereferencing.


* Suggestion: Consider adding an entry for unregistered non-EAL threads.
Please consider making room for one more entry, shared by all unregistered 
non-EAL threads, i.e.
making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < 
RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).

It would be convenient for the use cases where a variable shared by the 
unregistered non-EAL threads don't need special treatment.

Obviously, this might affect performance.
If the performance cost is not negligble, the addtional entry (and indexing 
branch) could be disabled at build time.


* Suggestion: Do not fix the alignment at 16 byte.
Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when 
calling it:

+#include <stdalign.h>
+
+#define RTE_LCORE_VAR_ALLOC(name)                      \
+       name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
+
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)  \
+       name = rte_lcore_var_alloc(size, alignment)
+
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)   \
+       name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
+
+ +++ /cconfig/rte_config.h
+#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16


* Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves 
differently.

> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH(var, name)                             \
> +     for (unsigned int lcore_id =                                    \
> +                  (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);   \
> +          lcore_id < RTE_MAX_LCORE;                                  \
> +          lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +

The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which 
only iterates on running cores.
You might want to give it a name that differs more.

If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() 
instead, but that's not realistic. ;-)

Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding 
_PTR to the macro name.

Reply via email to