On 2024-02-09 09:25, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
Sent: Thursday, 8 February 2024 19.17
Introduce DPDK per-lcore id variables, or lcore variables for short.
An lcore variable has one value for every current and future lcore
id-equipped thread.
The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which are logically related, but where
there are performance benefits to reap from having updates be local
to an lcore.
Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' lifetime decoupled from that of
the threads.
Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU relies on linker scripts, which effectively prevents
the reuse of its otherwise seemingly viable approach.
The currently prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now spatially close in memory, rather than data used by
the same module, which in turn avoids excessive use of padding and
polluting caches with unused data.
Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
---
This looks very promising. :-)
Here's a bunch of comments, questions and suggestions.
* Question: Performance.
What is the cost of accessing an lcore variable vs a variable in TLS?
I suppose the relative cost diminishes if the variable is a larger struct,
compared to a simple uint64_t.
In case all the relevant data is available in a cache close to the core,
both options carry quite low overhead.
Accessing an lcore variable will always require a TLS lookup, in the form
of retrieving the lcore_id of the current thread. In that sense, there
will likely be a number of extra instructions required to do the lcore
variable address lookup (i.e., doing the load from the rte_lcore_var table
based on the lcore_id you just looked up, and adding the variable's offset).
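To make that concrete, here is roughly what such a lookup amounts to.
This is a conceptual sketch only; the buffer layout and the names
lcore_var_buffer/MAX_LCORE_VAR_SIZE are assumptions for illustration,
not the actual <rte_lcore_var.h> internals:

#include <stddef.h>
#include <rte_lcore.h>

#define MAX_LCORE_VAR_SIZE 1024 /* assumed per-lcore buffer size */

static char lcore_var_buffer[RTE_MAX_LCORE][MAX_LCORE_VAR_SIZE];

static inline void *
lcore_var_lookup(size_t var_offset)
{
	/* TLS access: rte_lcore_id() reads a per-thread variable */
	unsigned int lcore_id = rte_lcore_id();

	/* index the per-lcore buffer, then add the variable's offset */
	return &lcore_var_buffer[lcore_id][var_offset];
}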
A TLS lookup will incur an extra overhead of less than a clock cycle,
compared to accessing a non-TLS static variable, in case static linking
is used. For shared objects, TLS is much more expensive (something often
visible in dynamically linked DPDK app flame graphs, in the form of the
__tls_addr symbol); in that case you need to add ~3 cc/access. This was
measured with a micro benchmark running on an x86_64 Raptor Lake P-core.
(To visualize the difference between the shared object case and not, one
can use Compiler Explorer and compare -fPIC versus -fPIE.)
Things get more complicated if you access the same variable in the same
section of code, since then it can be left on the stack/in a register by
the compiler, especially if LTO is used. In other words, if you do
rte_lcore_id() several times in a row, only the first one will cost you
anything. This happens fairly often in DPDK, with rte_lcore_id().
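For example, something like the following (a hypothetical illustration;
"ctr_a" and "ctr_b" stand in for two lcore variables):

static void
bump_counters(void)
{
	/* both accesses imply an rte_lcore_id() TLS lookup, but within
	 * the same function the compiler can typically do that load
	 * once and keep the result in a register */
	uint64_t *a = RTE_LCORE_VAR_PTR(ctr_a);
	uint64_t *b = RTE_LCORE_VAR_PTR(ctr_b);

	(*a)++;
	(*b)++;
}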
Finally, if you do something like
diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index af9fffd81b..a65c30d27e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
static __rte_always_inline
struct rte_rand_state *__rte_rand_get_state(void)
{
- unsigned int idx;
-
- idx = rte_lcore_id();
-
- if (unlikely(idx == LCORE_ID_ANY))
- return &unregistered_rand_state;
-
- return RTE_LCORE_VAR_PTR(rand_state);
+ return &unregistered_rand_state;
}
uint64_t
...and re-run the rand_perf_autotest, at least I see no difference at
all (in a statically linked build). Both result in rte_rand() using ~11
cc/call. What that suggests is that the TLS overhead is very small, and that
any extra instructions required by lcore variables don't add much, if
anything at all, at least in this particular case.
Some of my suggestions below might also affect performance.
* Advantage: Provides direct access to worker thread variables.
With the current alternative (thread-local storage), the main thread cannot
access the TLS variables of the worker threads,
unless worker threads publish global access pointers.
Lcore variables of any lcore thread can be directly accessed by any thread,
which simplifies code.
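For example, the main thread could aggregate a per-lcore counter along
these lines (a sketch; the "pkt_count" handle is hypothetical):

static uint64_t
total_pkt_count(void)
{
	uint64_t total = 0;
	unsigned int lcore_id;

	/* any thread may read any lcore id's value directly */
	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
		total += *RTE_LCORE_VAR_LCORE_PTR(lcore_id, pkt_count);

	return total;
}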
* Advantage: Roadmap towards hugemem.
It would be nice if the lcore variable memory was allocated in hugemem, to
reduce TLB misses.
The current alternative (thread-local storage) is also not using hugemem, so
this is not a degradation.
I agree, but the thing is it's hard to figure out how much memory is
required for these kinds of variables, given how DPDK is built and
linked. In an OS kernel, you can just take all the symbols, put them in
a special section, and size that section. Such a thing can't easily be
done with DPDK, since shared object builds are supported, and since this
facility should be available not only to DPDK modules, but also to the
application, so relying on linker scripts isn't really feasible (probably
not even for DPDK itself).
In that scenario, you want to size the per-lcore buffer to be so
large that you don't have to worry about overruns. That will waste memory.
If you use huge page memory, paging can't help you to avoid
pre-allocating actual physical memory.
That said, even large (by static per-lcore data standards) buffers are
potentially small enough not to grow the amount of memory used by a DPDK
process too much. You need to provision for RTE_MAX_LCORE of them though.
The values of lcore variables should be small, and thus incur few TLB
misses, so you may not gain much from huge pages. In my world, it's more
about "fitting often-used per-lcore data into L1 or L2 CPU caches",
rather than the easier "fitting often-used per-lcore data into a working
set size reasonably expected to be covered by hardware TLB/caches".
Lcore variables are available very early at startup, so I guess the RTE memory
allocator is not yet available.
Hugemem could be allocated using O/S allocation, so there is a possible road
towards using hugemem.
With the current design, that's true. I'm not sure it's a strict
requirement though, but it does make things simpler.
Either way, using hugemem would require one more indirection (the pointer to
the allocated hugemem).
I don't know which has better performance, using hugemem or avoiding the
additional pointer dereferencing.
* Suggestion: Consider adding an entry for unregistered non-EAL threads.
Please consider making room for one more entry, shared by all unregistered
non-EAL threads, i.e.
making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() <
RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
It would be convenient for the use cases where a variable shared by the
unregistered non-EAL threads doesn't need special treatment.
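For illustration, the indexing could be wrapped like this (the macro
name is hypothetical):

/* one extra slot, at index RTE_MAX_LCORE, shared by all unregistered
 * non-EAL threads, at the cost of one branch per lookup */
#define LCORE_VAR_IDX() \
	(rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE)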
I thought about this, but it would require a conditional in the lookup
macro, as you show. More importantly, it would make the whole
<rte_lcore_var.h> thing less elegant and harder to understand. It's bad
enough that "per-lcore" is actually "per-lcore id" (or the equivalent
"per-EAL thread and unregistered EAL-thread"). Adding a "btw it's <what
I said before> + 1" is not an improvement.
But useful? Sure.
I think you may still need other data for dealing with unregistered
threads, for example a mutex or spin lock to deal with concurrency
issues that arise with shared data.
There may also be cases where you are best off by simply disallowing
unregistered threads from calling into that API.
Obviously, this might affect performance.
If the performance cost is not negligible, the additional entry (and indexing
branch) could be disabled at build time.
* Suggestion: Do not fix the alignment at 16 bytes.
Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when
calling it:
+#include <stdalign.h>
+
+#define RTE_LCORE_VAR_ALLOC(name) \
+ name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
+
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment) \
+ name = rte_lcore_var_alloc(size, alignment)
+
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size) \
+ name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
+
+++ config/rte_config.h
+#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
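For illustration, a module could then get a naturally aligned per-lcore
struct like this (a sketch; the struct and handle names are hypothetical,
and the handle is assumed to be a plain pointer, as in the macros above):

struct foo_state {
	uint64_t packets;
	uint64_t bytes;
};

static struct foo_state *foo_state; /* lcore variable handle */

RTE_INIT(foo_init)
{
	/* sizeof()/alignof() of *foo_state are picked up by the macro */
	RTE_LCORE_VAR_ALLOC(foo_state);
}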
That seems like a very good idea. I'll look into it.
* Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves
differently.
+/**
+ * Iterate over each lcore id's value for a lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH(var, name) \
+ for (unsigned int lcore_id = \
+ (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0); \
+ lcore_id < RTE_MAX_LCORE; \
+ lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which
only iterates on running cores.
You might want to give it a name that differs more.
True.
Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for confusion,
for sure.
Being consistent with <rte_lcore.h> is not so easy, since it's not even
consistent with itself. For example, rte_lcore_count() returns the
number of lcores (EAL threads) *plus the number of registered non-EAL
threads*, and RTE_LCORE_FOREACH() gives a different count. :)
If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH()
instead, but that's not realistic. ;-)
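For what it's worth, typical usage would visit the value of every
possible lcore id, e.g. when aggregating counters (a sketch; the
"counter" handle is hypothetical):

static uint64_t
sum_counters(void)
{
	uint64_t sum = 0;
	uint64_t *value;

	/* unlike RTE_LCORE_FOREACH(), this visits all RTE_MAX_LCORE
	 * values, not only those of running lcores */
	RTE_LCORE_VAR_FOREACH(value, counter)
		sum += *value;

	return sum;
}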
Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding
_PTR to the macro name.
The "var" name comes from how <sys/queue.h> names things. I think I had
it as "ptr" initially. I'll change it back.
Thanks a lot Morten.