On 2024-02-20 10:11, Bruce Richardson wrote:
On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decoupling the values' life time with that of the
threads.

Lcore variables are also similar in terms of functionality provided by
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore now is close (spatially, in memory), rather than data used by
the same module, which in turn avoid excessive use of padding,
polluting caches with unused data.

RFC v3:
  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
  * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
  * Use alignof to derive alignment requirements. (Morten Brørup)
  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
  * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
---
  config/rte_config.h                   |   1 +
  doc/api/doxy-api-index.md             |   1 +
  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
  lib/eal/common/meson.build            |   1 +
  lib/eal/include/meson.build           |   1 +
  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
  lib/eal/version.map                   |   4 +
  7 files changed, 465 insertions(+)
  create mode 100644 lib/eal/common/eal_common_lcore_var.c
  create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
  /* EAL defines */
  #define RTE_CACHE_GUARD_LINES 1
  #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
  #define RTE_MAX_MEMSEG_LISTS 128
  #define RTE_MAX_MEMSEG_PER_LIST 8192
  #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
    [interrupts](@ref rte_interrupts.h),
    [launch](@ref rte_launch.h),
    [lcore](@ref rte_lcore.h),
+  [lcore-varible](@ref rte_lcore_var.h),
    [per-lcore](@ref rte_per_lcore.h),
    [service cores](@ref rte_service.h),
    [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c 
b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..dfd11cbd0b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+

While I like the idea of improved handling for per-core variables, my main
concern with this set is this definition here, which adds yet another
dependency on the compile-time defined RTE_MAX_LCORE value.


lcore variables replaces one RTE_MAX_LCORE-dependent pattern with another.

You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore variables, if you look at where/in how many places in the code base this macro is being used. Centralizing per-lcore data management may also provide some opportunity in the future for extending the API to cope with some more dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course, but we are not ever going to change anything related to RTE_MAX_LCORE without breaking the ABI, since this constant is everywhere, including compiled into the application itself.

I believe we already have an issue with this #define where it's impossible
to come up with a single value that works for all, or nearly all cases. The
current default is still 128, yet DPDK needs to support systems where the
number of cores is well into the hundreds, requiring workarounds of core
mappings or customized builds of DPDK. Upping the value fixes those issues
at the cost to memory footprint explosion for smaller systems.


I agree this is an issue.

RTE_MAX_LCORE also need to be sized to accommodate not only all cores used, but the sum of all EAL threads and registered non-EAL threads.

So, there is no reliable way to discover what RTE_MAX_LCORE is on a particular piece of hardware, since the actual number of lcore ids needed is up to the application.

Why is the default set so low? Linux has MAX_CPUS, which serves the same purpose, which is set to 4096 by default, if I recall correctly. Shouldn't we at least be able to increase it to 256?

I'm therefore nervous about putting more dependencies on this value, when I
feel we should be moving away from its use, to allow more runtime
configurability of cores.


What more specifically do you have in mind?

Maybe I'm overly pessimistic, but supporting lcores without any upper bound and also allowing them to be added and removed at any point during run time seems far-fetched, given where DPDK is today.

To include an actual upper bound, set during DPDK run-time initialization, lower than RTE_MAX_LCORE, seems easier. I think there is some equivalent in the Linux kernel. Again, you would need to accommodate for future rte_register_thread() calls.

<rte_lcore_var.h> could be extended with a user-specified lcore variable init/free function callbacks, to allow lazy or late initialization.

If one could have a way to retrieve the max possible lcore ids *for a particular DPDK process* (as opposed to a particular build) it would be possible to avoid touching the per-lcore buffers for lcore ids that would never be used. With data in BSS, it would never be mapped/allocated.

An issue with BSS data is that there might be very RT-sensitive applications deciding to lock all memory into RAM, to avoid latency jitter caused by paging, and such would suffer from a large rte_lcore_var (or all the current static arrays). Lcore variables makes this worse, since rte_lcore_var is larger than the sum of today's static arrays, and must be so, with some margin, since there is no way to figure out ahead of time how much memory is actually going to be needed.

For this set/feature, would it be possible to have a run-time allocated
(and sized) array rather than using the RTE_MAX_LCORE value?


What I explored was having the per-lcore buffers dynamically allocated. What I ran into was I saw no apparent benefit, and with dynamic allocation there were new problems to solve. One was to assure lcore variable buffers were allocated before they were being used. In particular if you want to use huge page memory, lcore variables may be available only when that machinery is ready to accept requests.

Also, with huge page memory, you won't get the benefit you will get from depend paging and BSS (i.e., only used memory is actually allocated).

With malloc(), I believe you generally do get that same benefit, if you allocation is sufficiently large.

I also considered just allocating chunks, fitting (say) 64 kB worth of lcore variables in each. Turned out more complex, and to no benefit, other than reducing footprint for mlockall() type apps, which seemed like corner case.

I never considered no upper-bound, dynamic, RTE_MAX_LCORE.

Thanks, (and apologies for the mini-rant!)

/Bruce

Thanks for the comments. This is was no way near a rant.

Reply via email to