On 2024-02-20 12:39, Bruce Richardson wrote:
On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
On 2024-02-20 10:11, Bruce Richardson wrote:
On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is statically allocating
small chunks of often-used data that is logically related, but where
there are performance benefits to be had from keeping updates local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.
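
To illustrate the lifetime difference (a sketch only; the names are made
up, and the plain per-lcore array stands in for what an lcore variable
provides, rather than showing the new API itself):

#include <stdint.h>
#include <rte_lcore.h>

/* C11 TLS: the value's lifetime follows the thread; it is gone when
 * the thread exits, and a control thread cannot easily inspect it. */
static _Thread_local uint64_t tls_pkt_count;

/* Per-lcore-id storage: one slot per possible lcore id, with a
 * lifetime independent of any particular thread bound to that id. */
static uint64_t pkt_count[RTE_MAX_LCORE];

static inline void
count_packet(void)
{
	pkt_count[rte_lcore_id()]++;
}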

<snip>

+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
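+/*
+ * Backing storage for all lcore variables: one RTE_MAX_LCORE_VAR-byte
+ * region per possible lcore id, from which the per-variable offsets
+ * ("handles") are allocated.
+ */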
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+

While I like the idea of improved handling for per-core variables, my main
concern with this set is this definition here, which adds yet another
dependency on the compile-time defined RTE_MAX_LCORE value.


Lcore variables replace one RTE_MAX_LCORE-dependent pattern with another.

You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore
variables, if you look at how many places in the code base this macro is
used. Centralizing per-lcore data management may also provide an opportunity
in the future to extend the API to cope with a more dynamic RTE_MAX_LCORE
variant. Not without ABI breakage, of course, but we are never going to
change anything related to RTE_MAX_LCORE without breaking the ABI, since
this constant is everywhere, including compiled into the application itself.
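
To make that concrete, the pattern being replaced is typically something
hand-rolled per module, along these lines (a sketch; the struct and field
names are made up):

#include <stdint.h>
#include <rte_common.h>

/* Today: each module sizes, pads and indexes its own per-lcore array. */
struct my_module_state {
	uint64_t calls;
	uint64_t errors;
} __rte_cache_aligned;

static struct my_module_state state[RTE_MAX_LCORE];

With lcore variables, the RTE_MAX_LCORE sizing and the false-sharing
padding live in one place, instead of being repeated in every such module.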


Yep, that is true if it's widely used.

I believe we already have an issue with this #define where it's impossible
to come up with a single value that works for all, or nearly all, cases. The
current default is still 128, yet DPDK needs to support systems where the
number of cores is well into the hundreds, requiring core-mapping
workarounds or customized builds of DPDK. Upping the value fixes those
issues at the cost of a memory footprint explosion on smaller systems.


I agree this is an issue.

RTE_MAX_LCORE also needs to be sized to accommodate not only all cores
used, but the sum of all EAL threads and registered non-EAL threads.

So there is no reliable way to derive a suitable RTE_MAX_LCORE from a
particular piece of hardware, since the actual number of lcore ids needed is
up to the application.
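
For example (a sketch; error handling elided), a registered non-EAL thread
consumes an lcore id just like an EAL worker does, so the number of ids
needed depends on application behaviour, not only on the core count:

#include <rte_lcore.h>

/* Called from some application-created (non-EAL) thread. */
static int
helper_thread_init(void)
{
	if (rte_thread_register() != 0)
		return -1; /* e.g., no free lcore id left */

	/* From here on, rte_lcore_id() returns a valid id for this thread. */
	return 0;
}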

Why is the default set so low? Linux has NR_CPUS, which serves the same
purpose and is set to 4096 by default, if I recall correctly. Shouldn't we
at least be able to increase it to 256?

The default is so low because of the mempool caches. These are an array of
buffer pointers with 512 (IIRC) entries per core up to RTE_MAX_LCORE.
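
If those numbers are right, that is roughly 512 pointers x 8 bytes = 4 KB of
cache per lcore per mempool, i.e. ~512 KB per mempool at the default
RTE_MAX_LCORE of 128, and ~4 MB per mempool if the default were raised to
1024, before counting any per-cache bookkeeping.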


I'm therefore nervous about adding more dependencies on this value, when I
feel we should be moving away from its use, to allow more runtime
configurability of core counts.


What more specifically do you have in mind?


I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
what I would like to see is a runtime-specified value. For example, you
could run DPDK with the EAL parameter "--max-lcores=1024" for large systems
or "--max-lcores=32" for small ones. That would then be used at init time to
scale all internal data structures appropriately.


Sounds reasonable to me, especially if you take a gradual approach.

By gradual I mean something like adding a function rte_lcore_max_possible() (or similar), returning the EAL init-specified value. DPDK libraries/PMDs could then gradually be made aware of it and take advantage of knowing that lcore ids will always be below a certain threshold, usually significantly lower than RTE_MAX_LCORE.

The only change required for lcore variables would be that the FOREACH macro would use the run-time-max value, rather than RTE_MAX_LCORE, which in turn would leave all the higher-numbered lcore id buffers untouched/unmapped.
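
Roughly along these lines, say (a sketch; rte_lcore_max_possible() is the
hypothetical function suggested above, and the loop only illustrates the
changed bound, not the actual FOREACH macro in the patch):

#include <stdlib.h>
#include <stdint.h>
#include <rte_lcore.h>

/* Hypothetical: returns the EAL init-specified maximum lcore id count. */
unsigned int rte_lcore_max_possible(void);

static uint64_t *
alloc_per_lcore_counters(void)
{
	/* Size from the run-time maximum instead of RTE_MAX_LCORE. */
	return calloc(rte_lcore_max_possible(), sizeof(uint64_t));
}

static uint64_t
sum_counters(const uint64_t *counters)
{
	uint64_t sum = 0;
	unsigned int lcore_id;

	/* Iterate only over lcore ids that can actually occur. */
	for (lcore_id = 0; lcore_id < rte_lcore_max_possible(); lcore_id++)
		sum += counters[lcore_id];

	return sum;
}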

The set of possible lcore ids could also be expressed as a bitset, if you have a machine with a huge number of cores running many small DPDK instances.

/Bruce

<snip for brevity>
