> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 19 February 2024 08.49
>
> On 2024-02-09 14:04, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Friday, 9 February 2024 12.46
> >>
> >> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
> >>>> Sent: Thursday, 8 February 2024 19.17
> >>>>
> >>>> Introduce DPDK per-lcore id variables, or lcore variables for
> >>>> short.
> >>>>
> >>>> An lcore variable has one value for every current and future
> >>>> lcore id-equipped thread.
> >>>>
> >>>> The primary <rte_lcore_var.h> use case is for statically
> >>>> allocating small chunks of often-used data, which is logically
> >>>> related, but where there are performance benefits to reap from
> >>>> keeping updates local to an lcore.
> >>>>
> >>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
> >>>> C11 _Thread_local), but decouple the values' lifetime from that
> >>>> of the threads.
> >>>>
> >>>> Lcore variables are also similar in functionality to the FreeBSD
> >>>> kernel's DPCPU_*() family of macros and the associated build-time
> >>>> machinery. DPCPU uses linker scripts, which effectively prevents
> >>>> the reuse of its, otherwise seemingly viable, approach.
> >>>>
> >>>> The currently-prevailing way to solve the same problem as lcore
> >>>> variables is to keep a module's per-lcore data as an
> >>>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
> >>>> structs. The benefit of lcore variables over this approach is
> >>>> that data related to the same lcore is now close (spatially, in
> >>>> memory), rather than data used by the same module, which in turn
> >>>> avoids excessive use of padding and polluting caches with unused
> >>>> data.
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>> ---
> >>>
> >>> This looks very promising. :-)
> >>>
> >>> Here's a bunch of comments, questions and suggestions.
> >>>
> >>>
> >>> * Question: Performance.
> >>> What is the cost of accessing an lcore variable vs a variable in
> >>> TLS?
> >>> I suppose the relative cost diminishes if the variable is a larger
> >>> struct, compared to a simple uint64_t.
> >>>
> >>
> >> In case all the relevant data is available in a cache close to the
> >> core, both options carry quite low overhead.
> >>
> >> Accessing an lcore variable will always require a TLS lookup, in
> >> the form of retrieving the lcore_id of the current thread. In that
> >> sense, there will likely be a number of extra instructions required
> >> to do the lcore variable address lookup (i.e., doing the load from
> >> the rte_lcore_var table based on the lcore_id you just looked up,
> >> and adding the variable's offset).
> >>
> >> A TLS lookup will incur an extra overhead of less than a clock
> >> cycle, compared to accessing a non-TLS static variable, in case
> >> static linking is used. For shared objects, TLS is much more
> >> expensive (something often visible in dynamically linked DPDK app
> >> flame graphs, in the form of the __tls_get_addr symbol). Then you
> >> need to add ~3 cc/access. This is on a micro benchmark running on
> >> an x86_64 Raptor Lake P-core.
> >>
> >> (To visualize the difference between shared object and not, one can
> >> use Compiler Explorer and -fPIC versus -fPIE.)
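> >>
> >> A minimal snippet for that comparison could look like the below (a
> >> made-up example, not from the patch set):
> >>
> >> static _Thread_local unsigned long counter;
> >>
> >> unsigned long
> >> bump(void)
> >> {
> >> 	return ++counter;
> >> }
> >>
> >> Built with -O2 -fPIE, the access typically compiles to a single
> >> %fs-relative move (initial-exec/local-exec TLS model); with -O2
> >> -fPIC, the compiler may instead emit a __tls_get_addr call.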
> >>
> >> Things get more complicated if you access the same variable in the
> >> same section of code, since then it can be left on the stack/in a
> >> register by the compiler, especially if LTO is used. In other
> >> words, if you do rte_lcore_id() several times in a row, only the
> >> first one will cost you anything. This happens fairly often in
> >> DPDK, with rte_lcore_id().
> >>
> >> Finally, if you do something like
> >>
> >> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
> >> index af9fffd81b..a65c30d27e 100644
> >> --- a/lib/eal/common/rte_random.c
> >> +++ b/lib/eal/common/rte_random.c
> >> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
> >>  static __rte_always_inline
> >>  struct rte_rand_state *__rte_rand_get_state(void)
> >>  {
> >> -	unsigned int idx;
> >> -
> >> -	idx = rte_lcore_id();
> >> -
> >> -	if (unlikely(idx == LCORE_ID_ANY))
> >> -		return &unregistered_rand_state;
> >> -
> >> -	return RTE_LCORE_VAR_PTR(rand_state);
> >> +	return &unregistered_rand_state;
> >>  }
> >>
> >>  uint64_t
> >>
> >> ...and re-run the rand_perf_autotest, at least I see no difference
> >> at all (in a statically linked build). Both result in rte_rand()
> >> using ~11 cc/call. What that suggests is that the TLS overhead is
> >> very small, and that any extra instructions required by lcore
> >> variables don't add much, if anything at all, at least in this
> >> particular case.
> >
> > Excellent. Thank you for a thorough and detailed answer, Mattias.
> >
> >>
> >>> Some of my suggestions below might also affect performance.
> >>>
> >>>
> >>> * Advantage: Provides direct access to worker thread variables.
> >>> With the current alternative (thread-local storage), the main
> >>> thread cannot access the TLS variables of the worker threads,
> >>> unless worker threads publish global access pointers.
> >>> Lcore variables of any lcore thread can be directly accessed by
> >>> any thread, which simplifies code.
> >>>
> >>>
> >>> * Advantage: Roadmap towards hugemem.
> >>> It would be nice if the lcore variable memory was allocated in
> >>> hugemem, to reduce TLB misses.
> >>> The current alternative (thread-local storage) is also not using
> >>> hugemem, so this is not a degradation.
> >>>
> >>
> >> I agree, but the thing is that it's hard to figure out how much
> >> memory is required for these kinds of variables, given how DPDK is
> >> built and linked. In an OS kernel, you can just take all the
> >> symbols, put them in a special section, and size that section. Such
> >> a thing can't easily be done with DPDK, since shared object builds
> >> are supported, plus this facility should be available not only to
> >> DPDK modules, but also to the application, so relying on linker
> >> scripts isn't really feasible (probably not even feasible for DPDK
> >> itself).
> >>
> >> In that scenario, you want to size up the per-lcore buffer to be so
> >> large that you don't have to worry about overruns. That will waste
> >> memory. If you use huge page memory, paging can't help you avoid
> >> pre-allocating actual physical memory.
> >
> > Good point.
> > I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
> > but I hadn't considered how paging helps us use less physical memory
> > than that.
> >
> >>
> >> That said, even large (by static per-lcore data standards) buffers
> >> are potentially small enough not to grow the amount of memory used
> >> by a DPDK process too much. You need to provision for RTE_MAX_LCORE
> >> of them though.
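> >>
> >> (Back-of-the-envelope, assuming the 1 MB per-lcore buffer from this
> >> RFC and a typical RTE_MAX_LCORE of 128: 128 * 1 MB = 128 MB of
> >> virtual address space. With 4 kB demand paging, only the pages
> >> actually touched consume physical memory, while with pre-allocated
> >> 2 MB huge pages the whole 128 MB would effectively be resident up
> >> front.)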
> >>
> >> The value of lcore variables should be small, and thus incur few
> >> TLB misses, so you may not gain much from huge pages. In my world,
> >> it's more about "fitting often-used per-lcore data into L1 or L2
> >> CPU caches", rather than the easier "fitting often-used per-lcore
> >> data into a working set size reasonably expected to be covered by
> >> hardware TLB/caches".
> >
> > Yes, I suppose that lcore variables are intended to be small, and
> > large per-lcore structures should keep following the current design
> > patterns for allocation and access.
> >
> > It seems to me that per-lcore heaps should be the solution for use
> > cases requiring many, larger, and/or dynamic objects on a per-lcore
> > basis.
>
> Ideally, you would design both that mechanism and lcore variables
> together, but if you couple enough improvements together, you will
> never get anywhere. An instance of where perfect is the enemy of
> good, perhaps.
So true. :-)

> > Perhaps this guideline is worth mentioning in the documentation.
> >
>
> What is missing, more specifically? The size limitation and the
> static nature of lcore variables are described, and which current
> design patterns they are expected to (partly) replace is also
> covered.

Your documentation is fine, and nothing specific is missing here. I
was thinking out loud that the high-level DPDK documentation should
describe common design patterns.

>
> >>
> >>> Lcore variables are available very early at startup, so I guess
> >>> the RTE memory allocator is not yet available.
> >>> Hugemem could be allocated using O/S allocation, so there is a
> >>> possible road towards using hugemem.
> >>>
> >>
> >> With the current design, that's true. I'm not sure it's a strict
> >> requirement though, but it does make things simpler.
> >>
> >>> Either way, using hugemem would require one more indirection (the
> >>> pointer to the allocated hugemem).
> >>> I don't know which has better performance, using hugemem or
> >>> avoiding the additional pointer dereferencing.
> >>>
> >>>
> >>> * Suggestion: Consider adding an entry for unregistered non-EAL
> >>> threads.
> >>> Please consider making room for one more entry, shared by all
> >>> unregistered non-EAL threads, i.e. making the array size
> >>> RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE
> >>> ? rte_lcore_id() : RTE_MAX_LCORE).
> >>>
> >>> It would be convenient for the use cases where a variable shared
> >>> by the unregistered non-EAL threads doesn't need special
> >>> treatment.
> >>>
> >>
> >> I thought about this, but it would require a conditional in the
> >> lookup macro, as you show. More importantly, it would make the
> >> whole <rte_lcore_var.h> thing less elegant and harder to
> >> understand. It's bad enough that "per-lcore" is actually "per-lcore
> >> id" (or the equivalent "per-EAL thread and registered non-EAL
> >> thread"). Adding a "btw it's <what I said before> + 1" is not an
> >> improvement.
> >
> > We could promote the "one more entry for unregistered non-EAL
> > threads" design pattern (for relevant use cases only!) by extending
> > EAL with one more TLS variable, maintained like _thread_id, but set
> > to RTE_MAX_LCORE when _thread_id is set to -1:
> >
> > +++ eal_common_thread.c:
> > RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> > + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;

Oops... wrong reference! I meant to refer to _lcore_id, not
_thread_id. Correction:

We could promote the "one more entry for unregistered non-EAL
threads" design pattern (for relevant use cases only!) by extending
EAL with one more TLS variable, maintained like _lcore_id, but set to
RTE_MAX_LCORE when _lcore_id is set to LCORE_ID_ANY:

+++ eal_common_thread.c:
RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
+ RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;

> >
> > and
> >
> > +++ rte_lcore.h:
> > static inline unsigned
> > rte_lcore_id(void)
> > {
> > 	return RTE_PER_LCORE(_lcore_id);
> > }
> > + static inline unsigned
> > + rte_lcore_idx(void)
> > + {
> > + 	return RTE_PER_LCORE(_lcore_idx);
> > + }
> >
> > That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
> > rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently
> > used.
> >
>
> Wouldn't that effectively give a shared lcore id to all unregistered
> threads?

Yes, just like the rte_lcore_id() is LCORE_ID_ANY (i.e. UINT32_MAX)
for all unregistered threads; but it will be usable for array
indexing, behaving as a shadow variable of RTE_PER_LCORE(_lcore_id),
optimizing away the "rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id()
: RTE_MAX_LCORE" conditional when indexing.
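To illustrate (a sketch only; rte_lcore_idx() is the proposed
addition, and the packet counter is a made-up example):

#include <stdint.h>

#include <rte_lcore.h>

/* One entry per lcore id, plus a last entry shared by all
 * unregistered non-EAL threads. */
static uint64_t pkt_count[RTE_MAX_LCORE + 1];

static inline void
count_pkt(void)
{
	/* No LCORE_ID_ANY branch needed: unregistered threads index
	 * the shared entry at RTE_MAX_LCORE. Concurrent updates from
	 * multiple unregistered threads would still need atomics or
	 * locking. */
	pkt_count[rte_lcore_idx()]++;
}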
>
> We definitely shouldn't further complicate anything related to the
> DPDK threading model, in my opinion.
>
> If a module needs one or more variable instances that aren't per
> lcore, use regular static allocation instead. I would favor clarity
> over convenience here, at least until we know better (see below as
> well).
>
> >>
> >> But useful? Sure.
> >>
> >> I think you may still need other data for dealing with
> >> unregistered threads, for example a mutex or spin lock to deal
> >> with concurrency issues that arise with shared data.
> >
> > Adding the extra entry is only for the benefit of use cases where
> > special handling is not required. It will make the code for those
> > use cases much cleaner. I think it is useful.
> >
>
> It will make it shorter, but not less clean, I would argue.
>
> > Use cases requiring special handling should still do the special
> > handling they do today.
> >
>
> For DPDK modules using lcore variables and treating unregistered
> threads as "full citizens", I expect special handling of
> unregistered threads to be the norm. Take rte_random.h as an
> example. The current API does not guarantee MT safety for concurrent
> calls from unregistered threads. It probably should, and it should
> probably be by means of a mutex (not a spinlock).
>
> The reason I'm not running off to make an rte_random.c patch is that
> it's unclear to me what the role of unregistered threads in DPDK is.
> I'm reasonably comfortable with a model where there are many threads
> that basically don't interact with the DPDK APIs (except maybe some
> very narrow exposure, like the preemption-safe ring variant). One
> example of such a design would be a big, slow control plane which
> uses multi-threading and the Linux process scheduler for work
> scheduling, hosted in the same process as a DPDK data plane app.
>
> What I find more strange is a scenario where there are unregistered
> threads which interact with a wide variety of DPDK APIs, do so at
> high rates/with high performance requirements, and are expected to
> be preemption-safe. So they are basically EAL threads without an
> lcore id.

Yes, this is happening in the wild. E.g. our application has a mode
where it uses fewer EAL threads, and processes more in non-EAL
threads. That is, the same work is processed either by an EAL thread
or a non-EAL thread, depending on the application's mode. The extra
array entry would be useful for such use cases.

>
> Support for that latter scenario has also been voiced, in previous
> discussions, from what I recall.
>
> I think it's hard to answer the question of an "unregistered thread
> spare" for lcore variables without first knowing what the future
> should look like for unregistered threads in DPDK, in terms of being
> able to call into DPDK APIs, preemption-safety guarantees, etc.
>
> It seems that until you have a clearer picture of how to generally
> treat unregistered threads, you are best off with just a per-lcore
> id instance of lcore variables.

I get your point. It also reduces the risk of bugs caused by
incorrect use of the additional entry.

I am arguing for a different angle: Providing the extra entry will
help uncover relevant use cases.
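For reference, the special handling Mattias describes would look
something like the following without the extra entry (a sketch with
made-up "foo" names; the mutex reflects his MT safety suggestion, and
foo_states is assumed to be set up with RTE_LCORE_VAR_ALLOC() at
init):

#include <stdint.h>
#include <pthread.h>

#include <rte_branch_prediction.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>

struct foo_state {
	uint64_t count;
};

static struct foo_state *foo_states;     /* lcore variable handle */
static struct foo_state unreg_foo_state; /* shared, unregistered threads */
static pthread_mutex_t unreg_foo_lock = PTHREAD_MUTEX_INITIALIZER;

void
foo_record(void)
{
	if (unlikely(rte_lcore_id() == LCORE_ID_ANY)) {
		/* Unregistered non-EAL threads share one instance and
		 * may be preempted, so serialize with a mutex. */
		pthread_mutex_lock(&unreg_foo_lock);
		unreg_foo_state.count++;
		pthread_mutex_unlock(&unreg_foo_lock);
		return;
	}

	/* Lcore id-equipped threads each have their own instance. */
	RTE_LCORE_VAR_PTR(foo_states)->count++;
}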
>
> >>
> >> There may also be cases where you are best off by simply
> >> disallowing unregistered threads from calling into that API.
> >>
> >>> Obviously, this might affect performance.
> >>> If the performance cost is not negligible, the additional entry
> >>> (and indexing branch) could be disabled at build time.
> >>>
> >>>
> >>> * Suggestion: Do not fix the alignment at 16 bytes.
> >>> Pass an alignment parameter to rte_lcore_var_alloc() and use
> >>> alignof() when calling it:
> >>>
> >>> +#include <stdalign.h>
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC(name) \
> >>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment) \
> >>> +	name = rte_lcore_var_alloc(size, alignment)
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size) \
> >>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> >>> +
> >>> +++ /config/rte_config.h
> >>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >>>
> >>
> >> That seems like a very good idea. I'll look into it.
> >>
> >>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
> >>> but behaves differently.
> >>>
> >>>> +/**
> >>>> + * Iterate over each lcore id's value for a lcore variable.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> >>>> +	for (unsigned int lcore_id =					\
> >>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> >>>> +	     lcore_id < RTE_MAX_LCORE;					\
> >>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> >>>> +
> >>>
> >>> The macro name RTE_LCORE_VAR_FOREACH() resembles
> >>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> >>> You might want to give it a name that differs more.
> >>>
> >>
> >> True.
> >>
> >> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> >> confusion, for sure.
> >>
> >> Being consistent with <rte_lcore.h> is not so easy, since it's not
> >> even consistent with itself. For example, rte_lcore_count() returns
> >> the number of lcores (EAL threads) *plus the number of registered
> >> non-EAL threads*, and RTE_LCORE_FOREACH() gives a different
> >> count. :)
> >
> > Naming is hard. I don't have a good name, and can only offer
> > inspiration...
> >
> > <rte_lcore.h> has RTE_LCORE_FOREACH() and its
> > RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> >
> > Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate
> > a variant.
> >
> >>
> >>> If it wasn't for API breakage, I would suggest renaming
> >>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >>>
> >>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
> >>> and adding _PTR to the macro name.
> >>
> >> The "var" name comes from how <sys/queue.h> names things. I think
> >> I had it as "ptr" initially. I'll change it back.
> >
> > Thanks.
> >
> >>
> >> Thanks a lot Morten.