On 2024/9/12 13:35, Mattias Rönnblom wrote:
> On 2024-09-12 04:33, fengchengwen wrote:
>> On 2024/9/12 1:04, Mattias Rönnblom wrote:
>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>
>>> An lcore variable has one value for every current and future lcore
>>> id-equipped thread.
>>>
>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>> small, frequently-accessed data structures, for which one instance
>>> should exist for each lcore.
>>>
>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>> _Thread_local), but decoupling the values' life time with that of the
>>> threads.
>>>
>>> Lcore variables are also similar in terms of functionality provided by
>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>
>>> The currently-prevailing way to solve the same problem as lcore
>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>> lcore variables over this approach is that data related to the same
>>> lcore now is close (spatially, in memory), rather than data used by
>>> the same module, which in turn avoid excessive use of padding,
>>> polluting caches with unused data.
>>>
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>>> Acked-by: Morten Brørup <m...@smartsharesystems.com>
>>>
>>> -- 
>>>
>>> PATCH v2:
>>>   * Add Windows support. (Morten Brørup)
>>>   * Fix lcore variables API index reference. (Morten Brørup)
>>>   * Various improvements of the API documentation. (Morten Brørup)
>>>   * Elimination of unused symbol in version.map. (Morten Brørup)
>>
>> these history could move to the cover letter.
>>
>>>
>>> PATCH:
>>>   * Update MAINTAINERS and release notes.
>>>   * Stop covering included files in extern "C" {}.
>>>
>>> RFC v6:
>>>   * Include <stdlib.h> to get aligned_alloc().
>>>   * Tweak documentation (grammar).
>>>   * Provide API-level guarantees that lcore variable values take on an
>>>     initial value of zero.
>>>   * Fix misplaced __rte_cache_aligned in the API doc example.
>>>
>>> RFC v5:
>>>   * In Doxygen, consistenly use @<cmd> (and not \<cmd>).
>>>   * The RTE_LCORE_VAR_GET() and SET() convience access macros
>>>     covered an uncommon use case, where the lcore value is of a
>>>     primitive type, rather than a struct, and is thus eliminated
>>>     from the API. (Morten Brørup)
>>>   * In the wake up GET()/SET() removeal, rename RTE_LCORE_VAR_PTR()
>>>     RTE_LCORE_VAR_VALUE().
>>>   * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>>>     signal that this function is a part of the public API.
>>>   * Macro arguments are documented.
>>>
>>> RFV v4:
>>>   * Replace large static array with libc heap-allocated memory. One
>>>     implication of this change is there no longer exists a fixed upper
>>>     bound for the total amount of memory used by lcore variables.
>>>     RTE_MAX_LCORE_VAR has changed meaning, and now represent the
>>>     maximum size of any individual lcore variable value.
>>>   * Fix issues in example. (Morten Brørup)
>>>   * Improve access macro type checking. (Morten Brørup)
>>>   * Refer to the lcore variable handle as "handle" and not "name" in
>>>     various macros.
>>>   * Document lack of thread safety in rte_lcore_var_alloc().
>>>   * Provide API-level assurance the lcore variable handle is
>>>     always non-NULL, to all applications to use NULL to mean
>>>     "not yet allocated".
>>>   * Note zero-sized allocations are not allowed.
>>>   * Give API-level guarantee the lcore variable values are zeroed.
>>>
>>> RFC v3:
>>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>>
>>> RFC v2:
>>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>>   * Allow user-specified alignment, but limit max to cache line size.
>>> ---
>>>   MAINTAINERS                            |   6 +
>>>   config/rte_config.h                    |   1 +
>>>   doc/api/doxy-api-index.md              |   1 +
>>>   doc/guides/rel_notes/release_24_11.rst |  14 +
>>>   lib/eal/common/eal_common_lcore_var.c  |  78 +++++
>>>   lib/eal/common/meson.build             |   1 +
>>>   lib/eal/include/meson.build            |   1 +
>>>   lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
>>>   lib/eal/version.map                    |   2 +
>>>   9 files changed, 489 insertions(+)
>>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index c5a703b5c0..362d9a3f28 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
>>>   F: lib/eal/common/rte_random.c
>>>   F: app/test/test_rand_perf.c
>>>   +Lcore Variables
>>> +M: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>>> +F: lib/eal/include/rte_lcore_var.h
>>> +F: lib/eal/common/eal_common_lcore_var.c
>>> +F: app/test/test_lcore_var.c
>>> +
>>>   ARM v7
>>>   M: Wathsala Vithanage <wathsala.vithan...@arm.com>
>>>   F: config/arm/
>>> diff --git a/config/rte_config.h b/config/rte_config.h
>>> index dd7bb0d35b..311692e498 100644
>>> --- a/config/rte_config.h
>>> +++ b/config/rte_config.h
>>> @@ -41,6 +41,7 @@
>>>   /* EAL defines */
>>>   #define RTE_CACHE_GUARD_LINES 1
>>>   #define RTE_MAX_HEAPS 32
>>> +#define RTE_MAX_LCORE_VAR 1048576
>>>   #define RTE_MAX_MEMSEG_LISTS 128
>>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>>> index f9f0300126..ed577f14ee 100644
>>> --- a/doc/api/doxy-api-index.md
>>> +++ b/doc/api/doxy-api-index.md
>>> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>>>     [interrupts](@ref rte_interrupts.h),
>>>     [launch](@ref rte_launch.h),
>>>     [lcore](@ref rte_lcore.h),
>>> +  [lcore variables](@ref rte_lcore_var.h),
>>>     [per-lcore](@ref rte_per_lcore.h),
>>>     [service cores](@ref rte_service.h),
>>>     [keepalive](@ref rte_keepalive.h),
>>> diff --git a/doc/guides/rel_notes/release_24_11.rst 
>>> b/doc/guides/rel_notes/release_24_11.rst
>>> index 0ff70d9057..a3884f7491 100644
>>> --- a/doc/guides/rel_notes/release_24_11.rst
>>> +++ b/doc/guides/rel_notes/release_24_11.rst
>>> @@ -55,6 +55,20 @@ New Features
>>>        Also, make sure to start the actual text at the margin.
>>>        =======================================================
>>>   +* **Added EAL per-lcore static memory allocation facility.**
>>> +
>>> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
>>> +    frequently-accessed data structures, for which one instance should
>>> +    exist for each EAL thread and registered non-EAL thread.
>>> +
>>> +    With lcore variables, data is organized spatially on a per-lcore id
>>> +    basis, rather than per library or PMD, avoiding the need for cache
>>> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
>>> +    reduces CPU cache internal fragmentation, improving performance.
>>> +
>>> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
>>> +    C11 _Thread_local), but decoupling the values' life time from that
>>> +    of the threads.
>>>     Removed Items
>>>   -------------
>>> diff --git a/lib/eal/common/eal_common_lcore_var.c 
>>> b/lib/eal/common/eal_common_lcore_var.c
>>> new file mode 100644
>>> index 0000000000..309822039b
>>> --- /dev/null
>>> +++ b/lib/eal/common/eal_common_lcore_var.c
>>> @@ -0,0 +1,78 @@
>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>> + * Copyright(c) 2024 Ericsson AB
>>> + */
>>> +
>>> +#include <inttypes.h>
>>> +#include <stdlib.h>
>>> +
>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>> +#include <malloc.h>
>>> +#endif
>>> +
>>> +#include <rte_common.h>
>>> +#include <rte_debug.h>
>>> +#include <rte_log.h>
>>> +
>>> +#include <rte_lcore_var.h>
>>> +
>>> +#include "eal_private.h"
>>> +
>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>> +
>>> +static void *lcore_buffer;
>>> +static size_t offset = RTE_MAX_LCORE_VAR;
>>> +
>>> +static void *
>>> +lcore_var_alloc(size_t size, size_t align)
>>> +{
>>> +    void *handle;
>>> +    void *value;
>>> +
>>> +    offset = RTE_ALIGN_CEIL(offset, align);
>>> +
>>> +    if (offset + size > RTE_MAX_LCORE_VAR) {
>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>> +        lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>>> +                           RTE_CACHE_LINE_SIZE);
>>> +#else
>>> +        lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>>> +                         LCORE_BUFFER_SIZE);
>>> +#endif
>>> +        RTE_VERIFY(lcore_buffer != NULL);
>>> +
>>> +        offset = 0;
>>> +    }
>>> +
>>> +    handle = RTE_PTR_ADD(lcore_buffer, offset);
>>> +
>>> +    offset += size;
>>> +
>>> +    RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>>> +        memset(value, 0, size);
>>> +
>>> +    EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>>> +        "%"PRIuPTR"-byte alignment", size, align);
>>
>> Currrent the data was malloc by libc function, I think it's mainly for such 
>> INIT macro which will be init before main.
>> But it will introduce following problem:
>> 1\ it can't benefit from huge-pages. this patch may reserved many 1MBs for 
>> each lcore, if we could place it in huge-pages it will reduce the TLB miss 
>> rate, especially it freq access data.
> 
> This mechanism is for small allocations, which the sum of is also expected to 
> be small (although the system won't break if they aren't).
> 
> If you have large allocations, you are better off using lazy huge page 
> allocations further down the initialization process. Otherwise, you will end 
> up using memory for RTE_MAX_LCORE instances, rather than the actual lcore 
> count, which could be substantially smaller.

Yes, it may cost two much memory if allocated from hugepage memory.

> 
> But sure, everything else being equal, you could have used huge pages for 
> these lcore variable values. But everything isn't equal.
> 
>> 2\ it can't across multi-process. many of current lcore-data also don't 
>> support multi-process, but I think it worth do that, and it will help us to 
>> some service recovery when sub-process failed and reboot.
>>
>> ...
>>
> 
> Not sure I think that's a downside. Further cementing that anti-pattern into 
> DPDK seems to be a bad idea to me.
> 
> lcore variables doesn't *introduce* any of these issues, since the mechanisms 
> it's replacing also have these shortcomings (if you think about them as such 
> - I'm not sure I do).

Got it.

This feature is a enhanced for current lcore variables, which bring together 
scattered data from the point view of a single core.
and current it seemmed hard to extend support hugepage memory.

Reply via email to