Hello Vipin and others,

please, will there be any progress or an update on this series?

I successfully tested those changes on our Intel and AMD machines and
would like to use them in production soon.

The API is a little bit unintuitive, at least for me, but I
successfully integrated it into our software.

I am missing a clear relation to the NUMA socket approach used in DPDK.
For example, I would like to be able to easily walk over a list of
lcores from a specific NUMA node, grouped by L3 domain. Yes, there is
RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket
IDs?
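To make my use case concrete, here is roughly what I would like to
write. This is only a sketch against the proposed API from this series
(I am guessing the exact signatures of rte_get_domain_count,
rte_lcore_count_from_domain and rte_get_lcore_in_domain from the cover
letter; only rte_lcore_to_socket_id is existing EAL API):

```c
#include <stdio.h>
#include <rte_lcore.h>

/* Sketch: walk all L3 domains and pick out the lcores that live on
 * NUMA socket 0. The rte_get_*/rte_lcore_*_domain calls come from
 * this series and their signatures are my guess; only
 * rte_lcore_to_socket_id() is existing EAL API. */
static void
list_socket0_lcores_by_l3(void)
{
	unsigned int doms = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

	for (unsigned int d = 0; d < doms; d++) {
		unsigned int n =
			rte_lcore_count_from_domain(RTE_LCORE_DOMAIN_L3, d);

		for (unsigned int i = 0; i < n; i++) {
			unsigned int lcore =
				rte_get_lcore_in_domain(RTE_LCORE_DOMAIN_L3, d, i);

			if (rte_lcore_to_socket_id(lcore) == 0)
				printf("socket 0, L3 domain %u: lcore %u\n",
				       d, lcore);
		}
	}
}
```

If something like this is already possible with the current API, a
documented example would be enough for me.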

Also, I do not clearly understand the purpose of combining domain
selectors like:

  RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2

or even:

  RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2

The documentation does not explain this. I could not spot any kind of
grouping that would help me in any way. Some "best practices" examples
would help to understand the intentions better.

I found a little catch when running DPDK with more lcores than there
are physical or SMT CPU cores, e.g. when using an option like
--lcores='(0-15)@(0-1)'. The results from the topology API do not
match the lcores because hwloc is not aware of the lcore concept. This
might be worth mentioning in the documentation.
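For reference, the mismatch is easy to see by dumping the lcore-to-CPU
mapping (a sketch using only existing EAL API; with
--lcores='(0-15)@(0-1)' all sixteen lcores land on CPUs 0-1, while
hwloc only ever reports the two physical PUs):

```c
#include <stdio.h>
#include <rte_lcore.h>

/* Sketch: print which physical CPU each enabled lcore runs on. With
 * --lcores='(0-15)@(0-1)' many lcores share the same CPU, so counts
 * from the hwloc-based topology API cannot match the lcore count. */
static void
dump_lcore_cpu_mapping(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH(lcore)
		printf("lcore %u -> cpu %d\n",
		       lcore, rte_lcore_to_cpu_id(lcore));
}
```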

Anyway, I really appreciate this work and would like to see it upstream.
Especially for AMD machines, some framework like this is a must.

Kind regards,
Jan

On Tue, 5 Nov 2024 15:58:45 +0530
Vipin Varghese <vipin.vargh...@amd.com> wrote:

> This patch introduces improvements for NUMA topology awareness in
> relation to DPDK logical cores. The goal is to expose an API which
> allows users to select optimal logical cores for any application.
> These logical cores can be selected from various NUMA domains, such
> as CPU and I/O.
> 
> Change Summary:
>  - Introduces the concept of NUMA domain partitioning based on CPU and
>    I/O topology.
>  - Adds support for grouping DPDK logical cores within the same Cache
>    and I/O domain for improved locality.
>  - Implements topology detection and core grouping logic that
>    distinguishes between the following NUMA configurations:
>     * CPU topology & I/O topology (e.g., AMD SoC EPYC, Intel Xeon SPR)
>     * CPU+I/O topology (e.g., Ampere One with SLC, Intel Xeon SPR
> with SNC)
>  - Enhances performance by minimizing lcore dispersion across
>    tiles|compute packages with different L2/L3 cache or IO domains.
> 
> Reason:
>  - Applications using DPDK libraries rely on consistent memory
>    access.
>  - Lcores should be in the same NUMA domain as the IO they use.
>  - Lcores should share the same cache.
> 
> Latency is minimized by using lcores that share the same NUMA
> topology. Memory access is optimized by utilizing cores within the
> same NUMA domain or tile. Cache coherence is preserved within the
> same shared cache domain, reducing remote accesses from other
> tiles|compute packages via snooping (a local hit in either L2 or L3
> within the same NUMA domain).
> 
> Library dependency: hwloc
> 
> Topology Flags:
> ---------------
>  - RTE_LCORE_DOMAIN_L1: to group cores sharing same L1 cache
>  - RTE_LCORE_DOMAIN_SMT: same as RTE_LCORE_DOMAIN_L1
>  - RTE_LCORE_DOMAIN_L2: group cores sharing same L2 cache
>  - RTE_LCORE_DOMAIN_L3: group cores sharing same L3 cache
>  - RTE_LCORE_DOMAIN_L4: group cores sharing same L4 cache
>  - RTE_LCORE_DOMAIN_IO: group cores sharing same IO
> 
> < Function: Purpose >
> ---------------------
>  - rte_get_domain_count: get domain count based on Topology Flag
>  - rte_lcore_count_from_domain: get valid lcore count under each domain
>  - rte_get_lcore_in_domain: valid lcore id based on index
>  - rte_lcore_cpuset_in_domain: return valid cpuset based on index
>  - rte_lcore_is_main_in_domain: return true|false if main lcore is present
>  - rte_get_next_lcore_from_domain: next valid lcore within domain
>  - rte_get_next_lcore_from_next_domain: next valid lcore from next domain
> 
> Note:
>  1. Topology is NUMA grouping.
>  2. Domain is various sub-groups within a specific Topology.
> 
> Topology example: L1, L2, L3, L4, IO
> Domain example: IO-A, IO-B
> 
> < MACRO: Purpose >
> ------------------
>  - RTE_LCORE_FOREACH_DOMAIN: iterate lcores from all domains
>  - RTE_LCORE_FOREACH_WORKER_DOMAIN: iterate worker lcores from all domains
>  - RTE_LCORE_FORN_NEXT_DOMAIN: iterate domains selecting the n'th lcore
>  - RTE_LCORE_FORN_WORKER_NEXT_DOMAIN: iterate domains for the n'th worker lcore.
> 
> Future work (after merge):
> --------------------------
>  - dma-perf per IO NUMA
>  - eventdev per L3 NUMA
>  - pipeline per SMT|L3 NUMA
>  - distributor per L3 for Port-Queue
>  - l2fwd-power per SMT
>  - testpmd option for IO NUMA per port
> 
> Platform tested on:
> -------------------
>  - INTEL(R) XEON(R) PLATINUM 8562Y+ (support IO numa 1 & 2)
>  - AMD EPYC 8534P (supports IO numa 1 & 2)
>  - AMD EPYC 9554 (supports IO numa 1, 2, 4)
> 
> Logs:
> -----
> 1. INTEL(R) XEON(R) PLATINUM 8562Y+:
>  - SNC=1
>         Domain (IO): at index (0) there are 48 core, with (0) at index 0
>  - SNC=2
>         Domain (IO): at index (0) there are 24 core, with (0) at index 0
>         Domain (IO): at index (1) there are 24 core, with (12) at index 0
> 
> 2. AMD EPYC 8534P:
>  - NPS=1:
>         Domain (IO): at index (0) there are 128 core, with (0) at index 0
>  - NPS=2:
>         Domain (IO): at index (0) there are 64 core, with (0) at index 0
>         Domain (IO): at index (1) there are 64 core, with (32) at index 0
> 
> Signed-off-by: Vipin Varghese <vipin.vargh...@amd.com>
> 
> Vipin Varghese (4):
>   eal/lcore: add topology based functions
>   test/lcore: enable tests for topology
>   doc: add topology grouping details
>   examples: update with lcore topology API
> 
>  app/test/test_lcores.c                        | 528 +++++++++++++
>  config/meson.build                            |  18 +
>  .../prog_guide/env_abstraction_layer.rst      |  22 +
>  examples/helloworld/main.c                    | 154 +++-
>  examples/l2fwd/main.c                         |  56 +-
>  examples/skeleton/basicfwd.c                  |  22 +
>  lib/eal/common/eal_common_lcore.c             | 714 ++++++++++++++++++
>  lib/eal/common/eal_private.h                  |  58 ++
>  lib/eal/freebsd/eal.c                         |  10 +
>  lib/eal/include/rte_lcore.h                   | 209 +++++
>  lib/eal/linux/eal.c                           |  11 +
>  lib/eal/meson.build                           |   4 +
>  lib/eal/version.map                           |  11 +
>  lib/eal/windows/eal.c                         |  12 +
>  14 files changed, 1819 insertions(+), 10 deletions(-)
> 
