Hello Vipin and others,

will there be any progress or an update on this series?
I successfully tested these changes on our Intel and AMD machines and would like to use them in production soon. The API is a little unintuitive, at least for me, but I integrated it into our software successfully.

What I am missing is a clear relation to the NUMA socket approach used in DPDK. For example, I would like to be able to easily walk over a list of lcores from a specific NUMA node, grouped by L3 domain. Yes, there is RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket IDs?

Also, I do not clearly understand the purpose of combining domain selectors like RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2, or even RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2; the documentation does not explain this, and I could not spot any kind of grouping that would help me. Some "best practices" examples would be nice to have, to understand the intentions better. The kind of walk I have in mind is sketched right below.
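For illustration, here is roughly the walk I would like to write. This is only a sketch: the signatures of rte_get_domain_count(), rte_lcore_count_from_domain() and rte_get_lcore_in_domain() are my reading of the cover letter below and may not match the patches exactly, and the rte_lcore_to_socket_id() filter is my own addition, since I did not find a socket-aware variant in the series itself:

  #include <stdio.h>

  #include <rte_lcore.h>

  /* Walk all lcores of one NUMA socket, grouped by L3 cache domain.
   * The rte_*_domain() signatures are inferred from the cover letter
   * and may differ from the actual patches. */
  static void
  list_lcores_by_l3(unsigned int socket_id)
  {
          unsigned int dom_count = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

          for (unsigned int dom = 0; dom < dom_count; dom++) {
                  unsigned int n = rte_lcore_count_from_domain(
                                  RTE_LCORE_DOMAIN_L3, dom);

                  for (unsigned int i = 0; i < n; i++) {
                          unsigned int lcore = rte_get_lcore_in_domain(
                                          RTE_LCORE_DOMAIN_L3, dom, i);

                          /* Restrict the walk to one socket; this is the
                           * part I am missing from the API itself. */
                          if (rte_lcore_to_socket_id(lcore) != socket_id)
                                  continue;

                          printf("socket %u, L3 domain %u: lcore %u\n",
                                          socket_id, dom, lcore);
                  }
          }
  }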
I also found a little catch when running DPDK with more lcores than there are physical or SMT CPU cores, e.g. when using an option like --lcores=(0-15)@(0-1). The results from the topology API then do not match the lcores, because hwloc is not aware of the lcore concept. This might be worth mentioning somewhere; a small demo of this is at the very end of this mail, below the quoted cover letter.

Anyway, I really appreciate this work and would like to see it upstream. Especially for AMD machines, a framework like this is a must.

Kind regards,
Jan

On Tue, 5 Nov 2024 15:58:45 +0530
Vipin Varghese <vipin.vargh...@amd.com> wrote:

> This patch introduces improvements for NUMA topology awareness in
> relation to DPDK logical cores. The goal is to expose an API which
> allows users to select optimal logical cores for any application.
> These logical cores can be selected from various NUMA domains like
> CPU and I/O.
>
> Change Summary:
> - Introduces the concept of NUMA domain partitioning based on CPU
>   and I/O topology.
> - Adds support for grouping DPDK logical cores within the same Cache
>   and I/O domain for improved locality.
> - Implements topology detection and core grouping logic that
>   distinguishes between the following NUMA configurations:
>   * CPU topology & I/O topology (e.g., AMD SoC EPYC, Intel Xeon SPR)
>   * CPU+I/O topology (e.g., Ampere One with SLC, Intel Xeon SPR
>     with SNC)
> - Enhances performance by minimizing lcore dispersion across
>   tiles|compute packages with different L2/L3 cache or IO domains.
>
> Reason:
> - Applications using DPDK libraries rely on consistent memory
>   access.
> - Lcores being closer to the same NUMA domain as IO.
> - Lcores sharing the same cache.
>
> Latency is minimized by using lcores that share the same NUMA
> topology. Memory access is optimized by utilizing cores within the
> same NUMA domain or tile. Cache coherence is preserved within the
> same shared cache domain, reducing remote access from another
> tile|compute package via snooping (local hit in either L2 or L3
> within the same NUMA domain).
>
> Library dependency: hwloc
>
> Topology Flags:
> ---------------
> - RTE_LCORE_DOMAIN_L1: group cores sharing the same L1 cache
> - RTE_LCORE_DOMAIN_SMT: same as RTE_LCORE_DOMAIN_L1
> - RTE_LCORE_DOMAIN_L2: group cores sharing the same L2 cache
> - RTE_LCORE_DOMAIN_L3: group cores sharing the same L3 cache
> - RTE_LCORE_DOMAIN_L4: group cores sharing the same L4 cache
> - RTE_LCORE_DOMAIN_IO: group cores sharing the same IO domain
>
> < Function: Purpose >
> ---------------------
> - rte_get_domain_count: get the domain count for a Topology Flag
> - rte_lcore_count_from_domain: get the valid lcore count under each
>   domain
> - rte_get_lcore_in_domain: get a valid lcore id based on index
> - rte_lcore_cpuset_in_domain: return a valid cpuset based on index
> - rte_lcore_is_main_in_domain: return true|false whether the main
>   lcore is present
> - rte_get_next_lcore_from_domain: next valid lcore within the domain
> - rte_get_next_lcore_from_next_domain: next valid lcore from the
>   next domain
>
> Note:
> 1. Topology is the NUMA grouping.
> 2. Domain is one of the various sub-groups within a specific
>    Topology.
>
> Topology example: L1, L2, L3, L4, IO
> Domain example: IO-A, IO-B
>
> < MACRO: Purpose >
> ------------------
> - RTE_LCORE_FOREACH_DOMAIN: iterate lcores from all domains
> - RTE_LCORE_FOREACH_WORKER_DOMAIN: iterate worker lcores from all
>   domains
> - RTE_LCORE_FORN_NEXT_DOMAIN: iterate domains, selecting the n'th
>   lcore
> - RTE_LCORE_FORN_WORKER_NEXT_DOMAIN: iterate domains for the n'th
>   worker lcore
>
> Future work (after merge):
> --------------------------
> - dma-perf per IO NUMA
> - eventdev per L3 NUMA
> - pipeline per SMT|L3 NUMA
> - distributor per L3 for Port-Queue
> - l2fwd-power per SMT
> - testpmd option for IO NUMA per port
>
> Platforms tested on:
> --------------------
> - INTEL(R) XEON(R) PLATINUM 8562Y+ (supports IO numa 1 & 2)
> - AMD EPYC 8534P (supports IO numa 1 & 2)
> - AMD EPYC 9554 (supports IO numa 1, 2, 4)
>
> Logs:
> -----
> 1. INTEL(R) XEON(R) PLATINUM 8562Y+:
>    - SNC=1
>      Domain (IO): at index (0) there are 48 core, with (0) at index 0
>    - SNC=2
>      Domain (IO): at index (0) there are 24 core, with (0) at index 0
>      Domain (IO): at index (1) there are 24 core, with (12) at index 0
>
> 2. AMD EPYC 8534P:
>    - NPS=1:
>      Domain (IO): at index (0) there are 128 core, with (0) at index 0
>    - NPS=2:
>      Domain (IO): at index (0) there are 64 core, with (0) at index 0
>      Domain (IO): at index (1) there are 64 core, with (32) at index 0
>
> Signed-off-by: Vipin Varghese <vipin.vargh...@amd.com>
>
> Vipin Varghese (4):
>   eal/lcore: add topology based functions
>   test/lcore: enable tests for topology
>   doc: add topology grouping details
>   examples: update with lcore topology API
>
>  app/test/test_lcores.c                        | 528 +++++++++++++
>  config/meson.build                            |  18 +
>  .../prog_guide/env_abstraction_layer.rst      |  22 +
>  examples/helloworld/main.c                    | 154 +++-
>  examples/l2fwd/main.c                         |  56 +-
>  examples/skeleton/basicfwd.c                  |  22 +
>  lib/eal/common/eal_common_lcore.c             | 714 ++++++++++++++++++
>  lib/eal/common/eal_private.h                  |  58 ++
>  lib/eal/freebsd/eal.c                         |  10 +
>  lib/eal/include/rte_lcore.h                   | 209 +++++
>  lib/eal/linux/eal.c                           |  11 +
>  lib/eal/meson.build                           |   4 +
>  lib/eal/version.map                           |  11 +
>  lib/eal/windows/eal.c                         |  12 +
>  14 files changed, 1819 insertions(+), 10 deletions(-)
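PS: To make the --lcores catch I mentioned above concrete, here is a
minimal demo (same caveat as before: the topology-API signatures are my
reading of the cover letter, not verified against the final series):

  /* Run as: ./demo --lcores='(0-15)@(0-1)'
   * EAL then reports 16 lcores, but the topology API is built on
   * hwloc, which only sees the 2 physical CPUs, so the per-domain
   * view cannot be mapped 1:1 onto the configured lcores. */
  #include <stdio.h>

  #include <rte_eal.h>
  #include <rte_lcore.h>

  int
  main(int argc, char **argv)
  {
          if (rte_eal_init(argc, argv) < 0)
                  return -1;

          printf("EAL lcores: %u\n", rte_lcore_count()); /* prints 16 */

          /* Count lcores as seen through the L3 domain grouping. */
          unsigned int total = 0;
          unsigned int dom_count = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

          for (unsigned int dom = 0; dom < dom_count; dom++)
                  total += rte_lcore_count_from_domain(RTE_LCORE_DOMAIN_L3, dom);

          /* In my tests this count differs from rte_lcore_count() in
           * the over-subscribed case above. */
          printf("lcores via L3 domains: %u\n", total);

          rte_eal_cleanup();
          return 0;
  }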