On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
<snipped>

I recently looked into how Intel's Sub-NUMA Clustering would work within
DPDK, and found that I actually didn't have to do anything, because the
SNC "clusters" present themselves as NUMA nodes, which DPDK already
supports natively.

Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
`Cluster per NUMA` to `1, 2 or 4`.

This divides the tiles into Sub-NUMA partitions, each having separate
lcores, memory controllers, PCIe and accelerators.


Does AMD's implementation of chiplets not report themselves as separate
NUMA nodes?

In AMD EPYC SoC, this is different. There are 2 BIOS settings, namely

1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and
Accelerator) to be partitioned as Numa 0, 1, 2 or 4.

2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
all CPU tiles to be independent NUMA cores.


The above settings are possible because the CPU tiles are independent from
the IO tile, which makes 4 combinations available for use.

Sure, but presumably if the user wants to distinguish this, they have to
configure their system appropriately. If user wants to take advantage of
L3 as NUMA (which is what your patch proposes), then they can enable the
BIOS knob and get that functionality for free. DPDK already supports this.

The intent of the RFC is to introduce the ability to select lcores within the same
L3 cache whether or not the BIOS option `L3 as NUMA` is set. This is achieved
and tested on platforms where the OS kernel advertises the cache topology via sysfs, thus eliminating
the dependency on hwloc and libnuma, which can be different versions in different distros.
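
To illustrate the sysfs route described above, here is a minimal sketch (not the RFC's implementation) that groups CPUs by shared L3 cache. It assumes a Linux kernel that exposes `cache/indexN/level` and `cache/indexN/shared_cpu_list` under `/sys/devices/system/cpu/cpuN/`; the helper names `parse_cpulist` and `l3_groups` are made up for this example.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def l3_groups(cpu_root="/sys/devices/system/cpu"):
    """Return one CPU list per distinct L3 cache, regardless of NUMA setup."""
    groups = set()
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cache", "index*", "level")
    for level_path in glob.glob(pattern):
        with open(level_path) as f:
            if f.read().strip() != "3":
                continue  # not an L3 cache entry
        shared = os.path.join(os.path.dirname(level_path), "shared_cpu_list")
        with open(shared) as f:
            groups.add(tuple(parse_cpulist(f.read())))
    return [list(g) for g in sorted(groups)]


if __name__ == "__main__":
    for i, cpus in enumerate(l3_groups()):
        print(f"L3 domain {i}: CPUs {cpus}")
```

The grouping depends only on what the kernel reports for the cache hierarchy, so it gives the same result whether or not `L3 as NUMA` is enabled in BIOS.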

But we do depend on libnuma, so we might as well depend on it? Are there different versions of libnuma that interfere with what you're trying to do? You keep coming back to this "whether the BIOS is set or unset" for L3 as NUMA, but I'm still unclear as to what issues your patch is solving assuming "knob is set". When the system is configured correctly, it already works and reports cores as part of NUMA nodes (as L3) correctly. It is only when the system is configured *not* to do that that issues arise, is it not? In which case IMO the easier solution would be to just tell the user to enable that knob in BIOS?




These are covered in the tuning guide for the SoC, in "12. How to get best
performance on AMD platform" of the Data Plane Development Kit 24.07.0
documentation (dpdk.org)
<https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.


Because if it does, I don't really think any changes are
required because NUMA nodes would give you the same thing, would it not?

I have a different opinion to this outlook. An end user can

1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`

I recently submitted an enhancement for the CPU layout script to print out
NUMA separately from physical socket [1].

[1]
https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.bura...@intel.com/

I believe when "L3 as NUMA" is enabled in BIOS, the script will display
both physical package ID as well as NUMA nodes reported by the system,
which will be different from physical package ID, and which will display
information you were looking for.
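
For illustration, the difference between physical package ID and NUMA node can be read straight from sysfs. This is a minimal sketch, not the logic of `usertools/cpu-layout.py`; it assumes the Linux layout where each `cpuN` directory has `topology/physical_package_id` and, on NUMA-enabled kernels, a `nodeM` link. The function name `cpu_topology` is made up for this example.

```python
import glob
import os
import re


def cpu_topology(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_id: (package_id, numa_node)}; numa_node is None if absent."""
    result = {}
    for cpu_dir in glob.glob(os.path.join(cpu_root, "cpu[0-9]*")):
        m = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        pkg_path = os.path.join(cpu_dir, "topology", "physical_package_id")
        if not m or not os.path.exists(pkg_path):
            continue  # offline CPU or unrelated entry such as cpufreq
        with open(pkg_path) as f:
            package_id = int(f.read())
        # The kernel links each CPU to its NUMA node as cpuN/nodeM.
        nodes = [int(name[4:]) for name in os.listdir(cpu_dir)
                 if re.fullmatch(r"node\d+", name)]
        result[int(m.group(1))] = (package_id, nodes[0] if nodes else None)
    return result


if __name__ == "__main__":
    for cpu, (pkg, node) in sorted(cpu_topology().items()):
        print(f"cpu {cpu}: package {pkg}, NUMA node {node}")
```

With "L3 as NUMA" enabled, one would expect several NUMA node values to appear under the same package ID.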

As AMD we had submitted earlier work on the same via "usertools: enhance logic to display NUMA" - Patchwork (dpdk.org) <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.vargh...@amd.com/>.

This clearly distinguished NUMA and physical socket.

Oh, cool, I didn't see that patch. I would argue my visual format is more readable though, so perhaps we can get that in :)

Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the tile consists of CPU, memory, PCIe and accelerator.

Hence, when the BIOS option `Cluster per NUMA` is set, the OS kernel and libnuma display the appropriate domain with memory, PCIe and CPU.


In the case of AMD SoC, the NUMA domain libnuma reports for the CPU is different from the memory NUMA per socket.

I'm curious how does the kernel handle this then, and what are you getting from libnuma. You seem to be implying that there are two different NUMA nodes on your SoC, and either kernel or libnuma are in conflict as to what belongs to what NUMA node?




3. There is no API which distinguishes the L3 NUMA domain. Function
`rte_socket_id
<https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>`
 for CPU tiles like AMD SoC will return physical socket.

Sure, but I would think the answer to that would be to introduce an API
to distinguish between NUMA (socket ID in DPDK parlance) and package
(physical socket ID in the "traditional NUMA" sense). Once we can
distinguish between those, DPDK can just rely on NUMA information
provided by the OS, while still being capable of identifying physical
sockets if the user so desires.
Agreed, +1 for the idea of a physical socket API and changes in the library to exploit the same.

I am actually going to introduce API to get *physical socket* (as
opposed to NUMA node) in the next few days.

But how does it solve the end customer issues:

1. If there are multiple NICs or accelerators on multiple sockets, but the IO tile is partitioned into sub-domains.

At least on Intel platforms, NUMA node gets assigned correctly - that is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and there's a NIC connected to socket 1, it's going to show up as being on NUMA node 3 or 4 depending on where exactly I plugged it in. Everything already works as expected, and there is no need for any changes for Intel platforms (at least none that I can see).

My proposed API is really for those users who wish to explicitly allow for reserving memory/cores on "the same physical socket", as "on the same tile" is already taken care of by NUMA nodes.


2. If RTE_FLOW steering is applied on a NIC whose traffic needs to be processed under the same L3: this reduces noisy neighbors and gives better cache hits.

3. For the PKT-distribute library, which needs to run within the same worker lcore set as RX-Distributor-TX.


Same as above: on Intel platforms, NUMA nodes already solve this.

<snip>

Totally agree, that is what the RFC is also doing: we use whatever the OS sees as NUMA.

The only addition is that, within a NUMA node, if there are split LLCs, the RFC allows selection of those lcores rather than blindly choosing an lcore using rte_get_next_lcore.

It feels like we're working around a problem that shouldn't exist in the first place, because the kernel should already report this information. Within the NUMA subsystem, there is a sysfs node "distance" that, at least on Intel platforms and in certain BIOS configurations, reports the distance between NUMA nodes, from which one can make inferences about how far a specific NUMA node is from any other NUMA node. This could have been used to encode L3 cache information. Do AMD platforms not do that? In that case, "lcore next" for a particular socket ID (NUMA node, in reality) should already get us any cores that are close to each other, because all of this information is already encoded in NUMA nodes by the system.
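
For reference, the "distance" node mentioned above can be read as follows. This is a minimal sketch, assuming the Linux layout `/sys/devices/system/node/nodeN/distance` and that the columns of each file follow ascending online node ID; `numa_distances` and `nearest_nodes` are illustrative names, not DPDK or libnuma API.

```python
import glob
import os
import re


def numa_distances(node_root="/sys/devices/system/node"):
    """Return {node_id: [distance to each online node]} as the kernel reports it."""
    table = {}
    for path in glob.glob(os.path.join(node_root, "node[0-9]*", "distance")):
        name = os.path.basename(os.path.dirname(path))
        m = re.fullmatch(r"node(\d+)", name)
        if not m:
            continue
        with open(path) as f:
            table[int(m.group(1))] = [int(x) for x in f.read().split()]
    return table


def nearest_nodes(table, node_id):
    """Return the other node IDs ordered by increasing distance from node_id."""
    ids = sorted(table)  # assumption: columns are in ascending node ID order
    pairs = [(dist, other) for other, dist in zip(ids, table[node_id])
             if other != node_id]
    return [other for dist, other in sorted(pairs)]


if __name__ == "__main__":
    distances = numa_distances()
    for node in sorted(distances):
        print(f"node {node}: nearest first {nearest_nodes(distances, node)}")
```

Whether these distances say anything about L3 sharing depends on what the platform firmware reports, which is exactly the open question here.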

I feel like there's a disconnect between my understanding of the problem space, and yours, so I'm going to ask a very basic question:

Assuming the user has configured their AMD system correctly (i.e. enabled L3 as NUMA), are there any problems to be solved by adding a new API? Does the system not report each L3 as a separate NUMA node?



We force the user to configure their system
correctly as it is, and I see no reason to second-guess user's BIOS
configuration otherwise.

Again reiterating, the changes suggested in the RFC are agnostic to what BIOS options are used,

But that is exactly my contention: are we not effectively working around users' misconfiguration of a system then?

--
Thanks,
Anatoly
