On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
<snipped>
I recently looked into how Intel's Sub-NUMA Clustering would work within
DPDK, and found that I actually didn't have to do anything, because the
SNC "clusters" present themselves as NUMA nodes, which DPDK already
supports natively.
Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
`Cluster per NUMA` to `1, 2 or 4`.
This divides the tiles into Sub-NUMA partitions, each having separate
lcores, memory controllers, PCIe and accelerators.
Does AMD's implementation of chiplets not report themselves as separate
NUMA nodes?
In AMD EPYC SoC, this is different. There are 2 BIOS settings, namely
1. NPS: `Numa Per Socket`, which allows the IO tile (memory, PCIe and
Accelerator) to be partitioned as NUMA 0, 1, 2 or 4.
2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
all CPU tiles to be independent NUMA nodes.
The above settings are possible because the CPU tiles are independent
of the IO tile, so 4 combinations are available for use.
Sure, but presumably if the user wants to distinguish this, they have to
configure their system appropriately. If user wants to take advantage of
L3 as NUMA (which is what your patch proposes), then they can enable the
BIOS knob and get that functionality for free. DPDK already supports
this.
The intent of the RFC is to introduce the ability to select lcores
within the same L3 cache whether the BIOS is set or unset for
`L3 as NUMA`. This is also achieved and tested on platforms where the
OS kernel advertises it via sysfs, thus eliminating the dependency on
hwloc and libnuma, which can be different versions in different distros.
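
For illustration, here is a minimal sketch of how L3 sharing can be read
from sysfs without hwloc or libnuma. It is not the RFC's code. It
assumes the kernel exports `cache/index*/level` and
`cache/index*/shared_cpu_list` under `/sys/devices/system/cpu/cpuN/`;
the helper names `parse_cpulist` and `l3_groups` are hypothetical.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_groups(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return a sorted list of CPU-id tuples, one tuple per distinct L3."""
    groups = set()
    pattern = os.path.join(sysfs_cpu_root, "cpu[0-9]*", "cache", "index*")
    for cache_dir in glob.glob(pattern):
        try:
            # Do not assume index3 is the L3: check the reported level.
            with open(os.path.join(cache_dir, "level")) as f:
                if f.read().strip() != "3":
                    continue
            with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
                groups.add(tuple(parse_cpulist(f.read())))
        except OSError:
            continue  # cache info not exported for this CPU
    return sorted(groups)


if __name__ == "__main__":
    for i, cpus in enumerate(l3_groups()):
        print(f"L3 domain {i}: {cpus}")
```

Because the grouping comes from the cache topology rather than from
NUMA nodes, the result should not depend on whether `L3 as NUMA` is
enabled in BIOS.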
But we do depend on libnuma, so we might as well depend on it? Are there
different versions of libnuma that interfere with what you're trying to
do? You keep coming back to this "whether the BIOS is set or unset" for
L3 as NUMA, but I'm still unclear as to what issues your patch is
solving assuming "knob is set". When the system is configured correctly,
it already works and reports cores as part of NUMA nodes (as L3)
correctly. It is only when the system is configured *not* to do that
that issues arise, is it not? In which case IMO the easier solution
would be to just tell the user to enable that knob in BIOS?
These are covered in the tuning guide for the SoC: section 12, "How to
get best performance on AMD platform", of the Data Plane Development Kit
24.07.0 documentation (dpdk.org)
<https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
Because if it does, I don't really think any changes are
required because NUMA nodes would give you the same thing, would it
not?
I have a different opinion to this outlook. An end user can
1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`
I recently submitted an enhancement for the CPU layout script to print out
NUMA separately from physical socket [1].
[1]
https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.bura...@intel.com/
I believe when "L3 as NUMA" is enabled in BIOS, the script will display
both physical package ID as well as NUMA nodes reported by the system,
which will be different from physical package ID, and which will display
information you were looking for.
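
As a rough illustration of the distinction (not the script's actual
code), the sketch below reads both values per CPU from sysfs. It assumes
the usual Linux layout, where each `cpuN` directory contains
`topology/physical_package_id` and a `nodeM` entry naming its NUMA node;
the function name `cpu_topology` is hypothetical.

```python
import glob
import os
import re


def cpu_topology(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_id: (physical_package_id, numa_node_or_None)}."""
    result = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_cpu_root, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        if not match:
            continue
        try:
            path = os.path.join(cpu_dir, "topology", "physical_package_id")
            with open(path) as f:
                package = int(f.read())
        except OSError:
            continue  # topology not exported, e.g. CPU offline
        # The NUMA node appears as a "nodeM" entry inside the cpu directory.
        nodes = glob.glob(os.path.join(cpu_dir, "node[0-9]*"))
        node = int(os.path.basename(nodes[0])[4:]) if nodes else None
        result[int(match.group(1))] = (package, node)
    return result


if __name__ == "__main__":
    for cpu, (package, node) in sorted(cpu_topology().items()):
        print(f"cpu {cpu}: package {package}, NUMA node {node}")
```

With SNC or "L3 as NUMA" enabled, several NUMA node values would be
expected to appear under the same package ID.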
As AMD, we had submitted earlier work on the same: "usertools: enhance
logic to display NUMA" - Patchwork (dpdk.org)
<https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.vargh...@amd.com/>.
This clearly distinguished NUMA and physical socket.
Oh, cool, I didn't see that patch. I would argue my visual format is
more readable though, so perhaps we can get that in :)
Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the
tile consists of CPU, memory, PCIe and accelerator.
Hence, when the BIOS option `Cluster per NUMA` is set, the OS kernel and
libnuma display the appropriate domain with memory, PCIe and CPU.
In the case of AMD SoC, the NUMA reported by libnuma for CPU is
different from the memory NUMA per socket.
I'm curious how the kernel handles this then, and what you are
getting from libnuma. You seem to be implying that there are two
different NUMA nodes on your SoC, and either kernel or libnuma are in
conflict as to what belongs to what NUMA node?
3. There is no API which distinguishes the L3 NUMA domain. Function
`rte_socket_id
<https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>`
for CPU tiles like AMD SoC will return the physical socket.
Sure, but I would think the answer to that would be to introduce an API
to distinguish between NUMA (socket ID in DPDK parlance) and package
(physical socket ID in the "traditional NUMA" sense). Once we can
distinguish between those, DPDK can just rely on NUMA information
provided by the OS, while still being capable of identifying physical
sockets if the user so desires.
Agreed, +1 for the idea of a physical socket and changes in the library
to exploit the same.
I am actually going to introduce an API to get *physical socket* (as
opposed to NUMA node) in the next few days.
But how does it solve the end customer issues?
1. If there are multiple NICs or accelerators on multiple sockets, but
the IO tile is partitioned into sub-domains.
At least on Intel platforms, NUMA node gets assigned correctly - that
is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and
there's a NIC connected to socket 1, it's going to show up as being on
NUMA node 3 or 4 depending on where exactly I plugged it in. Everything
already works as expected, and there is no need for any changes for
Intel platforms (at least none that I can see).
My proposed API is really for those users who wish to explicitly allow
for reserving memory/cores on "the same physical socket", as "on the
same tile" is already taken care of by NUMA nodes.
2. If RTE_FLOW steering is applied on a NIC, which needs to be processed
under the same L3 - reduces noisy neighbors and gives better cache hits.
3. For the PKT-distribute library, which needs to run within the same
worker lcore set as RX-Distributor-TX.
Same as above: on Intel platforms, NUMA nodes already solve this.
<snip>
Totally agree, that is what the RFC is also doing: we use whatever the
OS sees as NUMA.
The only addition is that, within a NUMA node with split LLCs, it allows
selection of those lcores rather than blindly choosing lcores using
rte_get_next_lcore.
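
To make the idea concrete, here is a small sketch of "next lcore within
the same LLC" as plain Python. It is not the RFC's API: the function
name `next_lcore_in_llc` and its parameters are hypothetical, and the
LLC groups are assumed to have been discovered already (for example
from sysfs).

```python
def next_lcore_in_llc(current, enabled_lcores, llc_groups, wrap=True):
    """Return the next enabled lcore after 'current' that shares its LLC.

    llc_groups is an iterable of CPU-id collections, one per LLC.
    Returns None if 'current' is in no group or has no enabled peer.
    """
    group = next((g for g in llc_groups if current in g), None)
    if group is None:
        return None
    peers = sorted(c for c in group if c in enabled_lcores and c != current)
    for candidate in peers:
        if candidate > current:
            return candidate
    # Nothing above 'current': optionally wrap to the lowest peer.
    return peers[0] if (wrap and peers) else None


if __name__ == "__main__":
    groups = [(0, 1, 2, 3), (4, 5, 6, 7)]
    enabled = {1, 2, 5, 6}
    print(next_lcore_in_llc(1, enabled, groups))
```

A plain next-lcore walk over the same enabled set would go from lcore 2
to lcore 5 and cross an LLC boundary; the sketch stays inside the
group.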
It feels like we're working around a problem that shouldn't exist in the
first place, because kernel should already report this information.
Within the NUMA subsystem, there is a sysfs node "distance" that, at
least on Intel platforms and in certain BIOS configurations, reports
distance
between NUMA nodes, from which one can make inferences about how far a
specific NUMA node is from any other NUMA node. This could have been
used to encode L3 cache information. Do AMD platforms not do that? In
that case, "lcore next" for a particular socket ID (NUMA node, in
reality) should already get us any cores that are close to each other,
because all of this information is already encoded in NUMA nodes by the
system.
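
For reference, a minimal sketch of reading that "distance" information.
It assumes the usual `/sys/devices/system/node/nodeN/distance` files,
and additionally assumes that the columns in each file follow the
ascending IDs of the nodes present; the function names are hypothetical.

```python
import glob
import os


def numa_distances(sysfs_node_root="/sys/devices/system/node"):
    """Return {node_id: {other_node_id: distance}} from sysfs."""
    node_ids = sorted(
        int(os.path.basename(path)[4:])
        for path in glob.glob(os.path.join(sysfs_node_root, "node[0-9]*"))
    )
    table = {}
    for node in node_ids:
        path = os.path.join(sysfs_node_root, f"node{node}", "distance")
        with open(path) as f:
            values = [int(v) for v in f.read().split()]
        # Assumption: one column per present node, in ascending ID order.
        table[node] = dict(zip(node_ids, values))
    return table


def nearest_nodes(table, node):
    """Other nodes ordered from closest to farthest from 'node'."""
    others = (n for n in table[node] if n != node)
    return sorted(others, key=lambda n: (table[node][n], n))


if __name__ == "__main__":
    distances = numa_distances()
    for node in distances:
        print(f"node {node}: nearest first {nearest_nodes(distances, node)}")
```

Whether these distances actually encode L3 proximity on a given
platform is exactly the open question here; the sketch only shows how
the values would be read.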
I feel like there's a disconnect between my understanding of the problem
space, and yours, so I'm going to ask a very basic question:
Assuming the user has configured their AMD system correctly (i.e.
enabled L3 as NUMA), is there any problem to be solved by adding a new
API? Does the system not report each L3 as a separate NUMA node?
We force the user to configure their system
correctly as it is, and I see no reason to second-guess user's BIOS
configuration otherwise.
Again, to reiterate, the changes suggested in the RFC are agnostic to
which BIOS options are used,
But that is exactly my contention: are we not effectively working around
users' misconfiguration of a system then?
--
Thanks,
Anatoly