On 9/5/2024 3:05 PM, Ferruh Yigit wrote:
On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
<snipped>
Hi Ferruh,
I feel like there's a disconnect between my understanding of the problem
space, and yours, so I'm going to ask a very basic question:
Assuming the user has configured their AMD system correctly (i.e.
enabled L3 as NUMA), are there any problem to be solved by adding a new
API? Does the system not report each L3 as a separate NUMA node?
Hi Anatoly,
Let me try to answer.
To start with, Intel "Sub-NUMA Clustering" and AMD NUMA are different; as
far as I understand, SNC is more similar to classic physical-socket-based
NUMA.
Following is the AMD CPU:
┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
│TILE1││TILE2││ ││TILE5││TILE6│
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
└─────┘└─────┘│ IO │└─────┘└─────┘
┌─────┐┌─────┐│ TILE │┌─────┐┌─────┐
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
│TILE3││TILE4││ ││TILE7││TILE8│
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
│ ││ ││ ││ ││ │
└─────┘└─────┘└──────────┘└─────┘└─────┘
Each 'Tile' has multiple cores, and the 'IO Tile' has the memory
controller, bus controllers, etc.
When NPS=x is configured in the BIOS, the IO tile resources are split and
each partition is seen as a NUMA node.
Following is NPS=4
┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
│ ││ ││ . ││ ││ │
│ ││ ││ . ││ ││ │
│TILE1││TILE2││ . ││TILE5││TILE6│
│ ││ ││NUMA .NUMA││ ││ │
│ ││ ││ 0 . 1 ││ ││ │
│ ││ ││ . ││ ││ │
└─────┘└─────┘│ . │└─────┘└─────┘
┌─────┐┌─────┐│..........│┌─────┐┌─────┐
│ ││ ││ . ││ ││ │
│ ││ ││NUMA .NUMA││ ││ │
│TILE3││TILE4││ 2 . 3 ││TILE7││TILE8│
│ ││ ││ . ││ ││ │
│ ││ ││ . ││ ││ │
│ ││ ││ . ││ ││ │
└─────┘└─────┘└─────.────┘└─────┘└─────┘
The benefit of this approach is that all cores can access all NUMA nodes
without penalty. For example, a DPDK application can use cores from
'TILE1', 'TILE4' & 'TILE7' and still access NUMA0 (or any NUMA node)
resources at full performance.
This is different from SNC, where cores accessing cross-NUMA resources
take a performance penalty.
Now, although which tile the cores come from doesn't matter from a NUMA
perspective, it may matter (depending on the workload) to have them under
the same LLC.
One way to make sure all cores are under the same LLC is to enable the
"L3 as NUMA" BIOS option, which makes each TILE show up as a separate
NUMA node, so the user can select cores from one NUMA node.
This is sufficient up to a point, but not when the application needs more
cores than a single tile provides.
Assume each tile has 8 cores and the application needs 24 cores: when the
user provides all cores from TILE1, TILE2 & TILE3, DPDK currently gives
the application no way to figure out how to group/select these cores to
use them efficiently.
Indeed this is what Vipin is enabling: given a core, he is finding the
list of cores that will work efficiently with it. In this perspective it
is not really related to NUMA configuration, and not really specific to
AMD, as the standard Linux sysfs interface is used for this.
There are other architectures around that have a similar NUMA
configuration, and they can use the same logic; at worst we can introduce
architecture-specific code so that every architecture has a way to find
the cores that work most efficiently with a given core. This is a useful
feature for DPDK.
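For reference, the sysfs lookup being discussed can be sketched in a few lines. This is an illustrative Python sketch, not the patch itself; the function names are made up, while the sysfs paths are the standard Linux ones:

```python
# Sketch: find the cores that share an LLC (L3) with a given core by
# reading the standard Linux sysfs cache topology. Function names are
# illustrative, not from the actual patch.

def parse_cpu_list(s):
    """Expand a sysfs cpu list such as '0-7,128-135' into a sorted list of ints."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return sorted(cpus)

def llc_siblings(core, index=3):
    """Read shared_cpu_list for the given cache index of the given core.
    index3 is typically L3 on x86; the 'level' file can confirm this."""
    path = f"/sys/devices/system/cpu/cpu{core}/cache/index{index}/shared_cpu_list"
    with open(path) as f:
        return parse_cpu_list(f.read())
```

On an NPS=4 system with "L3 as NUMA" off, `llc_siblings(0)` would return the cores of TILE1 only, regardless of how the NUMA nodes are laid out.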
Let's look at another example: an application uses 24 cores in a
graph-library-like fashion, where we want each group of three cores to
process a graph node. The application needs a way to select which three
cores work most efficiently with each other; that is what this patch
enables. In this case enabling "L3 as NUMA" does not help at all. With
this patch both BIOS configurations work, but of course the user should
select the cores passed to the application based on the configuration.
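The three-cores-per-node selection could then be built on top of such a sibling lookup. A hypothetical sketch, where the function and its signature are made up for illustration and are not the proposed API:

```python
from collections import defaultdict

def group_by_llc(cores, llc_id_of, group_size=3):
    """Partition 'cores' into groups of 'group_size' such that every group
    stays within one LLC. 'llc_id_of' maps core -> LLC id (e.g. derived
    from sysfs shared_cpu_list). Cores that cannot form a full same-LLC
    group are returned as leftovers."""
    by_llc = defaultdict(list)
    for c in cores:
        by_llc[llc_id_of[c]].append(c)
    groups, leftovers = [], []
    for members in by_llc.values():
        while len(members) >= group_size:
            groups.append(members[:group_size])
            members = members[group_size:]
        leftovers.extend(members)
    return groups, leftovers
```

With 24 cores spread over three 8-core tiles, this yields two full groups per tile plus two leftover cores per tile, which the application can then assign to lighter work.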
And we can even improve this efficient core selection; as Mattias
suggested, we could select cores that share an L2 cache, as an extension
of this patch. This is unrelated to NUMA, and again it does not introduce
architecture details into DPDK, as the implementation already relies on
the Linux sysfs interface.
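Mattias's L2 suggestion falls out of the same sysfs layout: each `index*` directory carries a `level` file, so the lookup only needs to match on the level instead of hard-coding one index. A rough sketch, with made-up function names and the selection step split out as a pure helper:

```python
import glob
import os

def pick_shared_list(entries, want_level):
    """Pure selection step, split out so it is easy to test: given
    [(level, shared_cpu_list_str), ...], return the list string of the
    first entry at the wanted cache level, or None if absent."""
    for level, cpus in entries:
        if level == want_level:
            return cpus
    return None

def cache_siblings(core, want_level):
    """Scan /sys/devices/system/cpu/cpu<core>/cache/index*/ and return
    the shared_cpu_list string for the requested cache level
    (2 = L2, 3 = L3)."""
    base = f"/sys/devices/system/cpu/cpu{core}/cache"
    entries = []
    for idx in sorted(glob.glob(os.path.join(base, "index*"))):
        with open(os.path.join(idx, "level")) as f:
            level = int(f.read())
        with open(os.path.join(idx, "shared_cpu_list")) as f:
            entries.append((level, f.read().strip()))
    return pick_shared_list(entries, want_level)
```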
I hope it clarifies a little more.
Thanks,
ferruh
Yes, this does help clarify a lot why the current NUMA support is
insufficient to express what you are describing.
However, in that case I would echo the sentiment others have already
expressed: this kind of deep sysfs parsing doesn't seem in scope for EAL;
it sounds more like something a sysadmin/orchestration layer (or the
application itself) would do.
I mean, in principle I'm not opposed to having such an API, it just
seems like the abstraction would perhaps need to be a bit more robust
than directly referencing cache structure? Maybe something that
degenerates into NUMA nodes would be better, so that applications
wouldn't have to *specifically* worry about cache locality but instead
have a more generic API they can use to group cores together?
--
Thanks,
Anatoly