On 2024-09-12 03:33, Varghese, Vipin wrote:
<snipped>
Thank you Mattias for the comments and questions; please let me
try to explain below.
Shouldn't we have a separate CPU/cache hierarchy API instead?
The intention is to bring in CPU lcores which share the same L3
(for better cache hits and fewer noisy neighbors), so the current API
focuses on the Last Level Cache. But if the suggestion is that there
are SoCs where the L2 cache is also shared, and the new API should
cover that as well, I am comfortable with that thought too.
Rather than some AMD special case API hacked into <rte_lcore.h>, I
think we are better off with no DPDK API at all for this kind of
functionality.
Hi Mattias, as shared in the earlier email thread, this is not an AMD
special case at all. Let me try to explain this one more time. One of
the cost-effective techniques used to increase core counts is to build
the CPU from tiles of compute complexes.
This introduces groups of cores sharing the same Last Level Cache
(namely L2, L3 or even L4), depending on the cache topology of the
architecture.
The API suggested in the RFC helps end users selectively use cores
under the same Last Level Cache hierarchy as advertised by the OS
(irrespective of the BIOS settings used). This is useful in both
bare-metal and container environments.
I'm pretty familiar with AMD CPUs and the use of tiles (including
the challenges these kinds of non-uniformities pose for work scheduling).
To maximize performance, caring about core<->LLC relationship may
well not be enough, and more HT/core/cache/memory topology
information is required. That's what I meant by special case. A
proper API should allow access to information about which lcores are
SMT siblings, cores on the same L2, and cores on the same L3, to
name a few things. Probably you want to fit NUMA into the same API
as well, although that is available already in <rte_lcore.h>.
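For reference, the NUMA part is indeed already queryable today; a minimal
sketch using only existing <rte_lcore.h> API (the printf is just for
illustration):

    #include <stdio.h>
    #include <rte_lcore.h>

    /* Group the enabled lcores by NUMA socket using existing DPDK API. */
    static void
    print_lcore_sockets(void)
    {
            unsigned int lcore_id;

            RTE_LCORE_FOREACH(lcore_id)
                    printf("lcore %u -> NUMA socket %u\n",
                           lcore_id, rte_lcore_to_socket_id(lcore_id));
    }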
Thank you Mattias for the information. As shared in the reply to
Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
takes an extra argument `u32 flags`.
The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, or
RTE_GET_LCORE_BOOST_DISABLED.
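For clarity, a minimal sketch of what the proposed extension could look
like; the function and flag names come from this thread, but the exact
signature and flag encoding are assumptions, not the final RFC:

    #include <stdint.h>

    /* Hypothetical sketch only: rte_get_next_lcore_ex() and the
     * RTE_GET_LCORE_* flags are proposed in this thread and do not
     * exist in DPDK today.
     */
    #define RTE_GET_LCORE_L1             (UINT32_C(1) << 0) /* SMT siblings (same L1) */
    #define RTE_GET_LCORE_L2             (UINT32_C(1) << 1) /* same L2 cluster */
    #define RTE_GET_LCORE_L3             (UINT32_C(1) << 2) /* same L3 / LLC domain */
    #define RTE_GET_LCORE_BOOST_ENABLED  (UINT32_C(1) << 3) /* boost-capable cores */
    #define RTE_GET_LCORE_BOOST_DISABLED (UINT32_C(1) << 4) /* non-boosted cores */

    /* Like rte_get_next_lcore(), but returns the next enabled lcore
     * after 'i' that also satisfies 'flags' relative to lcore 'i',
     * or RTE_MAX_LCORE when the iteration is exhausted.
     */
    unsigned int
    rte_get_next_lcore_ex(unsigned int i, int skip_main, int wrap,
                          uint32_t flags);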
Wouldn't that API be pretty awkward to use?
The API currently available in DPDK is `rte_get_next_lcore`, which is
used within the DPDK examples and in customer solutions.
Based on the comments from others we agreed to rename the new
API from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.
Can you please help us understand what is `awkward`?
The awkwardness starts when you are trying to provide hwloc-type
information over an API that was designed for iterating over lcores.
It seems to me that you should either have:
A) An API similar to that of hwloc (or any DOM-like API), which would
give a low-level description of the hardware in implementation terms.
The topology would consist of nodes, with attributes, etc., where nodes
are things like cores or instances of caches of some level, and
attributes are things like actual, nominal, and maybe max CPU
frequency, cache size, or memory size. (A hedged sketch of this option
follows below.)
or
B) An API to be directly useful for a work scheduler, in which case you
should abstract away things like "boost" (and fold them into some
abstract capacity notion, together with core "size" [in
big-little/heterogeneous systems]), and have an abstract notion of what
core is "close" to some other core. This would something like Linux'
scheduling domains.
If you want B you probably need A as a part of its implementation, so
you may just as well start with A, I suppose.
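To make option A concrete, a hedged sketch of what a DOM-like
<rte_hwtopo.h> could look like; every identifier below is hypothetical:

    #include <stdint.h>

    /* Hypothetical <rte_hwtopo.h>: a tree of typed nodes with attributes. */
    enum rte_hwtopo_node_type {
            RTE_HWTOPO_NODE_PACKAGE,
            RTE_HWTOPO_NODE_NUMA,
            RTE_HWTOPO_NODE_L3,
            RTE_HWTOPO_NODE_L2,
            RTE_HWTOPO_NODE_L1,
            RTE_HWTOPO_NODE_CORE,
            RTE_HWTOPO_NODE_PU   /* hardware thread, maps to an lcore */
    };

    struct rte_hwtopo_node;  /* opaque node handle */

    struct rte_hwtopo_node *rte_hwtopo_root(void);
    struct rte_hwtopo_node *rte_hwtopo_parent(struct rte_hwtopo_node *n);
    /* Iterate children: pass prev = NULL to get the first child. */
    struct rte_hwtopo_node *rte_hwtopo_child_next(struct rte_hwtopo_node *n,
                                                  struct rte_hwtopo_node *prev);
    enum rte_hwtopo_node_type rte_hwtopo_type(const struct rte_hwtopo_node *n);
    /* Attributes: cache size, nominal/max frequency, memory size, ... */
    int rte_hwtopo_attr_get(const struct rte_hwtopo_node *n, int attr,
                            uint64_t *val);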
What you could do to explore the API design is to add support for, for
example, boost core awareness or SMT affinity in the SW scheduler. You
could also do an "lstopo" equivalent, since that's needed for debugging
and exploration, if nothing else.
Questions that will have to be answered in a work scheduling scenario
are "are these two lcores SMT siblings?", or "are these two cores on the
same LLC?", or "give me all lcores on a particular L2 cache".
I mean, what you have is a topology, with nodes of different types and with
different properties, and you want to present it to the user.
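Those queries could then be hedged sketches like the following on top of
such a topology tree (again, all names hypothetical, building on the
sketch above):

    #include <stdbool.h>

    /* Do lcores 'a' and 'b' sit under a common node of the given type? */
    bool rte_hwtopo_lcores_share(unsigned int a, unsigned int b,
                                 enum rte_hwtopo_node_type level);

    /* Intended usage:
     *   SMT siblings: rte_hwtopo_lcores_share(a, b, RTE_HWTOPO_NODE_CORE)
     *   same LLC:     rte_hwtopo_lcores_share(a, b, RTE_HWTOPO_NODE_L3)
     * "All lcores on a particular L2 cache" falls out of walking that
     * L2 node's children with rte_hwtopo_child_next().
     */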
Let me be clear: what we want is for DPDK to help customers use a
unified API which works across multiple platforms.
Example: let a vendor have 2 products, namely A and B. CPU-A has all
cores within the same sub-NUMA domain, and CPU-B has cores split into
2 sub-NUMA domains based on a split LLC.
When `rte_get_next_lcore_extnd` is invoked for `LLC` on:
1. CPU-A: it returns all cores, as there is no split.
2. CPU-B: it returns only the cores of the specific sub-NUMA domain partitioned by that L3.
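A usage sketch under the assumption that the proposed iterator exists
(using the `rte_get_next_lcore_ex` spelling from earlier in the thread;
`worker_fn` is a hypothetical application callback): on CPU-A the loop
launches a worker on every other enabled lcore, on CPU-B only on the
lcores sharing the caller's LLC.

    #include <rte_launch.h>
    #include <rte_lcore.h>

    extern int worker_fn(void *arg);  /* hypothetical app worker */

    /* Launch workers only on lcores sharing the caller's L3/LLC. */
    static void
    launch_llc_local_workers(void)
    {
            unsigned int self = rte_lcore_id();
            unsigned int lcore;

            for (lcore = rte_get_next_lcore_ex(self, 1, 0, RTE_GET_LCORE_L3);
                 lcore < RTE_MAX_LCORE;
                 lcore = rte_get_next_lcore_ex(lcore, 1, 0, RTE_GET_LCORE_L3))
                    rte_eal_remote_launch(worker_fn, NULL, lcore);
    }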
I think the function name rte_get_next_lcore_extnd() alone makes clear
this is an awkward API. :)
My gut feeling is to make it more explicit and forget about
<rte_lcore.h>. <rte_hwtopo.h>? It could and should still be part of EAL.
In a sense, it's similar to the XML DOM versus SAX. The above is
SAX-style, and what I have in mind is something DOM-like.
What use case do you have in mind? What's on top of my list is a
scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and
tries to figure out how best to make use of them.
Exactly.
It's not going to "skip" (ignore, leave unused) SMT siblings, or skip
non-boosted cores; it would just try to be clever with regard to which
cores to use for what purpose.
Let me try to share my idea on SMT siblings. When
`rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and a
given `lcore`, the API first identifies whether the given lcore is part
of the enabled core list.
If yes, it programmatically identifies the sibling thread, either via
`sysfs` or via the hwloc library (we shared the concern about hwloc
versions on distros; will recheck again), and returns it.
If no sibling thread is available under DPDK, it will fetch the next
lcore (probably lcore + 1).
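For reference, a minimal standalone sketch of the sysfs route on Linux;
the sysfs file is real, the wrapper function is illustrative only:

    #include <stdio.h>

    /* Read a CPU's SMT sibling list from Linux sysfs, e.g. "4,132".
     * Returns 0 on success, -1 on failure.
     */
    static int
    read_smt_siblings(unsigned int cpu, char *buf, size_t len)
    {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
                     cpu);
            f = fopen(path, "r");
            if (f == NULL)
                    return -1;
            if (fgets(buf, (int)len, f) == NULL) {
                    fclose(f);
                    return -1;
            }
            fclose(f);
            return 0;
    }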
Distributions having old hwloc versions isn't an argument for a new
DPDK library or a new API. If that were the only issue, it would be
better to help hwloc and/or the distributions, rather than the DPDK
project.
This is agnostic of the AMD EPYC SoC and tries to address all generic cases.
Please do let us know if we (Ferruh and myself) can sync up via a call.
Sure, I can do that.
Let me sync with Ferruh and get a time slot for internal sync.
Can this be opened to the rest of the community? This is a common problem
that needs to be solved for multiple architectures. I would be interested in
attending.
Thank you Mattias. At the DPDK Bangkok summit 2024 we did bring this up,
and per the suggestion from Thomas and Jerin we brought the RFC for
discussion.
For DPDK Montreal 2024, Keesang and (most likely) Ferruh are travelling
to the summit and presenting this as a talk to get things moving.
<snipped>