On 2024-09-12 03:33, Varghese, Vipin wrote:

<snipped>

Thank you Mattias for the comments and questions; please let me
try to explain below.

Shouldn't we have a separate CPU/cache hierarchy API instead?

Based on the intention to bring in CPU lcores which share the same L3
(for better cache hits and a less noisy neighbor), the current API
focuses on the Last Level Cache. But if the suggestion is `there are
SoCs where the L2 cache is also shared, and the new API should
provision for that`, I am comfortable with that thought as well.


Rather than some AMD special case API hacked into <rte_lcore.h>, I
think we are better off with no DPDK API at all for this kind of
functionality.

Hi Mattias, as shared in the earlier email thread, this is not an
AMD special case at all. Let me try to explain this one more time.
One of the techniques used to increase core count in a cost-effective
way is to build the SoC from tiles of compute complexes.
This results in a bunch of cores sharing the same Last Level Cache
(namely L2, L3 or even L4), depending on the cache topology of the
architecture.

The API suggested in the RFC is to help end users selectively use
cores under the same Last Level Cache hierarchy, as advertised by the
OS (irrespective of the BIOS settings used). This is useful in both
bare-metal and container environments.


I'm pretty familiar with AMD CPUs and the use of tiles (including
the challenges these kinds of non-uniformities pose for work scheduling).

To maximize performance, caring about core<->LLC relationship may
well not be enough, and more HT/core/cache/memory topology
information is required. That's what I meant by special case. A
proper API should allow access to information about which lcores are
SMT siblings, cores on the same L2, and cores on the same L3, to
name a few things. Probably you want to fit NUMA into the same API
as well, although that is available already in <rte_lcore.h>.
Thank you Mattias for the information. As shared in the reply to
Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
takes an extra argument `u32 flags`.
The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
RTE_GET_LCORE_BOOST_DISABLED.
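
For clarity, a minimal sketch of what that prototype could look like,
mirroring the existing `rte_get_next_lcore` signature. The flag names
are the ones proposed above; the flag values and the prototype itself
are hypothetical, not existing DPDK API:

    /* Hypothetical sketch only: flag values and the prototype are
     * illustrative, not existing DPDK API. */
    #include <stdint.h>

    #define RTE_GET_LCORE_L1             (1u << 0) /* SMT siblings */
    #define RTE_GET_LCORE_L2             (1u << 1) /* shared L2 */
    #define RTE_GET_LCORE_L3             (1u << 2) /* shared L3 */
    #define RTE_GET_LCORE_BOOST_ENABLED  (1u << 3)
    #define RTE_GET_LCORE_BOOST_DISABLED (1u << 4)

    /* Return the next enabled lcore after 'i' that satisfies 'flags',
     * following the semantics of rte_get_next_lcore(i, skip_main, wrap). */
    unsigned int
    rte_get_next_lcore_ex(unsigned int i, int skip_main, int wrap,
                          uint32_t flags);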

Wouldn't that API be pretty awkward to use?
The current API available in DPDK is `rte_get_next_lcore`, which is used
within the DPDK examples and in customer solutions.
Based on the comments from others, we proposed changing the new API name
from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.

Can you please help us understand what is `awkward`?


The awkwardness starts when you are trying to provide hwloc-type information over an API that was designed for iterating over lcores.

It seems to me that you should either have:
A) An API similar to that of hwloc (or any DOM-like API), which would give a low-level description of the hardware in implementation terms. The topology would consist of nodes, with attributes, etc., where nodes are things like cores or instances of caches of some level, and attributes are things like the CPU's actual and nominal (and maybe max) frequency, cache size, or memory size.
or
B) An API to be directly useful for a work scheduler, in which case you should abstract away things like "boost" (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and have an abstract notion of which core is "close" to some other core. This would be something like Linux's scheduling domains.

If you want B you probably need A as a part of its implementation, so you may just as well start with A, I suppose.
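
To make option A concrete, here is a rough, entirely hypothetical sketch
of what a DOM-style node/attribute API could look like; none of these
types or functions exist in DPDK, and the names are made up for
illustration:

    /* Hypothetical DOM-style topology API (option A). */
    #include <stdint.h>

    enum rte_hwtopo_node_type {
        RTE_HWTOPO_NODE_PACKAGE,
        RTE_HWTOPO_NODE_CORE,
        RTE_HWTOPO_NODE_CACHE_L1,
        RTE_HWTOPO_NODE_CACHE_L2,
        RTE_HWTOPO_NODE_CACHE_L3,
        RTE_HWTOPO_NODE_NUMA,
    };

    struct rte_hwtopo_node; /* opaque node handle */

    /* Root of the topology tree. */
    struct rte_hwtopo_node *rte_hwtopo_root(void);

    /* Iterate children; pass prev = NULL to get the first child. */
    struct rte_hwtopo_node *
    rte_hwtopo_child_next(struct rte_hwtopo_node *parent,
                          struct rte_hwtopo_node *prev);

    enum rte_hwtopo_node_type
    rte_hwtopo_node_get_type(struct rte_hwtopo_node *node);

    /* Generic attribute access, e.g. cache size or nominal frequency. */
    int rte_hwtopo_node_attr_get(struct rte_hwtopo_node *node,
                                 const char *attr, uint64_t *value);

A scheduler-facing option-B API could then be layered on top of such a
tree.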

What you could do to explore the API design is to add support for, for example, boost core awareness or SMT affinity in the SW scheduler. You could also do an "lstopo" equivalent, since that's needed for debugging and exploration, if nothing else.

Questions that will have to be answered in a work scheduling scenario are things like "are these two lcores SMT siblings?", "are these two cores on the same LLC?", or "give me all lcores on a particular L2 cache".
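
In API terms, those questions could map to hypothetical helpers along
these lines (again, made-up names, just to illustrate the shape of the
queries):

    /* Hypothetical scheduler-facing queries; not existing DPDK API. */
    #include <stdbool.h>

    bool rte_lcore_is_smt_sibling(unsigned int lcore_a, unsigned int lcore_b);
    bool rte_lcore_shares_llc(unsigned int lcore_a, unsigned int lcore_b);

    /* Fill 'lcores' (capacity 'n') with all lcores sharing the L2
     * cache of 'lcore'; return the total count. */
    unsigned int rte_lcore_get_l2_peers(unsigned int lcore,
                                        unsigned int *lcores,
                                        unsigned int n);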


I mean, what you have is a topology, with nodes of different types and with
different properties, and you want to present it to the user.
Let me be clear: what we want via DPDK is to help customers use a
unified API which works across multiple platforms.
Example: let a vendor have 2 products, namely A and B. CPU-A has all
cores within the same sub-NUMA domain, while CPU-B has its cores split
into 2 sub-NUMA domains based on a split LLC.
When `rte_get_next_lcore_extnd` is invoked for `LLC` on
1. CPU-A: it returns all cores, as there is no split
2. CPU-B: it returns the cores from the specific sub-NUMA domain which is partitioned by L3
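
As a sketch of how an application might consume this, assuming the
hypothetical prototype above and that the L3 flag selects lcores
sharing the LLC of the lcore passed in (which is how I read the
description in this thread):

    /* Hypothetical usage: visit every enabled lcore sharing the
     * caller's LLC. On CPU-A this walks all cores; on CPU-B only
     * the lcores in the caller's L3 partition. */
    #include <stdio.h>
    #include <rte_lcore.h>

    static void
    walk_llc_peers(void)
    {
        unsigned int i;

        for (i = rte_get_next_lcore_extnd(rte_lcore_id(), 0, 0,
                                          RTE_GET_LCORE_L3);
             i < RTE_MAX_LCORE;
             i = rte_get_next_lcore_extnd(i, 0, 0, RTE_GET_LCORE_L3))
            printf("lcore %u shares my LLC\n", i);
    }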


I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :)

My gut feeling is to make it more explicit and forget about <rte_lcore.h>. <rte_hwtopo.h>? It could and should still be part of EAL.


In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style,
and what I have in mind is something DOM-like.

What use case do you have in mind? What's on top of my list is a scenario
where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure
out how best to make use of them.
Exactly.

  It's not going to "skip" (ignore, leave unused)
SMT siblings, or skip non-boosted cores, it would just try to be clever in
regards to which cores to use for what purpose.
Let me try to share my idea on SMT siblings. When `rte_get_next_lcore_extnd`
is invoked with the `L1 | SMT` flag and an `lcore`, the API first identifies
whether the given lcore is part of the enabled core list.
If yes, it programmatically identifies the sibling thread, either via
`sysfs` or via the hwloc library (I shared the concern about hwloc versions
on distros; will recheck again), and returns it.
If there is no sibling thread available under DPDK, it will fetch the next
lcore (probably lcore + 1).
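
For the sysfs path, a minimal sketch of the sibling lookup could look
like the following; the sysfs file is the standard Linux topology
interface, while the helper name and error handling are made up:

    /* Minimal sketch: find one SMT sibling of 'cpu' via the standard
     * Linux sysfs topology files. Returns -1 if none is exposed. */
    #include <stdio.h>

    static int
    smt_sibling_of(unsigned int cpu)
    {
        char path[128];
        unsigned int a, b;
        FILE *f;
        int n;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (f == NULL)
            return -1;
        /* Typical formats: "a,b" or "a-b"; with SMT off, just "a". */
        n = fscanf(f, "%u%*[,-]%u", &a, &b);
        fclose(f);
        if (n != 2)
            return -1;
        return (int)(a == cpu ? b : a);
    }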


Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project.


This is AMD EPYC SoC agnostic and tries to address all the generic cases.
Please do let us know if we (Ferruh and myself) can sync up via a call.

Sure, I can do that.

Let me sync with Ferruh and get a time slot for internal sync.


Can this be opened to the rest of the community? This is a common problem
that needs to be solved for multiple architectures. I would be interested in
attending.
Thank you Mattias. At the DPDK Bangkok Summit 2024 we did bring this up,
and as per the suggestion from Thomas and Jerrin we brought the RFC up
for discussion.
For DPDK Montreal 2024, Keesang and (most likely) Ferruh are travelling
to the summit and will present this as a talk to get things moving.



<snipped>
