On 2024-09-11 05:26, Varghese, Vipin wrote:

<snipped>


On 2024-09-09 16:22, Varghese, Vipin wrote:

<snipped>

<snipped>

Thank you Mattias for the comments and questions; please let me try
to explain below.

Shouldn't we have a separate CPU/cache hierarchy API instead?

The intention is to bring in CPU lcores which share the same L3
(for better cache hits and less noisy neighbours), so the current API
focuses on the Last Level Cache. But if the suggestion is `there are
SoCs where the L2 cache is also shared, and the new API should provision
for that`, I am also comfortable with that thought.


Rather than some AMD special case API hacked into <rte_lcore.h>, I
think we are better off with no DPDK API at all for this kind of functionality.

Hi Mattias, as shared in the earlier email thread, this is not an AMD special
case at all. Let me try to explain this one more time. One technique used to
increase core counts in a cost-effective way is to build the SoC out of tiles
of compute complexes. This introduces groups of cores sharing the same Last
Level Cache (namely L2, L3 or even L4), depending upon the cache topology of
the architecture.

The API suggested in the RFC is to help end users selectively use cores under
the same Last Level Cache hierarchy, as advertised by the OS (irrespective of
the BIOS settings used). This is useful in both bare-metal and container
environments.


I'm pretty familiar with AMD CPUs and the use of tiles (including the
challenges these kinds of non-uniformities pose for work scheduling).

To maximize performance, caring about core<->LLC relationship may well not
be enough, and more HT/core/cache/memory topology information is
required. That's what I meant by special case. A proper API should allow
access to information about which lcores are SMT siblings, cores on the same
L2, and cores on the same L3, to name a few things. Probably you want to fit
NUMA into the same API as well, although that is available already in
<rte_lcore.h>.

Thank you Mattias for the information. As shared in the reply to Anatoly,
we want to expose a new API `rte_get_next_lcore_ex` which takes an extra
argument `u32 flags`.
The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3,
RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
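
For illustration only, a minimal sketch of what such an iterator could look
like, assuming it mirrors the existing rte_get_next_lcore() signature with
the extra flags argument and returns RTE_MAX_LCORE when exhausted (the
prototype and flag value below are assumptions, not a committed API):

#include <stdio.h>
#include <stdint.h>
#include <rte_lcore.h>

/*
 * Hypothetical prototype and flag value, mirroring rte_get_next_lcore()
 * plus the extra 'flags' argument; neither is a committed DPDK API yet.
 */
#define RTE_GET_LCORE_L2 (1u << 1)	/* only lcores sharing L2 with 'i' */

unsigned int rte_get_next_lcore_ex(unsigned int i, int skip_main, int wrap,
				   uint32_t flags);

/* Walk all worker lcores that share an L2 cache with the current lcore. */
static void
walk_l2_siblings(void)
{
	unsigned int lcore;

	for (lcore = rte_get_next_lcore_ex(rte_lcore_id(), 1, 0, RTE_GET_LCORE_L2);
	     lcore < RTE_MAX_LCORE;
	     lcore = rte_get_next_lcore_ex(lcore, 1, 0, RTE_GET_LCORE_L2))
		printf("lcore %u shares L2 with lcore %u\n", lcore, rte_lcore_id());
}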


Wouldn't that API be pretty awkward to use?

I mean, what you have is a topology, with nodes of different types and with different properties, and you want to present it to the user.

In a sense, it's similar to XML and DOM versus SAX. The above is SAX-style, and what I have in mind is something DOM-like.

What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how to best make use of them. It's not going to "skip" (ignore, leave unused) SMT siblings, or skip non-boosted cores; it would just try to be clever with regard to which cores to use for what purpose.
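
For illustration, a rough sketch of what such a DOM-like view could mean in
practice: the application receives a navigable topology object instead of
iterating with flags. All names below are hypothetical and exist neither in
DPDK today nor in the proposal under discussion:

#include <rte_lcore.h>	/* for RTE_MAX_LCORE */

/*
 * Hypothetical DOM-like view: the application asks once for a topology
 * object and then walks it, deciding which lcores to use for what,
 * instead of repeatedly calling a flag-based iterator.
 */
struct lcore_domain {
	enum { DOMAIN_SMT, DOMAIN_L2, DOMAIN_L3, DOMAIN_NUMA } level;
	unsigned int nb_lcores;
	unsigned int lcores[RTE_MAX_LCORE];	/* lcores in this domain */
	unsigned int nb_children;
	struct lcore_domain *children;		/* e.g. L3 -> L2 -> SMT */
};

/* Hypothetical entry point returning the root of the topology tree. */
struct lcore_domain *lcore_topology_get(void);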

This is agnostic of the AMD EPYC SoC and tries to address all the generic cases.

Please do let us know if we (Ferruh & myself) can sync up via a call.


Sure, I can do that.


One can have a look at how scheduling domains work in the Linux kernel.
They model this kind of thing.

As shared in the response to the cover letter, +1 to expanding it to more
than just LLC cores. We have also confirmed the same at
https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.vargh...@amd.com/


A DPDK CPU/memory hierarchy topology API very much makes sense, but
it should be reasonably generic and complete from the start.


Could potentially be built on the 'hwloc' library.

There are 3 reasons we did not explore this path on AMD SoCs:

1. depending on the hwloc version and the kernel version, certain SoC
hierarchies are not available

2. CPU NUMA and IO (memory & PCIe) NUMA are independent on the AMD
EPYC SoC.

3. it adds an extra library dependency that has to be made available
for this to work.


Hence we have tried to use the Linux-documented generic `sysfs
CPU cache` interface.
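
For reference, a minimal sketch of reading that interface directly, assuming
index3 maps to the LLC on the system at hand (in practice the index-to-level
mapping should be read from the neighbouring 'level' file):

#include <stdio.h>

/*
 * Print which CPUs share a cache with cpu0, as reported by the kernel.
 * index3 is typically the L3 on x86, but the index-to-level mapping
 * should really be confirmed via the 'level' file in the same directory.
 */
int main(void)
{
	char buf[256];
	FILE *f;

	f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list", "r");
	if (f == NULL)
		return 1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		printf("CPUs sharing this cache with cpu0: %s", buf);
	fclose(f);
	return 0;
}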

I will try to explore hwloc more and check whether other libraries
within DPDK leverage the same.


I much agree cache/core topology may be of interest to the
application (or a work scheduler, like a DPDK event device), but
it's not limited to the LLC. It may well be worthwhile to care about
which cores share the L2 cache, for example. I'm not sure the
RTE_LCORE_FOREACH_* approach scales.

Yes, totally understood; on some SoCs, multiple lcores share the same L2 cache.


Can we rework the API to be rte_get_cache_<function>, where the user
argument is the desired cache-level index (a rough sketch of the idea
follows this list)?

1. index-1: SMT threads

2. index-2: threads sharing the same L2 cache

3. index-3: threads sharing the same L3 cache

4. index-MAX: threads sharing the last level cache.
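
For illustration only, one possible shape for such an index-based query;
the function name, the index constants and the return convention are
assumptions for discussion, not a settled proposal:

#include <stdint.h>

/*
 * Hypothetical cache-level selectors matching the list above; names,
 * values and the function below are illustrative only.
 */
#define RTE_CACHE_LEVEL_SMT	1		/* SMT siblings (shared L1) */
#define RTE_CACHE_LEVEL_L2	2		/* lcores sharing the same L2 */
#define RTE_CACHE_LEVEL_L3	3		/* lcores sharing the same L3 */
#define RTE_CACHE_LEVEL_MAX	UINT32_MAX	/* lcores sharing the LLC */

/*
 * Fill 'lcores' with up to 'n' lcores that share the cache at 'level'
 * with 'lcore_id'; return the number written, or a negative errno.
 */
int rte_get_cache_lcores(unsigned int lcore_id, uint32_t level,
			 unsigned int lcores[], unsigned int n);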


< Function: Purpose >
---------------------
    - rte_get_llc_first_lcores: retrieves the first lcore of each shared LLC.
    - rte_get_llc_lcore: retrieves all lcores that share the LLC.
    - rte_get_llc_n_lcore: retrieves the first n lcores, or skips the first n lcores, in the shared LLC.

< MACRO: Purpose >
------------------
RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first worker lcore of each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based on a hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the LLC while skipping the first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through the first `n` lcores of each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then iterates through the remaining lcores in each LLC.

The MACROs are simple wrappers invoking the appropriate API. Can
this be worked out in this fashion?
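
As a usage illustration only, assuming the proposed
RTE_LCORE_FOREACH_LLC_FIRST_WORKER macro were implemented, an application
could launch one worker per LLC domain like this (worker_main is a
hypothetical application-defined function):

#include <rte_launch.h>
#include <rte_lcore.h>

extern int worker_main(void *arg);	/* application-defined worker loop */

/*
 * Launch one worker on the first worker lcore of each LLC, so every
 * cache domain gets exactly one polling thread. The FOREACH macro is
 * from the proposal above and does not exist in DPDK today.
 */
static void
launch_one_worker_per_llc(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore_id)
		rte_eal_remote_launch(worker_main, NULL, lcore_id);
}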

<snipped>
