<snipped>
> On 2024-09-09 16:22, Varghese, Vipin wrote:
> >
> > <snipped>
> >
> >>> <snipped>
> >>>
> >>> Thank you Mattias for the comments and question; please let me try
> >>> to explain the same below.
> >>>
> >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> >>>
> >>> Based on the intention to bring in CPU lcores which share the same L3
> >>> (for better cache hits and fewer noisy neighbors), the current API
> >>> focuses on using the Last Level Cache. But if the suggestion is `there
> >>> are SoCs where the L2 cache is also shared, and the new API should be
> >>> provisioned`, I am also comfortable with that thought.
> >>>
> >>
> >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> >> think we are better off with no DPDK API at all for this kind of
> >> functionality.
> >
> > Hi Mattias, as shared in the earlier email thread, this is not an AMD
> > special case at all. Let me try to explain this one more time. One of
> > the techniques used to increase core count in a cost-effective way is
> > to build the SoC from tiles of compute complexes. This introduces
> > groups of cores sharing the same Last Level Cache (namely L2, L3 or
> > even L4) depending upon the cache topology architecture.
> >
> > The API suggested in the RFC is to help end users selectively use
> > cores under the same Last Level Cache hierarchy as advertised by the
> > OS (irrespective of the BIOS settings used). This is useful in both
> > bare-metal and container environments.
> >
>
> I'm pretty familiar with AMD CPUs and the use of tiles (including the
> challenges these kinds of non-uniformities pose for work scheduling).
>
> To maximize performance, caring about the core<->LLC relationship may
> well not be enough, and more HT/core/cache/memory topology information
> is required. That's what I meant by special case. A proper API should
> allow access to information about which lcores are SMT siblings, cores
> on the same L2, and cores on the same L3, to name a few things.
> Probably you want to fit NUMA into the same API as well, although that
> is available already in <rte_lcore.h>.

Thank you Mattias for the information. As shared in the reply to Anatoly, we
want to expose a new API `rte_get_next_lcore_ex` which takes an extra
argument `u32 flags`. The flags can be RTE_GET_LCORE_L1 (SMT),
RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED and
RTE_GET_LCORE_BOOST_DISABLED. This is AMD EPYC SoC agnostic and tries to
address all the generic cases.
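To make this concrete, below is a rough sketch of the intended usage. The
signature is assumed to mirror today's `rte_get_next_lcore`, and
`worker_main` is just an application-defined worker stub; none of this is a
merged DPDK API, and all names remain subject to rework:

    #include <rte_launch.h>
    #include <rte_lcore.h>

    /* Application-defined worker (stub for illustration only). */
    static int
    worker_main(void *arg)
    {
            (void)arg;
            return 0;
    }

    /*
     * Proposed iterator (assumption: mirrors rte_get_next_lcore() with an
     * extra flags argument; not part of any released DPDK):
     *
     * unsigned int rte_get_next_lcore_ex(unsigned int i, int skip_main,
     *                                    int wrap, uint32_t flags);
     */
    static void
    launch_on_l3_siblings(void)
    {
            /* Walk worker lcores grouped by shared L3. */
            unsigned int lcore = rte_get_next_lcore_ex(-1, 1, 0,
                                                       RTE_GET_LCORE_L3);

            while (lcore < RTE_MAX_LCORE) {
                    rte_eal_remote_launch(worker_main, NULL, lcore);
                    lcore = rte_get_next_lcore_ex(lcore, 1, 0,
                                                  RTE_GET_LCORE_L3);
            }
    }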
Please do let us know if we (Ferruh & myself) can sync up via a call?

> One can have a look at how scheduling domains work in the Linux kernel.
> They model this kind of thing.
>
> > As shared in the response to the cover letter, +1 to expand it to more
> > than just LLC cores. We have also confirmed the same in
> > https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.vargh...@amd.com/
> >
> >>
> >> A DPDK CPU/memory hierarchy topology API very much makes sense, but
> >> it should be reasonably generic and complete from the start.
> >>
> >>>>
> >>>> Could potentially be built on the 'hwloc' library.
> >>>
> >>> There are 3 reasons why we did not explore this path on AMD SoCs:
> >>>
> >>> 1. depending on the hwloc version and the kernel version, certain SoC
> >>> hierarchies are not available
> >>>
> >>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD EPYC SoC
> >>>
> >>> 3. it adds an extra library dependency layer which has to be made
> >>> available for this to work
> >>>
> >>> Hence we have tried to use the Linux-documented generic layer of
> >>> `sysfs CPU cache`.
> >>>
> >>> I will try to explore more on hwloc and check if other libraries
> >>> within DPDK leverage the same.
> >>>
> >>>>
> >>>> I much agree cache/core topology may be of interest of the
> >>>> application (or a work scheduler, like a DPDK event device), but
> >>>> it's not limited to LLC. It may well be worthwhile to care about
> >>>> which cores share L2 cache, for example. Not sure the
> >>>> RTE_LCORE_FOREACH_* approach scales.
> >>>
> >>> Yes, totally understood: on some SoCs multiple lcores share the same
> >>> L2 cache.
> >>>
> >>> Can we rework the API to be rte_get_cache_<function>, where the user
> >>> argument is the desired cache-level index?
> >>>
> >>> 1. index-1: SMT threads
> >>>
> >>> 2. index-2: threads sharing the same L2 cache
> >>>
> >>> 3. index-3: threads sharing the same L3 cache
> >>>
> >>> 4. index-MAX: identify the threads sharing the last level cache
> >>>
> >>>>
> >>>>> < Function: Purpose >
> >>>>> ---------------------
> >>>>> - rte_get_llc_first_lcores: retrieves all the first lcores in the
> >>>>> shared LLC.
> >>>>> - rte_get_llc_lcore: retrieves all lcores that share the LLC.
> >>>>> - rte_get_llc_n_lcore: retrieves the first n, or skips the first n,
> >>>>> lcores in the shared LLC.
> >>>>>
> >>>>> < MACRO: Purpose >
> >>>>> ------------------
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of
> >>>>> each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
> >>>>> worker lcore of each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based on
> >>>>> a hint (lcore id).
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the
> >>>>> LLC while skipping the first worker.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
> >>>>> from each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
> >>>>> then iterates through the remaining lcores in each LLC.
> >>>>>
> >>> While the MACROs are simple wrappers invoking the appropriate API,
> >>> can this be worked out in this fashion?
> >>>
> >>> <snipped>
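Coming back to the iterators quoted above from the RFC: a rough sketch of
how the per-LLC iteration could look from the application side is below.
The macro shape is assumed to mirror RTE_LCORE_FOREACH_WORKER from
<rte_lcore.h>, and `worker_main` is an application-defined stub; none of
these names are final:

    #include <rte_launch.h>
    #include <rte_lcore.h>

    /* Application-defined worker (stub for illustration only). */
    static int
    worker_main(void *arg)
    {
            (void)arg;
            return 0;
    }

    static void
    launch_one_worker_per_llc(void)
    {
            unsigned int lcore;

            /*
             * One polling thread per LLC (e.g. per L3 complex); the macro
             * is only the RFC proposal, assumed to take a single lcore-id
             * variable like RTE_LCORE_FOREACH_WORKER does.
             */
            RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore)
                    rte_eal_remote_launch(worker_main, NULL, lcore);
    }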