<snipped>
> On 2024-09-09 16:22, Varghese, Vipin wrote:
> >
> > <snipped>
> >
> >>> <snipped>
> >>>
> >>> Thank you Mattias for the comments and question; please let me try
> >>> to explain the same below.
> >>>
> >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> >>>
> >>> Based on the intention to bring in CPU lcores which share the same L3
> >>> (for better cache hits and fewer noisy neighbors), the current API
> >>> focuses on using the Last Level Cache. But if the suggestion is `there
> >>> are SoCs where the L2 cache is also shared, and the new API should be
> >>> provisioned`, I am also comfortable with that thought.
> >>>
> >>
> >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> >> think we are better off with no DPDK API at all for this kind of
> >> functionality.
> >
> > Hi Mattias, as shared in the earlier email thread, this is not an AMD
> > special case at all. Let me try to explain this one more time. One of
> > the techniques used to increase core count in a cost-effective way is
> > to build the SoC from tiles of compute complexes. This introduces
> > groups of cores sharing the same Last Level Cache (namely L2, L3 or
> > even L4) depending upon the cache topology architecture.
> >
> > The API suggested in the RFC is to help end users selectively use
> > cores under the same Last Level Cache hierarchy as advertised by the
> > OS (irrespective of the BIOS settings used). This is useful in both
> > bare-metal and container environments.
> >
>
> I'm pretty familiar with AMD CPUs and the use of tiles (including the
> challenges these kinds of non-uniformities pose for work scheduling).
>
> To maximize performance, caring about the core<->LLC relationship may
> well not be enough, and more HT/core/cache/memory topology information
> is required. That's what I meant by special case. A proper API should
> allow access to information about which lcores are SMT siblings, cores
> on the same L2, and cores on the same L3, to name a few things.
> Probably you want to fit NUMA into the same API as well, although that
> is available already in <rte_lcore.h>.

Thank you Mattias for the information. As shared in the reply to Anatoly, we
want to expose a new API `rte_get_next_lcore_ex` which takes an extra
argument `u32 flags`. The flags can be RTE_GET_LCORE_L1 (SMT),
RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED and
RTE_GET_LCORE_BOOST_DISABLED. This is AMD EPYC SoC agnostic and tries to
address all the generic cases.
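To make this concrete, below is a rough sketch of the intended usage. The
signature is assumed to mirror today's `rte_get_next_lcore`, and
`worker_main` is just an application-defined worker stub; none of this is a
merged DPDK API, and all names remain subject to rework:

    #include <rte_launch.h>
    #include <rte_lcore.h>

    /* Application-defined worker (stub for illustration only). */
    static int
    worker_main(void *arg)
    {
            (void)arg;
            return 0;
    }

    /*
     * Proposed iterator (assumption: mirrors rte_get_next_lcore() with an
     * extra flags argument; not part of any released DPDK):
     *
     * unsigned int rte_get_next_lcore_ex(unsigned int i, int skip_main,
     *                                    int wrap, uint32_t flags);
     */
    static void
    launch_on_l3_siblings(void)
    {
            /* Walk worker lcores grouped by shared L3. */
            unsigned int lcore = rte_get_next_lcore_ex(-1, 1, 0,
                                                       RTE_GET_LCORE_L3);

            while (lcore < RTE_MAX_LCORE) {
                    rte_eal_remote_launch(worker_main, NULL, lcore);
                    lcore = rte_get_next_lcore_ex(lcore, 1, 0,
                                                  RTE_GET_LCORE_L3);
            }
    }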
Please do let us know if we (Ferruh & myself) can sync up via a call?

> One can have a look at how scheduling domains work in the Linux kernel.
> They model this kind of thing.
>
> > As shared in the response to the cover letter, +1 to expand it to more
> > than just LLC cores. We have also confirmed the same in
> > https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.vargh...@amd.com/
> >
> >>
> >> A DPDK CPU/memory hierarchy topology API very much makes sense, but
> >> it should be reasonably generic and complete from the start.
> >>
> >>>>
> >>>> Could potentially be built on the 'hwloc' library.
> >>>
> >>> There are 3 reasons why we did not explore this path on AMD SoCs:
> >>>
> >>> 1. depending on the hwloc version and the kernel version, certain SoC
> >>> hierarchies are not available
> >>>
> >>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD EPYC SoC
> >>>
> >>> 3. it adds an extra library dependency layer which has to be made
> >>> available for this to work
> >>>
> >>> Hence we have tried to use the Linux-documented generic layer of
> >>> `sysfs CPU cache`.
> >>>
> >>> I will try to explore more on hwloc and check if other libraries
> >>> within DPDK leverage the same.
> >>>
> >>>>
> >>>> I much agree cache/core topology may be of interest of the
> >>>> application (or a work scheduler, like a DPDK event device), but
> >>>> it's not limited to LLC. It may well be worthwhile to care about
> >>>> which cores share L2 cache, for example. Not sure the
> >>>> RTE_LCORE_FOREACH_* approach scales.
> >>>
> >>> Yes, totally understood: on some SoCs multiple lcores share the same
> >>> L2 cache.
> >>>
> >>> Can we rework the API to be rte_get_cache_<function>, where the user
> >>> argument is the desired cache-level index?
> >>>
> >>> 1. index-1: SMT threads
> >>>
> >>> 2. index-2: threads sharing the same L2 cache
> >>>
> >>> 3. index-3: threads sharing the same L3 cache
> >>>
> >>> 4. index-MAX: identify the threads sharing the last level cache
> >>>
> >>>>
> >>>>> < Function: Purpose >
> >>>>> ---------------------
> >>>>> - rte_get_llc_first_lcores: retrieves all the first lcores in the
> >>>>> shared LLC.
> >>>>> - rte_get_llc_lcore: retrieves all lcores that share the LLC.
> >>>>> - rte_get_llc_n_lcore: retrieves the first n, or skips the first n,
> >>>>> lcores in the shared LLC.
> >>>>>
> >>>>> < MACRO: Purpose >
> >>>>> ------------------
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of
> >>>>> each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
> >>>>> worker lcore of each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based on
> >>>>> a hint (lcore id).
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the
> >>>>> LLC while skipping the first worker.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
> >>>>> from each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
> >>>>> then iterates through the remaining lcores in each LLC.
> >>>>>
> >>> While the MACROs are simple wrappers invoking the appropriate API,
> >>> can this be worked out in this fashion?
> >>>
> >>> <snipped>
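Coming back to the iterators quoted above from the RFC: a rough sketch of
how the per-LLC iteration could look from the application side is below.
The macro shape is assumed to mirror RTE_LCORE_FOREACH_WORKER from
<rte_lcore.h>, and `worker_main` is an application-defined stub; none of
these names are final:

    #include <rte_launch.h>
    #include <rte_lcore.h>

    /* Application-defined worker (stub for illustration only). */
    static int
    worker_main(void *arg)
    {
            (void)arg;
            return 0;
    }

    static void
    launch_one_worker_per_llc(void)
    {
            unsigned int lcore;

            /*
             * One polling thread per LLC (e.g. per L3 complex); the macro
             * is only the RFC proposal, assumed to take a single lcore-id
             * variable like RTE_LCORE_FOREACH_WORKER does.
             */
            RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore)
                    rte_eal_remote_launch(worker_main, NULL, lcore);
    }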