> On Sep 11, 2024, at 10:55 AM, Mattias Rönnblom <hof...@lysator.liu.se> wrote:
>
> On 2024-09-11 05:26, Varghese, Vipin wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>> <snipped>
>>>
>>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> <snipped>
>>>>
>>>>>> <snipped>
>>>>>>
>>>>>> Thank you Mattias for the comments and questions, please let me try
>>>>>> to explain the same below.
>>>>>>
>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>
>>>>>> Based on the intention to bring in CPU lcores which share the same L3
>>>>>> (for better cache hits and less noisy neighbors), the current API
>>>>>> focuses on using the Last Level Cache. But if the suggestion is
>>>>>> `there are SoCs where the L2 cache is also shared, and the new API
>>>>>> should be provisioned`, I am also comfortable with the thought.
>>>>>>
>>>>>
>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>>> think we are better off with no DPDK API at all for this kind of
>>>>> functionality.
>>>>
>>>> Hi Mattias, as shared in the earlier email thread, this is not an AMD
>>>> special case at all. Let me try to explain this one more time. One of
>>>> the techniques used to increase core counts in a cost-effective way is
>>>> to go for tiles of compute complexes. This introduces a bunch of cores
>>>> sharing the same Last Level Cache (namely L2, L3 or even L4), depending
>>>> upon the cache topology architecture.
>>>>
>>>> The API suggested in the RFC is to help end users selectively use cores
>>>> under the same Last Level Cache hierarchy as advertised by the OS
>>>> (irrespective of the BIOS settings used). This is useful in both
>>>> bare-metal and container environments.
>>>>
>>>
>>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>>> challenges these kinds of non-uniformities pose for work scheduling).
>>>
>>> To maximize performance, caring about the core<->LLC relationship may
>>> well not be enough, and more HT/core/cache/memory topology information
>>> is required. That's what I meant by special case. A proper API should
>>> allow access to information about which lcores are SMT siblings, cores
>>> on the same L2, and cores on the same L3, to name a few things. Probably
>>> you want to fit NUMA into the same API as well, although that is
>>> available already in <rte_lcore.h>.
>>
>> Thank you Mattias for the information. As shared in the reply to Anatoly,
>> we want to expose a new API `rte_get_next_lcore_ex` which takes an extra
>> argument `u32 flags`.
>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>> RTE_GET_LCORE_BOOST_DISABLED.
>
> Wouldn't that API be pretty awkward to use?
>
> I mean, what you have is a topology, with nodes of different types and
> with different properties, and you want to present it to the user.
>
> In a sense, it's similar to XML and DOM versus SAX. The above is
> SAX-style, and what I have in mind is something DOM-like.
>
> What use case do you have in mind? What's on top of my list is a scenario
> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to
> figure out how best to make use of them. It's not going to "skip"
> (ignore, leave unused) SMT siblings, or skip non-boosted cores; it would
> just try to be clever in regards to which cores to use for what purpose.
>
>> This is AMD EPYC SoC agnostic and tries to address all generic cases.
>> Please do let us know if we (Ferruh & myself) can sync up via a call?
>
> Sure, I can do that.
>

Can this be opened to the rest of the community? This is a common problem
that needs to be solved for multiple architectures. I would be interested
in attending.
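
Coming back to the `rte_get_next_lcore_ex` proposal quoted above, and as
possible input for such a call: below is a rough sketch of how I read the
flags-based variant. To be clear, this is a sketch of the proposal only;
neither the function nor the flags exist in DPDK today, and the signature
and flag value are my assumptions, modeled on the existing
rte_get_next_lcore().

/*
 * Sketch of the proposed API (not in DPDK today). Signature assumed to
 * mirror rte_get_next_lcore(i, skip_main, wrap), plus a flags word.
 */
#include <stdint.h>
#include <rte_lcore.h>

#define RTE_GET_LCORE_L3 (1u << 2) /* assumed flag: stay within one L3 domain */

unsigned int rte_get_next_lcore_ex(unsigned int i, int skip_main,
                                   int wrap, uint32_t flags);

static void
walk_llc_local_lcores(void)
{
        unsigned int lcore;

        /*
         * SAX-style pull iteration: the application asks for the next
         * lcore matching the filter, rather than inspecting a topology
         * tree (the DOM-style alternative discussed above).
         */
        for (lcore = rte_get_next_lcore_ex(-1, 1, 0, RTE_GET_LCORE_L3);
             lcore < RTE_MAX_LCORE;
             lcore = rte_get_next_lcore_ex(lcore, 1, 0, RTE_GET_LCORE_L3)) {
                /* e.g. rte_eal_remote_launch(worker_fn, NULL, lcore); */
        }
}

Whether such a flat iterator is expressive enough, versus a DOM-like
topology tree, seems like exactly the thing to settle on the call.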
>>>
>>> One can have a look at how scheduling domains work in the Linux kernel.
>>> They model this kind of thing.
>>>
>>>> As shared in the response to the cover letter, +1 to expanding it to
>>>> more than just LLC cores. We have also confirmed the same in
>>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.vargh...@amd.com/
>>>>
>>>>>
>>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>>> it should be reasonably generic and complete from the start.
>>>>>
>>>>>>>
>>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>>
>>>>>> There are 3 reasons we did not explore this path on AMD SoCs:
>>>>>>
>>>>>> 1. depending on the hwloc version and kernel version, certain SoC
>>>>>> hierarchies are not available
>>>>>>
>>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD EPYC
>>>>>> SoCs
>>>>>>
>>>>>> 3. it adds an extra library dependency that must be made available
>>>>>> for this to work
>>>>>>
>>>>>> Hence we have tried to use the Linux-documented generic layer of
>>>>>> `sysfs CPU cache`.
>>>>>>
>>>>>> I will try to explore more on hwloc and check if other libraries
>>>>>> within DPDK leverage the same.
>>>>>>
>>>>>>>
>>>>>>> I much agree cache/core topology may be of interest to the
>>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>>> it's not limited to the LLC. It may well be worthwhile to care
>>>>>>> about which cores share the L2 cache, for example. Not sure the
>>>>>>> RTE_LCORE_FOREACH_* approach scales.
>>>>>>
>>>>>> Yes, totally understood; on some SoCs, multiple lcores share the
>>>>>> same L2 cache.
>>>>>>
>>>>>> Can we rework the API to be rte_get_cache_<function>, where the user
>>>>>> argument is the desired lcore index?
>>>>>>
>>>>>> 1. index-1: SMT threads
>>>>>>
>>>>>> 2. index-2: threads sharing the same L2 cache
>>>>>>
>>>>>> 3. index-3: threads sharing the same L3 cache
>>>>>>
>>>>>> 4. index-MAX: identify the threads sharing the last level cache
>>>>>>
>>>>>>>
>>>>>>>> < Function: Purpose >
>>>>>>>> ---------------------
>>>>>>>> - rte_get_llc_first_lcores: retrieves all the first lcores in the
>>>>>>>> shared LLC.
>>>>>>>> - rte_get_llc_lcore: retrieves all lcores that share the LLC.
>>>>>>>> - rte_get_llc_n_lcore: retrieves the first n, or skips the first
>>>>>>>> n, lcores in the shared LLC.
>>>>>>>>
>>>>>>>> < MACRO: Purpose >
>>>>>>>> ------------------
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all the first lcores
>>>>>>>> from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all the first
>>>>>>>> worker lcores from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based
>>>>>>>> on a hint (lcore id).
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the
>>>>>>>> LLC while skipping the first worker.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>>>>>>>> from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
>>>>>>>> then iterates through the remaining lcores in each LLC.
>>>>>>>>
>>>>>> While the MACROs are simple wrappers invoking the appropriate API,
>>>>>> can this be worked out in this fashion?
>>>>>>
>>>>>> <snipped>
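
Two more sketches on the points quoted above, for concreteness. First, the
`sysfs CPU cache` route: Linux exposes which CPUs share each cache level
under /sys/devices/system/cpu/cpu<N>/cache/index<M>/. A minimal standalone
probe of my own (no DPDK or hwloc needed; index numbering varies by SoC,
so a real implementation would also read the `level` and `type` attributes
rather than assume index3 is the L3):

#include <stdio.h>

int main(void)
{
        char path[128];
        char buf[256];
        int idx;

        /* Walk cpu0's cache indexes until the sysfs nodes run out. */
        for (idx = 0; ; idx++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list",
                         idx);
                f = fopen(path, "r");
                if (f == NULL)
                        break; /* no more cache levels on this CPU */
                if (fgets(buf, sizeof(buf), f) != NULL)
                        printf("cpu0 index%d shared_cpu_list: %s", idx, buf);
                fclose(f);
        }
        return 0;
}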
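
Second, on the closing question: assuming the proposed getters land, each
FOREACH macro can indeed be a thin wrapper, in the same shape as the
existing RTE_LCORE_FOREACH. The getter name below is taken from the list
quoted above; its exact signature and return convention are my guesses:

#include <rte_lcore.h>

/*
 * Assumed contract: returns the first lcore of the LLC domain following
 * 'i' (-1 to start), or RTE_MAX_LCORE once every domain has been visited.
 */
unsigned int rte_get_llc_first_lcores(unsigned int i);

/* Thin wrapper, mirroring RTE_LCORE_FOREACH in <rte_lcore.h>. */
#define RTE_LCORE_FOREACH_LLC_FIRST(i)                          \
        for ((i) = rte_get_llc_first_lcores(-1);                \
             (i) < RTE_MAX_LCORE;                               \
             (i) = rte_get_llc_first_lcores(i))

If the getters are the primitives, the macros become pure convenience, so
the real API-design discussion is only about the getters themselves.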