> On Sep 11, 2024, at 10:55 AM, Mattias Rönnblom <hof...@lysator.liu.se> wrote:
> 
> On 2024-09-11 05:26, Varghese, Vipin wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>> <snipped>
>>> 
>>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>> 
>>>> <snipped>
>>>> 
>>>>>> <snipped>
>>>>>> 
>>>>>> Thank you Mattias for the comments and questions; please let me try
>>>>>> to explain below.
>>>>>> 
>>>>>>> Shouldn't we have a separate CPU/cache hierarchy API instead?
>>>>>> 
>>>>>> Based on the intention to bring in CPU lcores which share the same L3
>>>>>> (for better cache hits and fewer noisy neighbors), the current API
>>>>>> focuses on using the Last Level Cache. But if the suggestion is `there
>>>>>> are SoCs where the L2 cache is also shared, and the new API should be
>>>>>> provisioned for that`, I am also comfortable with the thought.
>>>>>> 
>>>>> 
>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>>> think we are better off with no DPDK API at all for this kind of 
>>>>> functionality.
>>>> 
>>>> Hi Mattias, as shared in the earlier email thread, this is not an AMD
>>>> special case at all. Let me try to explain this one more time. One
>>>> technique used to increase core count in a cost-effective way is to
>>>> build the SoC from tiles of compute complexes.
>>>> This introduces groups of cores sharing the same Last Level Cache
>>>> (namely L2, L3, or even L4), depending on the cache topology of the
>>>> architecture.
>>>> 
>>>> The API suggested in the RFC is to help end users selectively use cores
>>>> under the same Last Level Cache hierarchy as advertised by the OS
>>>> (irrespective of the BIOS settings used). This is useful in both
>>>> bare-metal and container environments.
>>>> 
>>> 
>>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>>> challenges these kinds of non-uniformities pose for work scheduling).
>>> 
>>> To maximize performance, caring about the core<->LLC relationship may well
>>> not be enough, and more HT/core/cache/memory topology information is
>>> required. That's what I meant by special case. A proper API should allow
>>> access to information about which lcores are SMT siblings, which cores are
>>> on the same L2, and which are on the same L3, to name a few things.
>>> Probably you want to fit NUMA into the same API as well, although that is
>>> available already in <rte_lcore.h>.
>> Thank you Mattias for the information. As shared in the reply to Anatoly,
>> we want to expose a new API `rte_get_next_lcore_ex` which takes an extra
>> argument `u32 flags`.
>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, 
>> RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
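>> 
>> A rough sketch of how this could look (the exact signature and flag
>> encoding are not finalized and are shown here only for illustration;
>> the parameters mirror the existing rte_get_next_lcore()):
>> 
>>     #include <stdint.h>
>> 
>>     /* Tentative flag values, one bit per selection criterion. */
>>     #define RTE_GET_LCORE_L1             (1u << 0) /* SMT siblings */
>>     #define RTE_GET_LCORE_L2             (1u << 1) /* same L2 cache */
>>     #define RTE_GET_LCORE_L3             (1u << 2) /* same L3 cache */
>>     #define RTE_GET_LCORE_BOOST_ENABLED  (1u << 3) /* boost-capable cores */
>>     #define RTE_GET_LCORE_BOOST_DISABLED (1u << 4) /* non-boosted cores */
>> 
>>     /* Like rte_get_next_lcore(), but only returns lcores matching
>>      * 'flags'; returns RTE_MAX_LCORE when iteration is exhausted. */
>>     unsigned int
>>     rte_get_next_lcore_ex(unsigned int i, int skip_main, int wrap,
>>                           uint32_t flags);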
> 
> Wouldn't that API be pretty awkward to use?
> 
> I mean, what you have is a topology, with nodes of different types and with 
> different properties, and you want to present it to the user.
> 
> In a sense, it's similar to XML and DOM versus SAX. The above is SAX-style,
> and what I have in mind is something DOM-like.
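> 
> To make the distinction concrete: a DOM-like API would hand the application
> an explicit topology tree it can walk, instead of an iterator parameterized
> with flags. A rough sketch (all names here are invented for illustration):
> 
>     enum rte_topo_level {
>         RTE_TOPO_MACHINE, RTE_TOPO_NUMA, RTE_TOPO_L3,
>         RTE_TOPO_L2, RTE_TOPO_CORE, RTE_TOPO_SMT
>     };
> 
>     struct rte_topo_node {
>         enum rte_topo_level level;
>         unsigned int nb_children;
>         struct rte_topo_node **children;
>         unsigned int lcore_id; /* valid only for leaf (SMT) nodes */
>     };
> 
>     /* The application recurses over the children to find, e.g.,
>      * all lcores under a given L3 node. */
>     struct rte_topo_node *rte_topo_get_root(void);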
> 
> What use case do you have in mind? What's on top of my list is a scenario
> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to
> figure out how best to make use of them. It's not going to "skip" (ignore,
> leave unused) SMT siblings, or skip non-boosted cores; it would just try to
> be clever with regard to which cores to use for what purpose.
> 
>> This is agnostic of the specific AMD EPYC SoC and tries to address all
>> generic cases.
>> Please do let us know if we (Ferruh & myself) can sync up via a call.
> 
> Sure, I can do that.
> 
Can this be opened to the rest of the community? This is a common problem that 
needs to be solved for multiple architectures. I would be interested in 
attending.

>>> 
>>> One can have a look at how scheduling domains work in the Linux kernel.
>>> They model this kind of thing.
>>> 
>>>> As shared in the response to the cover letter, +1 to expand it to more
>>>> than just LLC cores. We have also confirmed the same in
>>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.vargh...@amd.com/
>>>> 
>>>>> 
>>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>>> it should be reasonably generic and complete from the start.
>>>>> 
>>>>>>> 
>>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>> 
>>>>>> There are 3 reasons we did not explore this path on AMD SoCs:
>>>>>> 
>>>>>> 1. depending on the hwloc version and kernel version, certain SoC
>>>>>> hierarchies are not available
>>>>>> 
>>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>>>>> EPYC SoCs.
>>>>>> 
>>>>>> 3. it adds an extra library dependency layer that has to be made
>>>>>> available for this to work.
>>>>>> 
>>>>>> 
>>>>>> Hence we have tried to use the Linux-documented generic `sysfs` CPU
>>>>>> cache interface.
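>>>>>> 
>>>>>> For reference, a minimal sketch of the kind of sysfs lookup involved
>>>>>> (paths as documented in the kernel sysfs ABI; error handling trimmed):
>>>>>> 
>>>>>>     #include <stdio.h>
>>>>>> 
>>>>>>     /* Print the CPUs sharing cache instance 'index' with 'cpu';
>>>>>>      * e.g., index3 is typically the L3 on x86 systems. */
>>>>>>     void print_shared_cpus(unsigned int cpu, unsigned int index)
>>>>>>     {
>>>>>>         char path[128], buf[256];
>>>>>>         FILE *f;
>>>>>> 
>>>>>>         snprintf(path, sizeof(path),
>>>>>>             "/sys/devices/system/cpu/cpu%u/cache/index%u/shared_cpu_list",
>>>>>>             cpu, index);
>>>>>>         f = fopen(path, "r");
>>>>>>         if (f == NULL)
>>>>>>             return;
>>>>>>         if (fgets(buf, sizeof(buf), f) != NULL)
>>>>>>             printf("cpu%u index%u shared with: %s", cpu, index, buf);
>>>>>>         fclose(f);
>>>>>>     }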
>>>>>> 
>>>>>> I will try to explore hwloc further and check whether other libraries
>>>>>> within DPDK leverage it.
>>>>>> 
>>>>>>> 
>>>>>>> I much agree cache/core topology may be of interest to the
>>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>>> it's not limited to the LLC. It may well be worthwhile to care about
>>>>>>> which cores share an L2 cache, for example. Not sure the
>>>>>>> RTE_LCORE_FOREACH_* approach scales.
>>>>>> 
>>>>>> Yes, totally understood; on some SoCs, multiple lcores share the same
>>>>>> L2 cache.
>>>>>> 
>>>>>> 
>>>>>> Can we rework the API to be rte_get_cache_<function>, where the user
>>>>>> argument is the desired index (cache level)? A tentative sketch
>>>>>> follows the list.
>>>>>> 
>>>>>> 1. index-1: SMT threads
>>>>>> 
>>>>>> 2. index-2: threads sharing the same L2 cache
>>>>>> 
>>>>>> 3. index-3: threads sharing the same L3 cache
>>>>>> 
>>>>>> 4. index-MAX: threads sharing the last level cache.
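>>>>>> 
>>>>>> A tentative shape for such an API (the name and signature are invented
>>>>>> here only to illustrate the idea):
>>>>>> 
>>>>>>     /* Fill 'lcores' with the lcores that share the cache selected by
>>>>>>      * 'index' (1 = SMT, 2 = L2, 3 = L3, UINT_MAX = LLC) with 'lcore';
>>>>>>      * returns the number of entries written, at most 'max'. */
>>>>>>     unsigned int
>>>>>>     rte_get_cache_lcores(unsigned int lcore, unsigned int index,
>>>>>>                          unsigned int *lcores, unsigned int max);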
>>>>>> 
>>>>>>> 
>>>>>>>> < Function: Purpose >
>>>>>>>> ---------------------
>>>>>>>>    - rte_get_llc_first_lcores: Retrieves all the first lcores in
>>>>>>>> the shared LLC.
>>>>>>>>    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>>>>    - rte_get_llc_n_lcore: Retrieves the first n or skips the first
>>>>>>>> n lcores in the shared LLC.
>>>>>>>> 
>>>>>>>> < MACRO: Purpose >
>>>>>>>> ------------------
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of
>>>>>>>> each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
>>>>>>>> worker lcore of each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from an LLC based on
>>>>>>>> a hint (lcore id).
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from an
>>>>>>>> LLC while skipping the first worker.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>>>>>>>> from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
>>>>>>>> then iterates through the remaining lcores in each LLC.
>>>>>>>> 
>>>>>> The MACROs are simple wrappers invoking the appropriate API. Can this
>>>>>> be worked out in this fashion? For example, see the sketch below.
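>>>>>> 
>>>>>> One wrapper could mirror RTE_LCORE_FOREACH(); the underlying
>>>>>> rte_get_next_llc_first_lcore() getter is an assumed name:
>>>>>> 
>>>>>>     /* Hypothetical wrapper: steps over the first lcore of each LLC,
>>>>>>      * the way RTE_LCORE_FOREACH() wraps rte_get_next_lcore(). */
>>>>>>     #define RTE_LCORE_FOREACH_LLC_FIRST(i)                    \
>>>>>>         for ((i) = rte_get_next_llc_first_lcore(-1, 0, 0);    \
>>>>>>              (i) < RTE_MAX_LCORE;                             \
>>>>>>              (i) = rte_get_next_llc_first_lcore(i, 0, 0))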
>>>>>> 
>>>>>> <snipped>
