On 2024-09-12 11:17, Bruce Richardson wrote:
On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
[Public]

<snipped>

> > > > <snipped>
> > > >
> > > >>> <snipped>
> > > >>>
> > > >>> Thank you Mattias for the comments and question, please let me
> > > >>> try to explain the same below
> > > >>>
> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > > >>>
> > > >>> Based on the intention to bring in CPU lcores which share same
> > > >>> L3 (for better cache hits and less noisy neighbor) current API
> > > >>> focuses on using Last Level Cache. But if the suggestion is
> > > >>> `there are SoC where L2 cache are also shared, and the new API
> > > >>> should be provisioned`, I am also comfortable with the thought.
> > > >>>
> > > >>
> > > >> Rather than some AMD special case API hacked into <rte_lcore.h>,
> > > >> I think we are better off with no DPDK API at all for this kind
> > > >> of functionality.
> > > >
> > > > Hi Mattias, as shared in the earlier email thread, this is not a
> > > > AMD special case at all. Let me try to explain this one more
> > > > time. One of techniques used to increase cores cost effective way
> > > > to go for tiles of compute complexes.
> > > > This introduces a bunch of cores in sharing same Last Level Cache
> > > > (namely L2, L3 or even L4) depending upon cache topology
> > > > architecture.
> > > >
> > > > The API suggested in RFC is to help end users to selectively use
> > > > cores under same Last Level Cache Hierarchy as advertised by OS
> > > > (irrespective of the BIOS settings used). This is useful in both
> > > > bare-metal and container environment.
> > > >
> > >
> > > I'm pretty familiar with AMD CPUs and the use of tiles (including
> > > the challenges these kinds of non-uniformities pose for work
> > > scheduling).
> > >
> > > To maximize performance, caring about core<->LLC relationship may
> > > well not be enough, and more HT/core/cache/memory topology
> > > information is required.
> > > That's what I meant by special case. A proper API should allow
> > > access to information about which lcores are SMT siblings, cores
> > > on the same L2, and cores on the same L3, to name a few things.
> > > Probably you want to fit NUMA into the same API as well, although
> > > that is available already in <rte_lcore.h>.
> >
> > Thank you Mattias for the information, as shared by in the reply
> > with Anatoly we want expose a new API `rte_get_next_lcore_ex` which
> > intakes a extra argument `u32 flags`.
> >
> > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> > RTE_GET_LCORE_BOOST_DISABLED.
> >
>
> For the naming, would "rte_get_next_sibling_core" (or lcore if you
> prefer) be a clearer name than just adding "ex" on to the end of the
> existing function?

Thank you Bruce, please find my answer below.

Functions shared as per the RFC were

```
- rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
- rte_get_llc_lcore: Retrieves all lcores that share the LLC.
- rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
```

MACROs extending the usability were

```
RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through remaining lcores in each LLC.
```

Based on the discussions we agreed on sharing a version-2 RFC extending the API as `rte_get_next_lcore_extnd` with an extra argument `flags`.

As per my ideation, the API `rte_get_next_sibling_core` can easily be covered by the above API with the flag `RTE_GET_LCORE_L1` (SMT). Is this the right understanding?

We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which allow iterating over SMT sibling threads.

This seems like a lot of new macro and API additions! I'd really like to cut that back and simplify the amount of new things we are adding to DPDK for this. I tend to agree with others that external libs would be better for apps that really want to deal with all this.
Conveying HW topology will require a fair bit of API verbiage. I think there's no way around it, other than giving the API user half of the story (or 1% of the story).
That's one of the reasons I think it should be in a separate header file in EAL.
> Looking logically, I'm not sure about the BOOST_ENABLED and
> BOOST_DISABLED flags you propose

The idea for BOOST_ENABLED & BOOST_DISABLED is based on the DPDK power library, which allows boost to be enabled. It allows the user to select lcores where BOOST is enabled|disabled using a MACRO or API.

> - in a system with multiple possible standard and boost frequencies
> what would those correspond to?

I now understand the confusion, apologies for mixing the AMD EPYC SoC boost with Intel Turbo. Thank you for pointing it out, we will use the terminology `RTE_GET_LCORE_TURBO`.

That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache. What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are currently boosted? Do we need additions for cores that are boosted by 100 MHz vs, say, 300 MHz? What about cores that are in lower frequencies for power-saving? Do we add macros for finding those?
In my world, the operating frequency is a property of a CPU core node in the hardware topology.
lcore discrimination (or classification) shouldn't be built as a myriad of FOREACH macros, but rather generic iteration + app domain logic.
For example, the size of the L3 could be a factor. Should we have a FOREACH_BIG_L3? No.
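To make "generic iteration + app domain logic" concrete, here is a minimal sketch in plain C. Everything in it is hypothetical and for illustration only: `struct sketch_core`, its fields, the mock core table and `sketch_next_core()` are not DPDK APIs and are not taken from the RFC. The point is only that one generic iterator plus an application-supplied predicate can stand in for a family of FOREACH macros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core properties; a real API would expose whatever
 * the topology description actually carries. */
struct sketch_core {
	uint16_t l3_id;        /* which L3 the core sits under */
	uint32_t l3_size_kib;  /* size of that L3 */
	uint32_t max_mhz;      /* frequency as a property of the core node */
};

/* Mock data standing in for a discovered topology (made-up values). */
static const struct sketch_core sketch_cores[] = {
	{ 0, 32768, 3500 },
	{ 0, 32768, 3500 },
	{ 1, 98304, 3200 },
	{ 1, 98304, 3200 },
};
#define SKETCH_NUM_CORES (sizeof(sketch_cores) / sizeof(sketch_cores[0]))

/* Application-supplied classification rule. */
typedef bool (*sketch_core_pred)(const struct sketch_core *core, void *arg);

/* Generic iteration: return the index of the next core after 'prev'
 * for which 'pred' holds, or -1 when there are no more. Start with
 * prev = -1. */
static int
sketch_next_core(int prev, sketch_core_pred pred, void *arg)
{
	for (size_t i = (size_t)(prev + 1); i < SKETCH_NUM_CORES; i++)
		if (pred(&sketch_cores[i], arg))
			return (int)i;
	return -1;
}

/* App domain logic: "big L3" is defined by the app, not by the library. */
static bool
sketch_l3_at_least(const struct sketch_core *core, void *arg)
{
	return core->l3_size_kib >= *(const uint32_t *)arg;
}
```

An application that cares about L3 size would call `sketch_next_core(-1, sketch_l3_at_least, &min_kib)` in a loop; one that cares about frequency would pass a different predicate, with no new library macro needed in either case.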
> What's also missing is a define for getting actual NUMA siblings i.e.
> those sharing common memory but not an L3 or anything else.

This can be extended into `rte_get_next_lcore_extnd` with flag `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same sub-memory NUMA as shared by LCORE. If SMT sibling is enabled and the DPDK lcore mask covers the sibling threads, then `RTE_GET_LCORE_NUMA` gets all lcores and sibling threads under the same memory NUMA as the given lcore.

Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist. My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find its nearest nodes.
I think we need to agree on what the purpose of this API is. Is it to describe the hardware topology in some detail for general-purpose use (including informing the operator, lstopo-style), or just to provide some abstract, simplified representation to be used purely for work scheduling?
After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away.

> My suggestion would be to have the function take just an integer-type
> e.g. uint16_t parameter which defines the memory/cache hierarchy level
> to use, 0 being lowest, 1 next, and so on. Different systems may have
> different numbers of cache levels so lets just make it a zero-based
> index of levels, rather than giving explicit defines (except for
> memory which should probably always be last). The zero-level will be
> for "closest neighbour"

Good idea, we did prototype this internally. But the issue is that it will keep adding to the number of APIs in the lcore library. To keep the API count low, we are using the lcore id as a hint to the sub-NUMA.

I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above. My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a second API to get the next sibling at a given nearness level from 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.

Overall, though, as I say above, let's focus on the problem the app actually wants these APIs for, not how we think we should solve it. Apps don't want to know the topology for knowledge's sake, they want to use that knowledge to improve performance by pinning tasks to cores. What is the minimum that we need to provide to enable the app to do that? For example, if there are no lcores that share an L1, then from an app topology viewpoint that L1 level may as well not exist, because it provides us no details on how to place our work.
For the rare app that does have some esoteric use-case that does actually want to know some intricate details of the topology, then having that app use an external lib is probably a better solution than us trying to cover all possible options in DPDK.

My 2c. on this at this stage anyway.

/Bruce
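For reference, the two-call interface suggested above (one call for the number of nearness levels, one for the next sibling at a given level) could look roughly like the C sketch below. This is only an illustration under stated assumptions: the function names, the `SKETCH_LCORE_NONE` sentinel and the 8-lcore mock topology are all made up here and are not existing DPDK APIs or part of the RFC. The sketch also assumes that a level includes everything nearer than it, which is one possible choice among several.

```c
#include <stdint.h>

#define SKETCH_MAX_LCORE  8
#define SKETCH_LCORE_NONE UINT16_MAX

/* Hypothetical mock topology: sketch_group[level][lcore] is the id of
 * the group the lcore belongs to at that nearness level. Only levels
 * that actually group lcores together are listed, so a level that
 * would tell the app nothing simply does not appear. */
static const uint8_t sketch_group[][SKETCH_MAX_LCORE] = {
	{ 0, 0, 1, 1, 2, 2, 3, 3 },  /* level 0: nearest (e.g. SMT pair) */
	{ 0, 0, 0, 0, 1, 1, 1, 1 },  /* level 1: e.g. shared L3 */
	{ 0, 0, 0, 0, 0, 0, 0, 0 },  /* level 2: furthest (same memory) */
};

/* First call: how many nearness levels exist on this system. */
static uint16_t
sketch_topo_num_levels(void)
{
	return sizeof(sketch_group) / sizeof(sketch_group[0]);
}

/* Second call: next lcore after 'prev' that is in the same group as
 * 'lcore' at 'level'. Pass SKETCH_LCORE_NONE as 'prev' to start;
 * SKETCH_LCORE_NONE is returned when there are no more siblings or
 * the arguments are out of range. */
static uint16_t
sketch_topo_next_sibling(uint16_t lcore, uint16_t level, uint16_t prev)
{
	uint16_t i;

	if (lcore >= SKETCH_MAX_LCORE || level >= sketch_topo_num_levels())
		return SKETCH_LCORE_NONE;

	i = (prev == SKETCH_LCORE_NONE) ? 0 : (uint16_t)(prev + 1);
	for (; i < SKETCH_MAX_LCORE; i++) {
		if (i != lcore &&
		    sketch_group[level][i] == sketch_group[level][lcore])
			return i;
	}
	return SKETCH_LCORE_NONE;
}
```

An app placing a helper thread would start at level 0 and move outward until it finds a free sibling, without needing to know whether a given level corresponds to an L1, L2, L3 or a memory node.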

