On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
> [Public]
>
> <snipped>
>
> > > > > <snipped>
> > > > >
> > > > > >>> <snipped>
> > > > > >>>
> > > > > >>> Thank you Mattias for the comments and question, please let me
> > > > > >>> try to explain the same below
> > > > > >>>
> > > > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > > > > >>>
> > > > > >>> Based on the intention to bring in CPU lcores which share same
> > > > > >>> L3 (for better cache hits and less noisy neighbor) current API
> > > > > >>> focuses on using
> > > > > >>>
> > > > > >>> Last Level Cache. But if the suggestion is `there are SoC where
> > > > > >>> L2 cache are also shared, and the new API should be
> > > > > >>> provisioned`, I am also
> > > > > >>>
> > > > > >>> comfortable with the thought.
> > > > > >>>
> > > > > >>
> > > > > >> Rather than some AMD special case API hacked into <rte_lcore.h>,
> > > > > >> I think we are better off with no DPDK API at all for this kind of
> > > > > >> functionality.
> > > > > >
> > > > > > Hi Mattias, as shared in the earlier email thread, this is not a
> > > > > > AMD special case at all. Let me try to explain this one more time. One of
> > > > > > techniques used to increase cores cost effective way to go for tiles of
> > > > > > compute complexes.
> > > > > > This introduces a bunch of cores in sharing same Last Level Cache
> > > > > > (namely L2, L3 or even L4) depending upon cache topology architecture.
> > > > > >
> > > > > > The API suggested in RFC is to help end users to selectively use
> > > > > > cores under same Last Level Cache Hierarchy as advertised by OS
> > > > > > (irrespective of the BIOS settings used). This is useful in both
> > > > > > bare-metal and container environment.
> > > > > >
> > > > >
> > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including
> > > > > the challenges these kinds of non-uniformities pose for work scheduling).
> > > > >
> > > > > To maximize performance, caring about core<->LLC relationship may
> > > > > well not be enough, and more HT/core/cache/memory topology
> > > > > information is required. That's what I meant by special case. A
> > > > > proper API should allow access to information about which lcores are
> > > > > SMT siblings, cores on the same L2, and cores on the same L3, to
> > > > > name a few things. Probably you want to fit NUMA into the same API
> > > > > as well, although that is available already in <rte_lcore.h>.
> > > >
> > > > Thank you Mattias for the information, as shared by in the reply with
> > > > Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a
> > > > extra argument `u32 flags`.
> > > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> > > > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> > > > RTE_GET_LCORE_BOOST_DISABLED.
> > > >
> > >
> > > For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) be a
> > > clearer name than just adding "ex" on to the end of the existing function?
> > Thank you Bruce, Please find my answer below
> >
> > Functions shared as per the RFC were
> > ```
> > - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
> > - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
> > - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
> > ```
> >
> > MACRO’s extending the usability were
> > ```
> > RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
> > RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
> > RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
> > RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
> > RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
> > RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through remaining lcores in each LLC.
> > ```
> >
> > Based on the discussions we agreed on sharing version-2 RFC for extending API as `rte_get_next_lcore_extnd` with extra argument as `flags`.
> > As per my ideation, for the API `rte_get_next_sibling_core`, the above API can easily with flag `RTE_GET_LCORE_L1 (SMT)`. Is this right understanding?
> > We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which allows to iterate SMT sibling threads.
> >
This seems like a lot of new macro and API additions! I'd really like to
cut that back and simplify the amount of new things we are adding to DPDK
for this. I tend to agree with others that external libs would be better
for apps that really want to deal with all this.

> >
> > > Looking logically, I'm not sure about the BOOST_ENABLED and
> > > BOOST_DISABLED flags you propose
> > The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost.
> > Allow user to select lcores where BOOST is enabled|disabled using MACRO or API.
> >
> > > - in a system with multiple possible
> > > standard and boost frequencies what would those correspond to?
> > I now understand the confusion, apologies for mixing the AMD EPYC SoC boost with Intel Turbo.
> >
> > Thank you for pointing out, we will use the terminology `RTE_GET_LCORE_TURBO`.
> >

That still doesn't clarify it for me. If you start mixing power
management related functions in with topology ones, things will turn into
a real headache. What does boost or turbo correspond to? Is it for cores
that have the feature enabled - whether or not it's currently in use - or
is it for finding cores that are currently boosted? Do we need additions
for cores that are boosted by 100 MHz vs, say, 300 MHz? What about cores
that are in lower frequencies for power-saving? Do we add macros for
finding those?

> > > What's also
> > > missing is a define for getting actual NUMA siblings i.e. those sharing common
> > > memory but not an L3 or anything else.
> > This can be extended into `rte_get_next_lcore_extnd` with flag `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same sub-memory NUMA as shared by LCORE.
> > If SMT sibling is enabled and DPDK Lcore mask covers the sibling threads, then `RTE_GET_LCORE_NUMA` get all lcore and sibling threads under same memory NUMA of lcore shared.
> >

Yes. That can work.
But it means we are basing the implementation on a fixed idea of
what topologies there are or can exist. My suggestion below is just to
ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find
its nearest nodes.

After all, the app doesn't want to know the topology just for the sake of
knowing it - it wants it to ensure best placement of work on cores! To that
end, it just needs to know what cores are near to each other and what are
far away.

> >
> > >
> > > My suggestion would be to have the function take just an integer-type e.g.
> > > uint16_t parameter which defines the memory/cache hierarchy level to use, 0
> > > being lowest, 1 next, and so on. Different systems may have different numbers
> > > of cache levels so let's just make it a zero-based index of levels, rather than
> > > giving explicit defines (except for memory which should probably always be
> > > last). The zero-level will be for "closest neighbour"
> > Good idea, we did prototype this internally. But issue it will keep on adding the number of API into lcore library.
> > To keep the API count less, we are using lcore id as hint to sub-NUMA.

I'm unclear about this keeping the API count down - you are proposing a lot
of APIs and macros up above. My suggestion is basically to add two APIs and
no macros: one API to get the max number of topology-nearness levels, and a
second API to get the next sibling at a given nearness level from
0 (nearest) to N (furthest). If we want, we can add a FOREACH macro too.

Overall, though, as I say above, let's focus on the problem the app
actually wants these APIs for, not how we think we should solve it. Apps
don't want to know the topology for knowledge's sake, they want to use that
knowledge to improve performance by pinning tasks to cores. What is the
minimum that we need to provide to enable the app to do that?
For example, if there are no lcores that share an L1, then from an app
topology viewpoint that L1 level may as well not exist, because it
provides us no details on how to place our work.

For the rare app that does have some esoteric use-case that does actually
want to know some intricate details of the topology, then having that app
use an external lib is probably a better solution than us trying to cover
all possible options in DPDK.

My 2c. on this at this stage anyway.

/Bruce
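
To make the two-API suggestion above a bit more concrete, here is a rough
sketch in plain C. Everything in it is an assumption for illustration only:
the names topo_max_levels(), topo_next_sibling() and topo_count_siblings()
are made up and are not existing DPDK functions, and the hard-coded 8-lcore
table stands in for whatever the OS would actually report. In this sketch a
level includes everything nearer than it, so asking for level 1 also
returns the level-0 sibling.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_LCORES 8
#define NUM_LEVELS 3

/* Mock topology (assumption): group_id[level][lcore]. Two lcores are
 * siblings at a level when they carry the same group id at that level. */
static const int group_id[NUM_LEVELS][NUM_LCORES] = {
    {0, 0, 1, 1, 2, 2, 3, 3},  /* level 0: nearest neighbours */
    {0, 0, 0, 0, 1, 1, 1, 1},  /* level 1: next-nearest grouping */
    {0, 0, 0, 0, 0, 0, 0, 0},  /* level 2: furthest, e.g. common memory */
};

/* Hypothetical API 1: number of nearness levels on this system. */
static unsigned int topo_max_levels(void)
{
    return NUM_LEVELS;
}

/* Hypothetical API 2: next lcore after 'prev' that is a sibling of 'ref'
 * at 'level'. Pass prev = -1 to start; returns -1 when there are no more
 * siblings or when an argument is out of range. */
static int topo_next_sibling(int ref, unsigned int level, int prev)
{
    if (ref < 0 || ref >= NUM_LCORES || level >= NUM_LEVELS || prev < -1)
        return -1;
    for (int i = prev + 1; i < NUM_LCORES; i++) {
        if (i != ref && group_id[level][i] == group_id[level][ref])
            return i;
    }
    return -1;
}

/* Usage pattern: walk every sibling of 'ref' at 'level' in increasing
 * lcore id order, print each one, and return how many were found. */
static unsigned int topo_count_siblings(int ref, unsigned int level)
{
    unsigned int n = 0;

    assert(level < topo_max_levels()); /* caller bug if violated */
    for (int s = topo_next_sibling(ref, level, -1); s >= 0;
         s = topo_next_sibling(ref, level, s)) {
        printf("lcore %d: level-%u sibling %d\n", ref, level, s);
        n++;
    }
    return n;
}
```

An app would loop from level 0 up to topo_max_levels() - 1 and stop at the
first level that yields enough cores for its workers. A level at which no
lcores share anything would simply not be reported, so the app never sees
"L1" or "L3" labels at all, only nearer and further.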