[Public] Snipped
> > Based on the discussions we agreed on sharing version-2 RFC for
> > extending API as `rte_get_next_lcore_extnd` with extra argument as
> > `flags`.
> >
> > As per my ideation, for the API `rte_get_next_sibling_core`, the above
> > API can easily cover it with flag `RTE_GET_LCORE_L1` (SMT). Is this right
> > understanding?
> >
> > We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
> > allows to iterate SMT sibling threads.
> >
>
> This seems like a lot of new macro and API additions! I'd really like to cut
> that back and simplify the amount of new things we are adding to DPDK for this.

I disagree, Bruce. As per the latest conversation with Anatoly and you, the new APIs that were shared are

```
1. rte_get_next_lcore_extnd
2. rte_get_next_n_lcore_extnd
```

I also mentioned that custom macros can augment these for typical flag usage, similar to `RTE_LCORE_FOREACH` and `RTE_LCORE_FOREACH_WORKER`, as

```
RTE_LCORE_FOREACH_FLAG
RTE_LCORE_FOREACH_WORKER_FLAG

Or

RTE_LCORE_FOREACH_LLC
RTE_LCORE_FOREACH_WORKER_LLC
```

Please note I have not even shared version-2 of the RFC yet.

> I tend to agree with others that external libs would be better for apps that
> really want to deal with all this.

I have covered why this is not a good idea in my reply to Mattias's query.

> > > > > Looking logically, I'm not sure about the BOOST_ENABLED and
> > > > > BOOST_DISABLED flags you propose
> > The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power
> > library which allows to enable boost.
> > Allow user to select lcores where BOOST is enabled|disabled using MACRO
> > or API.

Maybe there is confusion, so let me try to be explicit here. The intention of any `rte_get_next_lcore##` is to fetch lcores. Hence the newly proposed API `rte_get_next_lcore_extnd` with the `flag set for Boost` fetches lcores where boost is enabled. There is no intention to enable or disable boost on an lcore with a `get` API.
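
To make the read-only intent concrete, here is a minimal sketch of a flag-filtered "get next" over a mock topology table. Everything in it is an assumption for illustration only: the `SKETCH_*` flag names, the `sketch_get_next_lcore_extnd` signature, the table layout and the choice to include the reference lcore in the results are hypothetical and are not the v2 RFC or any existing DPDK API.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical flags for this sketch only; not existing DPDK definitions. */
#define SKETCH_LCORE_L1    (1u << 0) /* same L1 as the reference lcore (SMT sibling) */
#define SKETCH_LCORE_L3    (1u << 1) /* same L3 (LLC) as the reference lcore */
#define SKETCH_LCORE_NUMA  (1u << 2) /* same memory NUMA node as the reference lcore */
#define SKETCH_LCORE_BOOST (1u << 3) /* boost is enabled on the lcore */

#define SKETCH_MAX_LCORE 8u

struct sketch_lcore {
	uint8_t l1_id;
	uint8_t l3_id;
	uint8_t numa_id;
	uint8_t boost_enabled;
};

/* Mock topology: 2 SMT threads per core, 2 cores per L3, one NUMA node. */
static const struct sketch_lcore sketch_topo[SKETCH_MAX_LCORE] = {
	{0, 0, 0, 1}, {0, 0, 0, 1}, {1, 0, 0, 0}, {1, 0, 0, 0},
	{2, 1, 0, 1}, {2, 1, 0, 1}, {3, 1, 0, 0}, {3, 1, 0, 0},
};

/* An lcore matches when it agrees with `ref` on every domain named in `flags`. */
static int
sketch_lcore_matches(unsigned int i, unsigned int ref, uint32_t flags)
{
	const struct sketch_lcore *a = &sketch_topo[i];
	const struct sketch_lcore *b = &sketch_topo[ref];

	if ((flags & SKETCH_LCORE_L1) && a->l1_id != b->l1_id)
		return 0;
	if ((flags & SKETCH_LCORE_L3) && a->l3_id != b->l3_id)
		return 0;
	if ((flags & SKETCH_LCORE_NUMA) && a->numa_id != b->numa_id)
		return 0;
	/* Pure filter: the boost state is only read, never changed. */
	if ((flags & SKETCH_LCORE_BOOST) && !a->boost_enabled)
		return 0;
	return 1;
}

/*
 * Return the first matching lcore after `i`, or SKETCH_MAX_LCORE when none
 * is left. Passing UINT_MAX starts the scan at lcore 0 (unsigned wrap-around).
 * The reference lcore itself is reported when it matches.
 */
static unsigned int
sketch_get_next_lcore_extnd(unsigned int i, unsigned int ref, uint32_t flags)
{
	assert(ref < SKETCH_MAX_LCORE);
	for (i++; i < SKETCH_MAX_LCORE; i++)
		if (sketch_lcore_matches(i, ref, flags))
			return i;
	return SKETCH_MAX_LCORE;
}

/* Hypothetical FOREACH wrapper in the style of RTE_LCORE_FOREACH. */
#define SKETCH_LCORE_FOREACH_FLAG(i, ref, flags) \
	for ((i) = sketch_get_next_lcore_extnd(UINT_MAX, (ref), (flags)); \
	     (i) < SKETCH_MAX_LCORE; \
	     (i) = sketch_get_next_lcore_extnd((i), (ref), (flags)))

/* Count the lcores selected by `flags` relative to `ref`. */
static unsigned int
sketch_count_lcores(unsigned int ref, uint32_t flags)
{
	unsigned int i, n = 0;

	SKETCH_LCORE_FOREACH_FLAG(i, ref, flags)
		n++;
	return n;
}
```

In this sketch the topology flags compare against the reference lcore, while the boost flag only filters on the per-lcore state, so combining `SKETCH_LCORE_L3 | SKETCH_LCORE_BOOST` selects the boost-enabled lcores within the reference lcore's L3 domain without touching any power setting.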
> > > > > - in a system with multiple possible
> > > > > standard and boost frequencies what would those correspond to?
> >
> > I now understand the confusion, apologies for mixing the AMD EPYC SoC
> > boost with Intel Turbo.
> >
> > Thank you for pointing out, we will use the terminology
> > `RTE_GET_LCORE_TURBO`.
> >
>
> That still doesn't clarify it for me. If you start mixing in power management
> related functions in with topology ones things will turn into a real headache.

Can you please tell me what is not clarified? DPDK lcores as of today have no notion of cache, NUMA, power, turbo or any other DPDK-supported feature. The initial API was introduced to expose lcores sharing the same Last Level Cache. Based on the interaction with Anatoly, extending this to support multiple features turned out to be a possibility. Hence, we said we can share v2 of the RFC based on this idea.

But if the ask is not to include TURBO, I am also ok with that. Let us keep only the cache and NUMA-IO domains.

> What does boost or turbo correspond to? Is it for cores that have the feature
> enabled - whether or not it's currently in use - or is it for finding cores
> that are currently boosted? Do we need additions for cores that are boosted by
> 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for
> power-saving. Do we add macros for finding those?

Why are we talking about freq-up and freq-down? This was not discussed in this RFC patch at all.

> > > > > What's also
> > > > > missing is a define for getting actual NUMA siblings i.e. those
> > > > > sharing common memory but not an L3 or anything else.
> >
> > This can be extended into `rte_get_next_lcore_extnd` with flag
> > `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
> > sub-memory NUMA as shared by LCORE.
> >
> > If SMT sibling is enabled and DPDK Lcore mask covers the sibling
> > threads, then `RTE_GET_LCORE_NUMA` get all lcore and sibling threads
> > under same memory NUMA of lcore shared.
> >
>
> Yes. That can work. But it means we are basing the implementation on a fixed
> idea of what topologies there are or can exist.
> My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA -
> just give the app a way to find it's nearest nodes.

Bruce, the architecture implementation differs across vendor SoCs. Let me share what I know:

1. using L1, we can fetch SMT threads
2. using L2, we can get certain SoCs on Arm, Intel and PowerPC which are like efficient cores
3. using L3, we can get certain SoCs like AMD, AF64x and others which follow a chiplet or tile split L3 domain.

> After all, the app doesn't want to know the topology just for the sake of
> knowing it - it wants it to ensure best placement of work on cores! To that
> end, it just needs to know what cores are near to each other and what are far
> away.

Exactly, that is why we want to minimize new libraries and stick to the format of the existing API `rte_get_next_lcore`. Otherwise the end user needs to deploy another library or an external library, then map its output to the DPDK lcore mapping to identify what is where.

So as an end user I prefer a simple API which gets my work done.

> > > > > My suggestion would be to have the function take just an integer-type e.g.
> > > > > uint16_t parameter which defines the memory/cache hierarchy level to use, 0
> > > > > being lowest, 1 next, and so on. Different systems may have different numbers
> > > > > of cache levels so lets just make it a zero-based index of levels, rather than
> > > > > giving explicit defines (except for memory which should probably always be
> > > > > last). The zero-level will be for "closest neighbour"
> >
> > Good idea, we did prototype this internally. But issue it will keep on
> > adding the number of API into lcore library.
> >
> > To keep the API count less, we are using lcore id as hint to sub-NUMA.
> >
>
> I'm unclear about this keeping the API count down - you are proposing a lot
> of APIs and macros up above.

No, I am not. As I shared, based on the last discussion with Anatoly we will end up with only 2 APIs in lcore. This is explained in the response above.

> My suggestion is basically to add two APIs and no macros: one API to get the
> max number of topology-nearness levels, and a
> second API to get the next sibling a given nearness level from
> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
>
> Overall, though, as I say above, let's focus on the problem the app actually
> wants these APIs for, not how we think we should solve it. Apps don't want to
> know the topology for knowledge sake, they want to use that knowledge to
> improve performance by pinning tasks to cores. What is the minimum that we
> need to provide to enable the app to do that? For example, if there are no
> lcores that share an L1, then from an app topology viewpoint that L1 level may
> as well not exist, because it provides us no details on how to place our work.

I have shared above why we need vendor agnostic L1, L2, L3 and sub-NUMA-IO.

Snipped
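
For comparison, the two-API shape suggested in the quoted text (one call for the number of nearness levels, one for the next sibling at a given level) could be sketched roughly as below. This is only an illustration over a mock table: the `sk_*` names, the table and the rule that a hardware level shared by no lcores is hidden are assumptions taken from the wording of the suggestion, not an agreed design or an existing DPDK API.

```c
#include <limits.h>
#include <stdint.h>

#define SK_NUM_LCORE 4u
#define SK_HW_LEVELS 3u /* mock hardware levels: L1, L3, memory */

/* Mock domain ids per hardware level; SMT is off, so no lcores share an L1. */
static const uint8_t sk_domain[SK_HW_LEVELS][SK_NUM_LCORE] = {
	{0, 1, 2, 3}, /* L1: private to each lcore */
	{0, 0, 1, 1}, /* L3: two lcores per L3 */
	{0, 0, 0, 0}, /* memory: one NUMA node */
};

/* A hardware level is useful for placement only if some lcores share it. */
static int
sk_level_is_shared(unsigned int hw)
{
	unsigned int a, b;

	for (a = 0; a < SK_NUM_LCORE; a++)
		for (b = a + 1; b < SK_NUM_LCORE; b++)
			if (sk_domain[hw][a] == sk_domain[hw][b])
				return 1;
	return 0;
}

/* Hypothetical API 1: number of nearness levels visible to the app. */
static unsigned int
sk_nearness_levels(void)
{
	unsigned int hw, n = 0;

	for (hw = 0; hw < SK_HW_LEVELS; hw++)
		n += sk_level_is_shared(hw) ? 1u : 0u;
	return n;
}

/* Map an app-visible level (0 = nearest) to a hardware level index. */
static unsigned int
sk_hw_level(unsigned int level)
{
	unsigned int hw;

	for (hw = 0; hw < SK_HW_LEVELS; hw++) {
		if (!sk_level_is_shared(hw))
			continue;
		if (level == 0)
			return hw;
		level--;
	}
	return SK_HW_LEVELS; /* no such level */
}

/*
 * Hypothetical API 2: next sibling of `ref` after `i` at the given nearness
 * level, excluding `ref` itself. Returns SK_NUM_LCORE when none is left.
 * Passing UINT_MAX starts the scan at lcore 0 (unsigned wrap-around).
 */
static unsigned int
sk_get_next_lcore_at_level(unsigned int i, unsigned int ref, unsigned int level)
{
	unsigned int hw = sk_hw_level(level);

	if (hw == SK_HW_LEVELS || ref >= SK_NUM_LCORE)
		return SK_NUM_LCORE;
	for (i++; i < SK_NUM_LCORE; i++)
		if (i != ref && sk_domain[hw][i] == sk_domain[hw][ref])
			return i;
	return SK_NUM_LCORE;
}
```

With this mock table the private L1 level is dropped, so level 0 resolves to L3 sharing and level 1 to memory sharing. The trade-off discussed in this thread shows up directly: the caller sees only "near" and "far" and cannot ask for L1, L2 or L3 by name.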