[Public]
Snipped
Based on the discussions, we agreed on sharing a version-2 RFC that
extends the API as `rte_get_next_lcore_extnd`, with an extra argument
`flags`.
As per my ideation, the API `rte_get_next_sibling_core` can easily be
covered by the above API with the flag `RTE_GET_LCORE_L1` (SMT). Is this
the right understanding?
We can easily have simple macros like `RTE_LCORE_FOREACH_L1`, which
allow iterating over SMT sibling threads.
This seems like a lot of new macro and API additions! I'd really like to cut
that back and simplify the amount of new things we are adding to DPDK for this.
I disagree, Bruce. As per the latest conversation with Anatoly and you, the
new APIs that were shared are
```
1. rte_get_next_lcore_extnd
2. rte_get_next_n_lcore_extnd
```
What I mentioned is that custom macros can augment these based on typical flag
usage, similar to `RTE_LCORE_FOREACH` and `RTE_LCORE_FOREACH_WORKER`, as
```
RTE_LCORE_FOREACH_FLAG
RTE_LCORE_FOREACH_WORKER_FLAG
Or
RTE_LCORE_FOREACH_LLC
RTE_LCORE_FOREACH_WORKER_LLC
```
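For illustration only, here is a minimal sketch of how a flag-based iterator and a FOREACH-style macro on top of it could fit together. Everything in it is an assumption made for the example: the `sk_` names, the signature, the flag values and the mock topology table are hypothetical and are not the RFC code or an existing DPDK API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch; not the RFC implementation and not a DPDK API. */
#define SK_MAX_LCORE 8u

/* Assumed flags: which domain must be shared with the reference lcore. */
enum sk_lcore_flag {
	SK_GET_LCORE_L1   = 1u << 0, /* SMT siblings (shared L1) */
	SK_GET_LCORE_L3   = 1u << 1, /* shared last level cache */
	SK_GET_LCORE_NUMA = 1u << 2, /* shared memory NUMA node */
};

struct sk_lcore_info {
	uint8_t enabled; /* 1 if the lcore is in the (mock) lcore mask */
	uint8_t l1_id;
	uint8_t l3_id;
	uint8_t numa_id;
};

/* Mock topology: 4 cores x 2 threads, 2 L3 domains, 1 NUMA node.
 * lcore 3 is deliberately left out of the lcore mask. */
static const struct sk_lcore_info sk_topo[SK_MAX_LCORE] = {
	{1, 0, 0, 0}, {1, 0, 0, 0}, {1, 1, 0, 0}, {0, 1, 0, 0},
	{1, 2, 1, 0}, {1, 2, 1, 0}, {1, 3, 1, 0}, {1, 3, 1, 0},
};

static int sk_same_domain(unsigned int a, unsigned int b, uint32_t flag)
{
	assert(a < SK_MAX_LCORE && b < SK_MAX_LCORE);
	switch (flag) {
	case SK_GET_LCORE_L1:   return sk_topo[a].l1_id == sk_topo[b].l1_id;
	case SK_GET_LCORE_L3:   return sk_topo[a].l3_id == sk_topo[b].l3_id;
	case SK_GET_LCORE_NUMA: return sk_topo[a].numa_id == sk_topo[b].numa_id;
	default:                return 0;
	}
}

/* Return the next enabled lcore after `i` that shares the `flag` domain
 * with `ref` (the reference lcore itself is included), or SK_MAX_LCORE
 * when there is none. Pass (unsigned int)-1 as `i` to start. */
static unsigned int
sk_get_next_lcore_extnd(unsigned int i, unsigned int ref, uint32_t flag)
{
	assert(ref < SK_MAX_LCORE);
	for (i++; i < SK_MAX_LCORE; i++) {
		if (sk_topo[i].enabled && sk_same_domain(ref, i, flag))
			return i;
	}
	return SK_MAX_LCORE;
}

/* FOREACH-style wrapper, modelled on the shape of RTE_LCORE_FOREACH. */
#define SK_LCORE_FOREACH_FLAG(i, ref, flag)                                   \
	for ((i) = sk_get_next_lcore_extnd((unsigned int)-1, (ref), (flag));  \
	     (i) < SK_MAX_LCORE;                                              \
	     (i) = sk_get_next_lcore_extnd((i), (ref), (flag)))

/* Usage: count the enabled lcores sharing a domain with `ref`. */
static unsigned int sk_count(unsigned int ref, uint32_t flag)
{
	unsigned int i, n = 0;

	SK_LCORE_FOREACH_FLAG(i, ref, flag)
		n++;
	return n;
}
```

In this mock, the lcore mask is honoured by the iterator itself, so an SMT sibling that is not in the mask (lcore 3) is never returned, whichever flag is used.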
Please note I have not even shared version-2 of the RFC yet.
I tend to agree with others that external libs would be better for apps that
really want to deal with all this.
I have covered why this is not a good idea in my response to Mattias's query.
>
> Looking logically, I'm not sure about the BOOST_ENABLED and
> BOOST_DISABLED flags you propose
The idea for BOOST_ENABLED and BOOST_DISABLED is based on the DPDK power
library, which allows enabling boost.
The flags allow the user to select lcores where boost is enabled or disabled,
using a macro or an API.
Maybe there is some confusion, so let me try to be explicit here. The
intention of any `rte_get_next_lcore##` is to fetch lcores.
Hence the newly proposed API `rte_get_next_lcore_extnd`, with the flag set for
boost, fetches lcores where boost is enabled.
There is no intention to enable or disable boost on an lcore with a `get` API.
> - in a system with multiple possible
> standard and boost frequencies what would those correspond to?
I now understand the confusion; apologies for mixing AMD EPYC SoC boost
with Intel Turbo.
Thank you for pointing this out. We will use the terminology
`RTE_GET_LCORE_TURBO`.
That still doesn't clarify it for me. If you start mixing in power management
related functions in with topology ones things will turn into a real headache.
Can you please tell me what is not clarified? DPDK lcores as of today have no
notion of cache, NUMA, power, turbo or any other DPDK supported feature.
The APIs initially introduced were meant to expose lcores sharing the same Last
Level Cache. Based on the interaction with Anatoly, extending this to support
multiple features turned out to be a possibility.
Hence, we said we can share v2 of the RFC based on this idea.
But if the ask is to leave out TURBO, I am OK with that too. Let us keep only
the cache and NUMA-IO domains.
What does boost or turbo correspond to? Is it for cores that have the feature
enabled - whether or not it's currently in use - or is it for finding cores
that are currently boosted? Do we need additions for cores that are boosted by
100MHz vs say 300MHz? What about cores that are in lower frequencies for
power-saving? Do we add macros for finding those?
Why are we talking about freq-up and freq-down? This was not even discussed in
this RFC patch at all.
> What's also missing is a define for getting actual NUMA siblings i.e.
> those sharing common memory but not an L3 or anything else.
This can be covered by `rte_get_next_lcore_extnd` with the flag
`RTE_GET_LCORE_NUMA`. This will allow grabbing all lcores under the same
memory sub-NUMA domain as the given lcore.
If SMT is enabled and the DPDK lcore mask covers the sibling threads, then
`RTE_GET_LCORE_NUMA` gets all lcores, including sibling threads, under the
same memory NUMA domain as the given lcore.
Yes. That can work. But it means we are basing the implementation on a fixed
idea of what topologies there are or can exist.
My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just
give the app a way to find its nearest nodes.
Bruce, the architecture is implemented differently across vendor SoCs. Let me
share what I know:
1. using L1, we can fetch SMT threads
2. using L2, we can get the cores of certain SoCs on Arm, Intel and PowerPC,
such as efficient cores
3. using L3, we can get the cores of certain SoCs like AMD, AF64x and others
which follow a chiplet or tile design with split L3 domains.
After all, the app doesn't want to know the topology just for the sake of
knowing it - it wants it to ensure best placement of work on cores! To that
end, it just needs to know what cores are near to each other and what are far
away.
Exactly, that is why we want to minimize new libraries and stay within the
format of the existing API `rte_get_next_lcore`. Otherwise the end user needs
to deploy another library or an external library, and then map its output to
DPDK lcores to identify what is where.
So as an end user I prefer a simple API which gets my work done.
>
> My suggestion would be to have the function take just an integer-type
> e.g. uint16_t parameter which defines the memory/cache hierarchy level to
> use, 0 being lowest, 1 next, and so on. Different systems may have
> different numbers of cache levels so let's just make it a zero-based index
> of levels, rather than giving explicit defines (except for memory which
> should probably always be last). The zero-level will be for "closest
> neighbour"
Good idea, we did prototype this internally. But the issue is that it keeps
increasing the number of APIs in the lcore library.
To keep the API count low, we are using the lcore id as a hint to the sub-NUMA
domain.
I'm unclear about this keeping the API count down - you are proposing a lot of
APIs and macros up above.
No, I am not. As I have shared, based on the last discussion with Anatoly we
will end up with only 2 APIs in lcore. This is explained in the response above.
My suggestion is basically to add two APIs and no macros: one API to get the
max number of topology-nearness levels, and a second API to get the next
sibling at a given nearness level from 0(nearest)..N(furthest). If we want, we
can also add a FOREACH macro too.
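As a rough illustration of the two-API shape described above, here is a minimal sketch over a mock domain table. The `nl_` names, the signatures and the table contents are all hypothetical. The sketch also assumes that the siblings at a given level include those already reachable at nearer levels, which the text above does not specify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch; names and signatures are not from DPDK. */
#define NL_NB_LCORE 8u
#define NL_NB_LEVEL 3u

/* Mock data: domain id of every lcore at each nearness level.
 * Level 0 is nearest (here: same core), the last level is shared memory. */
static const uint8_t nl_domain[NL_NB_LEVEL][NL_NB_LCORE] = {
	{0, 0, 1, 1, 2, 2, 3, 3}, /* level 0: SMT pairs */
	{0, 0, 0, 0, 1, 1, 1, 1}, /* level 1: shared cache */
	{0, 0, 0, 0, 0, 0, 0, 0}, /* level 2: shared memory */
};

/* API 1: number of topology-nearness levels on this (mock) system. */
static unsigned int nl_max_levels(void)
{
	return NL_NB_LEVEL;
}

/* API 2: next lcore after `prev` that shares a domain with `ref` at
 * `level`, excluding `ref` itself. Pass (unsigned int)-1 as `prev` to
 * start; NL_NB_LCORE is returned when there are no more siblings. */
static unsigned int
nl_next_sibling(unsigned int ref, unsigned int level, unsigned int prev)
{
	unsigned int i;

	assert(ref < NL_NB_LCORE && level < NL_NB_LEVEL);
	for (i = prev + 1; i < NL_NB_LCORE; i++) {
		if (i != ref && nl_domain[level][i] == nl_domain[level][ref])
			return i;
	}
	return NL_NB_LCORE;
}

/* Usage: find the nearest level at which `ref` has any sibling at all,
 * or -1 if it has none at any level. */
static int nl_nearest_level_with_sibling(unsigned int ref)
{
	unsigned int lvl;

	for (lvl = 0; lvl < nl_max_levels(); lvl++) {
		if (nl_next_sibling(ref, lvl, (unsigned int)-1) < NL_NB_LCORE)
			return (int)lvl;
	}
	return -1;
}
```

With this shape the application never names L1, L2, L3 or NUMA; it only walks levels from 0 upwards until it finds enough nearby lcores.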
Overall, though, as I say above, let's focus on the problem the app actually
wants these APIs for, not how we think we should solve it. Apps don't want to
know the topology for knowledge's sake; they want to use that knowledge to
improve performance by pinning tasks to cores. What is the minimum that we
need to provide to enable the app to do that? For example, if there are no
lcores that share an L1, then from an app topology viewpoint that L1 level may
as well not exist, because it provides us no details on how to place our work.
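The L1 example above can be sketched as a small pruning step: a level at which no two lcores share a domain carries no placement information and can be dropped before levels are exposed to the app. The `pr_` names and the mock table (SMT off, so every lcore has its own L1) are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch; names and data are illustrative only. */
#define PR_NB_LCORE 4u
#define PR_NB_RAW_LEVEL 3u

/* Mock data: SMT is off, so no two lcores share an L1. */
static const uint8_t pr_domain[PR_NB_RAW_LEVEL][PR_NB_LCORE] = {
	{0, 1, 2, 3}, /* L1: every lcore alone */
	{0, 0, 1, 1}, /* L3: two lcores per domain */
	{0, 0, 0, 0}, /* memory: all lcores together */
};

/* A level is useful for placement only if at least two lcores share a
 * domain at that level. */
static int pr_level_is_useful(unsigned int level)
{
	unsigned int a, b;

	assert(level < PR_NB_RAW_LEVEL);
	for (a = 0; a < PR_NB_LCORE; a++) {
		for (b = a + 1; b < PR_NB_LCORE; b++) {
			if (pr_domain[level][a] == pr_domain[level][b])
				return 1;
		}
	}
	return 0;
}

/* Number of levels worth exposing to the application. */
static unsigned int pr_count_useful_levels(void)
{
	unsigned int lvl, n = 0;

	for (lvl = 0; lvl < PR_NB_RAW_LEVEL; lvl++)
		n += (unsigned int)pr_level_is_useful(lvl);
	return n;
}
```

In this mock, the L1 row is skipped, so the nearest level the app would see is the shared L3.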
I have shared above why we need vendor agnostic L1, L2, L3 and sub-NUMA-IO.
Snipped