On 2024-09-12 11:17, Bruce Richardson wrote:
On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
    [Public]

    <snipped>



    > > > > <snipped>

    > > > >

    > > > >>> <snipped>

    > > > >>>

    > > > >>> Thank you Mattias for the comments and question, please let
    me

    > > > >>> try to explain the same below

    > > > >>>

    > > > >>>> We shouldn't have a separate CPU/cache hierarchy API
    instead?

    > > > >>>

    > > > >>> Based on the intention to bring in CPU lcores which share
    same

    > > > >>> L3 (for better cache hits and less noisy neighbor) current
    API

    > > > >>> focuses on using

    > > > >>>

    > > > >>> Last Level Cache. But if the suggestion is `there are SoC
    where

    > > > >>> L2 cache are also shared, and the new API should be

    > > > >>> provisioned`, I am also

    > > > >>>

    > > > >>> comfortable with the thought.

    > > > >>>

    > > > >>

    > > > >> Rather than some AMD special case API hacked into
    <rte_lcore.h>,

    > > > >> I think we are better off with no DPDK API at all for this
    kind of

    > functionality.

    > > > >

    > > > > Hi Mattias, as shared in the earlier email thread, this is not
    a

    > > > > AMD special

    > > > case at all. Let me try to explain this one more time. One of

    > > > techniques used to increase cores cost effective way to go for
    tiles of

    > compute complexes.

    > > > > This introduces a bunch of cores in sharing same Last Level
    Cache

    > > > > (namely

    > > > L2, L3 or even L4) depending upon cache topology architecture.

    > > > >

    > > > > The API suggested in RFC is to help end users to selectively
    use

    > > > > cores under

    > > > same Last Level Cache Hierarchy as advertised by OS (irrespective
    of

    > > > the BIOS settings used). This is useful in both bare-metal and
    container

    > environment.

    > > > >

    > > >

    > > > I'm pretty familiar with AMD CPUs and the use of tiles (including

    > > > the challenges these kinds of non-uniformities pose for work
    scheduling).

    > > >

    > > > To maximize performance, caring about core<->LLC relationship may

    > > > well not be enough, and more HT/core/cache/memory topology

    > > > information is required. That's what I meant by special case. A

    > > > proper API should allow access to information about which lcores
    are

    > > > SMT siblings, cores on the same L2, and cores on the same L3, to

    > > > name a few things. Probably you want to fit NUMA into the same
    API

    > > > as well, although that is available already in <rte_lcore.h>.

    > >

    > > Thank you Mattias for the information, as shared by in the reply
    with

    > Anatoly we want expose a new API `rte_get_next_lcore_ex` which
    intakes a

    > extra argument `u32 flags`.

    > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,

    > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,

    > RTE_GET_LCORE_BOOST_DISABLED.

    > >

    >

    > For the naming, would "rte_get_next_sibling_core" (or lcore if you
    prefer) be a

    > clearer name than just adding "ex" on to the end of the existing
    function?

    Thank you Bruce, Please find my answer below



    Functions shared as per the RFC were

    ```

    - rte_get_llc_first_lcores: Retrieves all the first lcores in the
    shared LLC.

    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.

    - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
    lcores in the shared LLC.

    ```



    MACRO’s extending the usability were

    ```

    RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each
    LLC.

    RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker
    lcore from each LLC.

    RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint
    (lcore id).

    RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while
    skipping first worker.

    RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from
    each LLC.

    RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
    iterates through reaming lcores in each LLC.

    ```



    Based on the discussions we agreed on sharing version-2 FRC for
    extending API as `rte_get_next_lcore_extnd` with extra argument as
    `flags`.

    As per my ideation, for the API ` rte_get_next_sibling_core`, the above
    API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right
    understanding?

    We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
    allows to iterate SMT sibling threads.



This seems like a lot of new macro and API additions! I'd really like to
cut that back and simplify the amount of new things we are adding to DPDK
for this. I tend to agree with others that external libs would be better
for apps that really want to deal with all this.


Conveying HW topology will require a fair bit of API verbiage. I think there's no way around it, other than giving the API user half of the story (or 1% of the story).

That's one of the reasons I think it should be in a separate header file in EAL.


    >

    > Looking logically, I'm not sure about the BOOST_ENABLED and

    > BOOST_DISABLED flags you propose

    The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power
    library which allows to enable boost.

    Allow user to select lcores where BOOST is enabled|disabled using MACRO
    or API.



    - in a system with multiple possible

    > standard and boost frequencies what would those correspond to?

    I now understand the confusion, apologies for mixing the AMD EPYC SoC
    boost with Intel Turbo.



    Thank you for pointing out, we will use the terminology `
    RTE_GET_LCORE_TURBO`.



That still doesn't clarify it for me. If you start mixing in power
management related functions in with topology ones things will turn into a
real headache. What does boost or turbo correspond to? Is it for cores that
have the feature enabled - whether or not it's currently in use - or is it
for finding cores that are currently boosted? Do we need additions for
cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are
in lower frequencies for power-saving. Do we add macros for finding those?


In my world, the operating frequency is a property of a CPU core node in the hardware topology.

lcore discrimination (or classification) shouldn't be built as a myriad of FOREACH macros, but rather generic iteration + app domain logic.

For example, the size of the L3 could be a factor. Should we have a FOREACH_BIG_L3. No.


    What's also

    > missing is a define for getting actual NUMA siblings i.e. those
    sharing common

    > memory but not an L3 or anything else.

    This can be extended into `rte_get_next_lcore_extnd` with flag `
    RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
    sub-memory NUMA as shared by LCORE.

    If SMT sibling is enabled and DPDK Lcore mask covers the sibling
    threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads
    under same memory NUMA of lcore shared.



Yes. That can work. But it means we are basing the implementation on a
fixed idea of what topologies there are or can exist. My suggestion below
is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a
way to find it's nearest nodes.


I think we need to agree what is the purpose of this API. Is it the to describe the hardware topology in some details for general-purpose use (including informing the operator, lstopo-style), or just some abstract, simplified representation to be use purely for work scheduling.

After all, the app doesn't want to know the topology just for the sake of
knowing it - it wants it to ensure best placement of work on cores! To that
end, it just needs to know what cores are near to each other and what are
far away.


    >

    > My suggestion would be to have the function take just an integer-type
    e.g.

    > uint16_t parameter which defines the memory/cache hierarchy level to
    use, 0

    > being lowest, 1 next, and so on. Different systems may have different
    numbers

    > of cache levels so lets just make it a zero-based index of levels,
    rather than

    > giving explicit defines (except for memory which should probably
    always be

    > last). The zero-level will be for "closest neighbour"

    Good idea, we did prototype this internally. But issue it will keep on
    adding the number of API into lcore library.

    To keep the API count less, we are using lcore id as hint to sub-NUMA.


I'm unclear about this keeping the API count down - you are proposing a lot
of APIs and macros up above. My suggestion is basically to add two APIs and
no macros: one API to get the max number of topology-nearness levels, and a
second API to get the next sibling a given nearness level from
0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.

Overall, though, as I say above, let's focus on the problem the app
actually wants these APIs for, not how we think we should solve it. Apps
don't want to know the topology for knowledge sake, they want to use that
knowledge to improve performance by pinning tasks to cores. What is the
minimum that we need to provide to enable the app to do that? For example,
if there are no lcores that share an L1, then from an app topology
viewpoint that L1 level may as well not exist, because it provides us no
details on how to place our work.

For the rare app that does have some esoteric use-case that does actually
want to know some intricate details of the topology, then having that app
use an external lib is probably a better solution than us trying to cover
all possible options in DPDK.

My 2c. on this at this stage anyway.

/Bruce

Reply via email to