On Wed, Sep 11, 2024 at 03:26:20AM +0000, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
> 
> >
> > On 2024-09-09 16:22, Varghese, Vipin wrote:
> > >
> > > <snipped>
> > >
> > >>> <snipped>
> > >>>
> > >>> Thank you, Mattias, for the comments and questions; please let me
> > >>> try to explain below.
> > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > >>>
> > >>> Based on the intention to bring in CPU lcores which share the same
> > >>> L3 (for better cache hits and a less noisy neighbor), the current
> > >>> API focuses on the Last Level Cache. But if the suggestion is `there
> > >>> are SoCs where L2 caches are also shared, and the new API should
> > >>> provision for that`, I am also comfortable with the thought.
> > >>>
> > >>
> > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> > >> think we are better off with no DPDK API at all for this kind of 
> > >> functionality.
> > >
> > > Hi Mattias, as shared in the earlier email thread, this is not an AMD
> > > special case at all. Let me try to explain this one more time. One
> > > technique used to increase core counts cost-effectively is to build the
> > > CPU from tiles of compute complexes. This introduces groups of cores
> > > sharing the same Last Level Cache (namely L2, L3, or even L4),
> > > depending upon the cache topology architecture.
> > >
> > > The API suggested in the RFC is to help end users selectively use
> > > cores under the same Last Level Cache hierarchy as advertised by the
> > > OS (irrespective of the BIOS settings used). This is useful in both
> > > bare-metal and container environments.
> > >
> >
> > I'm pretty familiar with AMD CPUs and the use of tiles (including the
> > challenges these kinds of non-uniformities pose for work scheduling).
> >
> > To maximize performance, caring about the core<->LLC relationship may
> > well not be enough, and more HT/core/cache/memory topology information is
> > required. That's what I meant by special case. A proper API should allow
> > access to information about which lcores are SMT siblings, cores on the same
> > L2, and cores on the same L3, to name a few things. Probably you want to fit
> > NUMA into the same API as well, although that is available already in
> > <rte_lcore.h>.
> 
> Thank you Mattias for the information. As shared in the reply to Anatoly, 
> we want to expose a new API `rte_get_next_lcore_ex` which takes an extra 
> argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, 
> RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
> 

For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer)
be a clearer name than just adding "ex" on to the end of the existing
function?

Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED
flags you propose - in a system with multiple possible standard and boost
frequencies what would those correspond to? What's also missing is a define
for getting actual NUMA siblings i.e. those sharing common memory but not
an L3 or anything else.

My suggestion would be to have the function take just an integer-type, e.g.
uint16_t, parameter which defines the memory/cache hierarchy level to use, 0
being the lowest, 1 the next, and so on. Different systems may have
different numbers of cache levels, so let's just make it a zero-based index
of levels rather than giving explicit defines (except for memory, which
should probably always be last). The zero level will be the "closest
neighbour", whatever that happens to be, with as many levels as are
necessary to express the topology. E.g. without SMT but with 3 cache
levels, level 0 would be an L2 neighbour and level 1 an L3 neighbour. If
the L3 was split within a memory NUMA node, then level 2 would give the
NUMA siblings. We'd just need an API to return the max number of levels
along with the iterator.

Regards,
/Bruce
