On Wed, Sep 11, 2024 at 03:26:20AM +0000, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> <snipped>
>
> >
> > On 2024-09-09 16:22, Varghese, Vipin wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > <snipped>
> > >
> > >>> <snipped>
> > >>>
> > >>> Thank you Mattias for the comments and questions; please let me
> > >>> try to explain below.
> > >>>
> > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > >>>
> > >>> Based on the intention to bring in CPU lcores which share the same
> > >>> L3 (for better cache hits and less noisy neighbours), the current
> > >>> API focuses on using the Last Level Cache. But if the suggestion is
> > >>> `there are SoCs where the L2 cache is also shared, and the new API
> > >>> should provision for that`, I am also comfortable with the thought.
> > >>>
> > >>
> > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> > >> think we are better off with no DPDK API at all for this kind of
> > >> functionality.
> > >
> > > Hi Mattias, as shared in the earlier email thread, this is not an
> > > AMD special case at all. Let me try to explain this one more time.
> > > One technique used to increase core counts cost-effectively is to
> > > build the CPU from tiles of compute complexes. This introduces a
> > > bunch of cores sharing the same Last Level Cache (namely L2, L3 or
> > > even L4), depending on the cache topology architecture.
> > >
> > > The API suggested in the RFC is to help end users selectively use
> > > cores under the same Last Level Cache hierarchy, as advertised by
> > > the OS (irrespective of the BIOS settings used). This is useful in
> > > both bare-metal and container environments.
> > >
> >
> > I'm pretty familiar with AMD CPUs and the use of tiles (including the
> > challenges these kinds of non-uniformities pose for work scheduling).
> >
> > To maximize performance, caring about the core<->LLC relationship may
> > well not be enough, and more HT/core/cache/memory topology information
> > is required. That's what I meant by special case. A proper API should
> > allow access to information about which lcores are SMT siblings,
> > cores on the same L2, and cores on the same L3, to name a few things.
> > Probably you want to fit NUMA into the same API as well, although that
> > is available already in <rte_lcore.h>.
>
> Thank you Mattias for the information. As shared in the reply to
> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which takes
> an extra argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
>
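To make sure we're discussing the same thing: I take it usage of the
proposed API would look roughly like the sketch below. This is only my
guess at the semantics - `rte_get_next_lcore_ex` and the flag names come
from your mail, nothing here exists in DPDK today, and I've assumed the
function keeps the skip_main/wrap arguments of rte_get_next_lcore().

    #include <rte_lcore.h>
    #include <rte_launch.h>

    /* proposed in this thread, not in DPDK today; placeholder value */
    #define RTE_GET_LCORE_L3 (1u << 2)

    /* proposed in this thread; assumed to mirror
     * rte_get_next_lcore(i, skip_main, wrap) plus a flags argument */
    unsigned int rte_get_next_lcore_ex(unsigned int i, int skip_main,
            int wrap, uint32_t flags);

    /* sketch: run fn on every worker lcore sharing an L3 with the
     * main lcore */
    static void
    launch_on_l3_siblings(lcore_function_t *fn)
    {
        unsigned int i;

        for (i = rte_get_next_lcore_ex(-1, 1, 0, RTE_GET_LCORE_L3);
             i < RTE_MAX_LCORE;
             i = rte_get_next_lcore_ex(i, 1, 0, RTE_GET_LCORE_L3))
            rte_eal_remote_launch(fn, NULL, i);
    }

With that understanding, some comments on the proposal: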
For the naming, would "rte_get_next_sibling_core" (or lcore if you
prefer) be a clearer name than just adding "ex" on to the end of the
existing function?

Looking at it logically, I'm not sure about the BOOST_ENABLED and
BOOST_DISABLED flags you propose - in a system with multiple possible
standard and boost frequencies, what would those correspond to? What's
also missing is a define for getting actual NUMA siblings, i.e. those
sharing common memory but not an L3 or anything else.

My suggestion would be to have the function take just an integer-type
parameter, e.g. uint16_t, which defines the memory/cache hierarchy level
to use, 0 being lowest, 1 next, and so on. Different systems may have
different numbers of cache levels, so let's just make it a zero-based
index of levels, rather than giving explicit defines (except for memory,
which should probably always be last). The zero level will be for
"closest neighbour", whatever that happens to be, with as many levels as
are necessary to express the topology. For example, without SMT but with
3 cache levels, level 0 would be an L2 neighbour and level 1 an L3
neighbour. If the L3 was split within a memory NUMA node, then level 2
would give the NUMA siblings. We'd just need an API to return the max
number of levels along with the iterator, something like the sketch
below.
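Roughly like this - all names here are placeholders I've made up for
illustration, and none of this exists in DPDK today:

    #include <stdio.h>
    #include <rte_lcore.h>

    /* hypothetical: number of topology levels the system exposes;
     * level 0 = closest neighbour, highest level = memory/NUMA siblings */
    uint16_t rte_lcore_topology_levels(void);

    /* hypothetical: next lcore after i sharing the given hierarchy level
     * with the calling lcore; same skip_main/wrap semantics as
     * rte_get_next_lcore() */
    unsigned int rte_get_next_sibling_core(unsigned int i, int skip_main,
            int wrap, uint16_t level);

    /* e.g. print the calling lcore's neighbours at each level in turn */
    static void
    dump_topology(void)
    {
        uint16_t lvl, nb_lvls = rte_lcore_topology_levels();
        unsigned int i;

        for (lvl = 0; lvl < nb_lvls; lvl++)
            for (i = rte_get_next_sibling_core(-1, 0, 0, lvl);
                 i < RTE_MAX_LCORE;
                 i = rte_get_next_sibling_core(i, 0, 0, lvl))
                printf("lcore %u is a level-%u sibling\n", i, lvl);
    }

Regards,
/Bruce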