* Valentin Schneider <valentin.schnei...@arm.com> [2020-07-28 16:03:11]:
Hi Valentin,

Thanks for looking into the patches.

> On 27/07/20 06:32, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
>
> So there's at least one arm64 platform out there with the same "pairs of
> cores share L2" thing (Ampere eMAG), and that lives quite happily with
> the default scheduler topology (SMT/MC/DIE). Each pair of cores gets its
> MC domain, and the whole system is covered by DIE.
>
> Now arguably it's not a perfect representation; DIE doesn't have
> SD_SHARE_PKG_RESOURCES, so the highest level sd_llc can point to is MC.
> That will impact all callsites using cpus_share_cache(): in the eMAG
> case, only pairs of cores will be seen as sharing cache, even though
> *all* cores share the same L3.

Okay, it's good to know that we have a chip which is similar to P9 in
topology.

> I'm trying to paint a picture of what the P9 topology looks like (the
> one you showcase in your cover letter) to see if there are any
> similarities; from what I gather in [1], wikichips and your cover
> letter, with P9 you can have something like this in a single DIE
> (somewhat unsure about the L3 setup; it looks to be distributed?)
>
> +---------------------------------------------------------------------+
> |                                  L3                                  |
> +---------------+-+---------------+-+---------------+-+---------------+
> |      L2       | |      L2       | |      L2       | |      L2       |
> +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
> |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
> |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>
> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>
> NUMA    [                                                              ...
> DIE     [                                                             ]
> MC      [             ] [             ] [             ] [             ]
> BIGCORE [             ] [             ] [             ] [             ]
> SMT     [     ] [     ] [     ] [     ] [     ] [     ] [     ] [     ]
>          00-03   04-07   08-11   12-15   16-19   20-23   24-27   28-31   <other node here>

What you have summed up is perfectly what a P9 topology looks like. I
don't think I could have explained it better than this.

> This however has MC == BIGCORE; what makes it so you can have different
> spans for these two domains? If it's not too much to ask, I'd love to
> have a P9 topology diagram.
>
> [1]: 20200722081822.gg9...@linux.vnet.ibm.com

At this time the current topology would be good enough, i.e. BIGCORE
would always be equal to MC. However, in future we could have chips with
fewer or more CPUs in the LLC than in a BIGCORE, or with granular or
split L3 caches within a DIE. In such a case BIGCORE != MC.

Also, in the current P9 itself, two neighbouring core-pairs form a quad.
Cache latency within a quad is better than latency to a distant
core-pair, and cache latency within a core-pair is way better than
latency within a quad. So if we have only 4 threads running on a DIE,
all of them accessing the same cache-lines, then we could probably
benefit if all the tasks were to run within the quad, aka MC/Coregroup.

I have found some latency-sensitive benchmarks that benefit from
grouping at the quad level (using kernel hacks, not backed by firmware
changes). Gautham also found similar results in his experiments, but he
only used binding within the stock kernel.

I am not setting SD_SHARE_PKG_RESOURCES in the MC/Coregroup sd_flags,
as the MC domain need not be the LLC domain on Power.

-- 
Thanks and Regards
Srikar Dronamraju
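P.S. The degeneration behaviour discussed in this thread (a coregroup
domain whose span matches the CACHE domain below it gets folded away,
while a genuine quad-level coregroup survives) can be sketched with a
small toy model. This is only an illustration of the rule, not the
kernel's sched-domain code; the mask layout and helper names are
invented, assuming the 32-CPU node pictured above (4-thread SMT cores,
8-CPU core-pairs, 16-CPU quads):

```python
# Toy model of sched-domain degeneration for the 32-CPU P9 node above.
# Not kernel code: spans are frozensets and the collapse rule is simplified
# to "drop a level whose span equals the span of the level below it".

def cpu_masks(cpu, quad_is_coregroup):
    """Return (name, span) per topology level for `cpu` (hypothetical helper)."""
    smt = frozenset(range(cpu // 4 * 4, cpu // 4 * 4 + 4))        # 4 SMT threads
    cache = frozenset(range(cpu // 8 * 8, cpu // 8 * 8 + 8))      # core-pair (BIGCORE/CACHE)
    if quad_is_coregroup:
        # two neighbouring core-pairs form a quad
        coregroup = frozenset(range(cpu // 16 * 16, cpu // 16 * 16 + 16))
    else:
        coregroup = cache                                         # no distinct coregroup
    die = frozenset(range(32))                                    # whole node
    return [("SMT", smt), ("CACHE", cache), ("COREGROUP", coregroup), ("DIE", die)]

def build_domains(cpu, quad_is_coregroup):
    """Keep a level only if its span differs from the child below it."""
    kept, prev = [], None
    for name, span in cpu_masks(cpu, quad_is_coregroup):
        if span != prev:
            kept.append(name)
            prev = span
    return kept

# Without a real coregroup, the COREGROUP level degenerates:
#   build_domains(0, False) -> ['SMT', 'CACHE', 'DIE']
# With quads as coregroups, it survives:
#   build_domains(0, True)  -> ['SMT', 'CACHE', 'COREGROUP', 'DIE']
```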