On Mon, 12 Apr 2021 at 17:24, Mel Gorman <mgor...@techsingularity.net> wrote: > > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level.
But the SMT4 level is still present here with select_idle_core taking of the spreading > > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > > We have done a lot of changes to reduce and optimize the fast path and > > I don't think re adding another layer in the fast path makes sense as > > you will end up unrolling the for_each_domain behind some > > static_banches. > > > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. As no new domain > is added, it's not clear to me why for_each_domain would change. What I mean is that you should directly do for_each_sched_domain in the fast path because that what you are proposing at the end. It's no more looks like a fast path but a traditional LB > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting and idle CPU. > > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define what the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache. > > > SD_SHARE_PKG_RESOURCES should be set to the last level where we can > > efficiently move task between CPUs at wakeup > > > > The definition of "efficiently" varies. Moving tasks between CPUs sharing > a cache is most efficient but moving the task to a CPU that at least has > local memory channels is a reasonable option if there are no idle CPUs > sharing cache and preferable to stacking. That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me. This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to be considered in fast path > > > > allows within the node with the LLC CPUs masked out. While there would be > > > a latency hit because cache is not shared, it would still be a CPU local > > > to memory that is idle. That would potentially be beneficial on Zen* > > > as well without having to introduce new domains in the topology hierarchy. > > > > What is the current sched_domain topology description for zen ? > > > > The cache and NUMA topologies differ slightly between each generation > of Zen. The common pattern is that a single NUMA node can have multiple > L3 caches and at one point I thought it might be reasonable to allow > spillover to select a local idle CPU instead of stacking multiple tasks > on a CPU sharing cache. I never got as far as thinking how it could be > done in a way that multiple architectures would be happy with. > > -- > Mel Gorman > SUSE Labs