Hi Chen, Prateek.

I got back to work today, sorry for delay.
I am trying to go through the mails.
Apologies in case i have missed any bits.

On 5/26/26 7:38 PM, Chen Yu wrote:
Hi Prateek,

On Tue, 26 May 2026 11:23:59 +0530, K Prateek Nayak <[email protected]> 
wrote:
Hello Srikar,

On 5/26/2026 10:28 AM, Srikar Dronamraju wrote:
L2 Cache reported here is for SMT8 Core aka CACHE domain.

Apart for the scheduler, nothing in tree currently cares about
cpu_coregroup_mask() except for drivers/base/arch_topology.c but
Power doesn't select GENERIC_ARCH_TOPOLOGY.

Why can't Power have an internal mask for MC domain (tl_mc_mask) and
the scheduler can use cpu_coregroup_mask() for the actual LLc? (The L2
mask in this case.)

This seems wrong. there is no notion that coregroup_mask
(MC domain) has to point at LLC domain.

For example, on Shared LPAR, there is no MC domain and LLC is at SMT core level.
In that case coregroup_mask has point at SMT mask is wrong.

If we need a mask to point to the LLC mask which arch has to return, then we 
would
need a new api say cpu_llc_mask ? that can point accordingly.

I don't like mixing MC domain and LLC into one bit.


Power anyways adds its own topology via set_sched_topology() so the
default_topology from kernel/sched/topology.c remains unused.

...

Shouldnt cache-aware scheduling be worried about cpuset partitions too.
If a cpuset has subset of LLC cores, then we should Scheduler assume it can
control complete LLC?

Well, the scheduling takes care of partitions and the cache aware
scheduling bits take care of looking at the full system perspective
for stats aggregation and pointing to a particular LLc.

We don't compare llc_id across cpusets so we keeping one unique llc_id
per H/W LLC instance is feasible and it enables us to keep llc_id space
limited for optimizing cache-aware scheduling.

Now if we have threads of same process across partitions, we'll
still aggregate the utilization numbers across the full LLC but
the load balancers at individual partitions will make a call on
the aggregation.

--
Thanks and Regards,
Prateek



I suppose what you suggested looks like below:

powerpc/smp: make cpu_coregroup_mask() return the LLC

On pSeries shared LPARs(or coregroup_enabled is false on
Power9 and earlier) the hemisphere map is not allocated, so
build_sched_domains() dereferences a NULL cpumask and crashes.

The generic scheduler expects cpu_coregroup_mask() to span the LLC.
On powerpc the LLC is the L2. Return cpu_l2_cache_mask() instead of
the hemisphere map. Use a coregroup_map() helper for the in-file
hemisphere users, and a powerpc_tl_mc_mask() wrapper for the MC
sched-domain level.

Fixes: b5ea300a17e3 ("sched/cache: Make LLC id continuous")
Reported-by: Venkat Rao Bagalkote <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
---
  arch/powerpc/kernel/smp.c | 35 +++++++++++++++++++++++------------
  1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1040,11 +1040,22 @@ static const struct cpumask 
*tl_smallcore_smt_mask(struct sched_domain_topology_
  }
  #endif
+static inline struct cpumask *coregroup_map(int cpu)
+{
+       return per_cpu(cpu_coregroup_map, cpu);
+}
+
  struct cpumask *cpu_coregroup_mask(int cpu)
  {
-       return per_cpu(cpu_coregroup_map, cpu);
+       return cpu_l2_cache_mask(cpu);
+}

This looks wrong to me too. In different hardware topologies
there maybe distinction between coregroup and l2 mask.

Let me go through the code and see if there is better way.

+
+static const struct cpumask *
+powerpc_tl_mc_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+       return coregroup_map(cpu);
  }
static bool has_coregroup_support(void)
  {
        if (is_shared_processor())
@@ -1155,7 +1166,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
        cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
if (has_coregroup_support())
-               cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+               cpumask_set_cpu(boot_cpuid, coregroup_map(boot_cpuid));
init_big_cores();
        if (has_big_cores) {
@@ -1520,8 +1531,8 @@ static void remove_cpu_from_masks(int cpu)
                set_cpus_unrelated(cpu, i, cpu_core_mask);
if (has_coregroup_support()) {
-               for_each_cpu(i, cpu_coregroup_mask(cpu))
-                       set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
+               for_each_cpu(i, coregroup_map(cpu))
+                       set_cpus_unrelated(cpu, i, coregroup_map);
        }
  }
  #endif
@@ -1553,7 +1564,7 @@ static void update_coregroup_mask(int cpu, cpumask_var_t 
*mask)
        if (!*mask) {
                /* Assume only siblings are part of this CPU's coregroup */
                for_each_cpu(i, submask_fn(cpu))
-                       set_cpus_related(cpu, i, cpu_coregroup_mask);
+                       set_cpus_related(cpu, i, coregroup_map);
return;
        }
@@ -1561,18 +1572,18 @@ static void update_coregroup_mask(int cpu, 
cpumask_var_t *mask)
        cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
/* Update coregroup mask with all the CPUs that are part of submask */
-       or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
+       or_cpumasks_related(cpu, cpu, submask_fn, coregroup_map);
/* Skip all CPUs already part of coregroup mask */
-       cpumask_andnot(*mask, *mask, cpu_coregroup_mask(cpu));
+       cpumask_andnot(*mask, *mask, coregroup_map(cpu));
for_each_cpu(i, *mask) {
                /* Skip all CPUs not part of this coregroup */
                if (coregroup_id == cpu_to_coregroup_id(i)) {
-                       or_cpumasks_related(cpu, i, submask_fn, 
cpu_coregroup_mask);
+                       or_cpumasks_related(cpu, i, submask_fn, coregroup_map);
                        cpumask_andnot(*mask, *mask, submask_fn(i));
                } else {
-                       cpumask_andnot(*mask, *mask, cpu_coregroup_mask(i));
+                       cpumask_andnot(*mask, *mask, coregroup_map(i));
                }
        }
  }
@@ -1733,7 +1744,7 @@ static void __init build_sched_topology(void)
if (has_coregroup_support()) {
                powerpc_topology[i++] =
-                       SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);

I would prefer not do this rename. having tl_mc_mask helps to find the usage 
across
the codebase.

+                       SDTL_INIT(powerpc_tl_mc_mask, 
powerpc_shared_proc_flags, MC);
        }
powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);


Reply via email to