On 12/03/14 13:47, Vincent Guittot wrote:
> On 12 March 2014 14:28, Dietmar Eggemann <dietmar.eggem...@arm.com> wrote:
>> On 11/03/14 13:17, Peter Zijlstra wrote:
>>> On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
>>>>>
>>>>> I don't have a strong opinion about using or not a cpu argument for
>>>>> setting the flags of a level (it was part of the initial proposal
>>>>> before we start to completely rework the build of sched_domain)
>>>>> Nevertheless, I see one potential concern that you can have completely
>>>>> different flags configuration of the same sd level of 2 cpus.
>>>>
>>>> Could you elaborate a little bit further regarding the last sentence? Do
>>>> you think that those completely different flags configurations would make
>>>> it impossible for the load-balance code to work at all at this sd?
>>>
>>> So a problem with such an interface is that it makes it far too easy to
>>> generate completely broken domains.
>>
>> I see the point. What I'm still struggling with is to understand why
>> this interface is worse than the one where we set up additional,
>> adjacent sd levels with new cpu_foo_mask functions plus different static
>> sd-flags configurations and rely on the sd degenerate functionality in
>> the core scheduler to fold these levels together to achieve different
>> per-cpu sd flags configurations.
>
> The main difference is that all CPUs have got the same levels at the
> initial state and then the degenerate sequence can decide whether it's
> worth removing a level and if it will not create unusable domains.
>
Agreed. But what I'm trying to say is that the approach of multiple adjacent
sd levels with different cpu_mask(int cpu) functions and static sd topology
flags does not spare us from enforcing sane sd topology flag set-ups somewhere
inside the core scheduler. It is just as easy to introduce erroneous set-ups,
from the standpoint of sd topology flags, with this approach too.

As an example, on the ARM TC2 platform I changed cpu_corepower_mask(int cpu)
[arch/arm/kernel/topology.c] to simulate that in socket 1 (3 Cortex-A7 cores)
the cores can powergate individually, whereas in socket 0 (2 Cortex-A15 cores)
they can't:

 const struct cpumask *cpu_corepower_mask(int cpu)
 {
-	return &cpu_topology[cpu].thread_sibling;
+	return cpu_topology[cpu].socket_id ? &cpu_topology[cpu].thread_sibling :
+			&cpu_topology[cpu].core_sibling;
 }

With this I get the following cpu mask configuration:

dmesg snippet (w/ additional debug in cpu_coregroup_mask(),
cpu_corepower_mask()):

...
CPU0: cpu_corepower_mask=0-1
CPU0: cpu_coregroup_mask=0-1
CPU1: cpu_corepower_mask=0-1
CPU1: cpu_coregroup_mask=0-1
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
...

And I deliberately introduced the following error into the arm_topology[]
table:

 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+	{ cpu_corepower_mask, SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },

With this set-up, I get a GMC & DIE level for CPU0,1 and an MC & DIE level for
CPU2,3,4, i.e. the SD_SHARE_PKG_RESOURCES flag is only set in the MC level for
CPU2,3,4.

dmesg snippet (w/ adapted sched_domain_debug_one(), only CPU0 and CPU2 shown
here):

...
CPU0 attaching sched-domain:
 domain 0: span 0-1 level GMC
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_WAKE_AFFINE SD_SHARE_POWERDOMAIN SD_PREFER_SIBLING
  groups: 0 1
  ...
  domain 1: span 0-4 level DIE
   SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
   SD_WAKE_AFFINE SD_PREFER_SIBLING
   groups: 0-1 (cpu_power = 2048) 2-4 (cpu_power = 3072)
   ...
CPU2 attaching sched-domain:
 domain 0: span 2-4 level MC
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
  groups: 2 3 4
  ...
  domain 1: span 0-4 level DIE
   SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
   SD_WAKE_AFFINE SD_PREFER_SIBLING
   groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
   ...

What I wanted to say is that, IMHO, it doesn't matter which approach we take
(multiple adjacent sd levels or a per-cpu topology sd flags function); we have
to enforce a sane sd topology flags set-up inside the core scheduler anyway.

-- Dietmar

>>
>> IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
>> arch is the reason why the core scheduler has to check if the arch
>> provides a sane sd setup in both cases.
>>
>>>
>>> You can, for two cpus in the same domain, provide different flags; such
>>> a configuration doesn't make any sense at all.
>>>
>>> Now I see why people would like to have this; but unless we can make it
>>> robust I'd be very hesitant to go this route.
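
Just to make the per-cpu flags idea concrete, here is a rough sketch for the
TC2 case above (illustration only, not a patch: the per-cpu signature of
cpu_corepower_flags(int cpu) and the table entry form are assumptions;
cpu_topology[].socket_id and the flag names are what we already use today):

	/*
	 * Sketch only: per-cpu topology flags for a single MC-like level
	 * on TC2.  Socket 0 (Cortex-A15, CPU0-1) cannot powergate its cores
	 * individually, so its cpus also report SD_SHARE_POWERDOMAIN;
	 * socket 1 (Cortex-A7, CPU2-4) can, so its cpus only report
	 * SD_SHARE_PKG_RESOURCES.
	 */
	static int cpu_corepower_flags(int cpu)
	{
		return cpu_topology[cpu].socket_id ?
		       SD_SHARE_PKG_RESOURCES :
		       (SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN);
	}

	/*
	 * The table entry would then reference the callback instead of a
	 * static flags value (assuming the sd_flags member took an int cpu):
	 *
	 *	{ cpu_coregroup_mask, cpu_corepower_flags, SD_INIT_NAME(MC) },
	 */

Within each resulting span the flags still agree (0-1 both carry
SD_SHARE_POWERDOMAIN, 2-4 none do), so this would be a sane configuration; it
just cannot be expressed with one static flags value per level.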
>>>
>> By making it robust, I guess you mean that the core scheduler has to
>> check that the provided set-ups are sane, something like the following
>> code snippet in sd_init()
>>
>>	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>>			"wrong sd_flags in topology description\n"))
>>		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>
>> but for per-cpu set-ups.
>> Obviously, this check has to be in sync with the usage of these flags in
>> the core scheduler algorithms. This probably means that a subset of
>> these topology sd flags has to be set for all cpus in an sd level,
>> whereas others can be set only for some cpus.

[...]
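
To make the per-cpu variant of the check above a bit more concrete, here is a
strawman (sketch only: SD_MUST_AGREE_FLAGS and the per-cpu
int (*sd_flags)(int cpu) hook are made up for illustration; the real
"must agree" subset would have to be derived from how the load-balance code
actually consumes each flag):

	/*
	 * Sketch: topology flags which the load-balance code expects to be
	 * identical for every cpu covered by one level's span (made-up mask).
	 */
	#define SD_MUST_AGREE_FLAGS	(SD_SHARE_PKG_RESOURCES | SD_SHARE_CPUPOWER)

	/*
	 * Warn if two cpus which end up in the same span of a topology level
	 * disagree on the flags that have to be uniform.  Assumes a per-cpu
	 * hook int (*sd_flags)(int cpu) in sched_domain_topology_level.
	 */
	static int check_tl_flags(struct sched_domain_topology_level *tl, int cpu)
	{
		int i, flags = tl->sd_flags(cpu) & SD_MUST_AGREE_FLAGS;

		for_each_cpu(i, tl->mask(cpu)) {
			if ((tl->sd_flags(i) & SD_MUST_AGREE_FLAGS) != flags) {
				WARN_ONCE(1, "inconsistent sd_flags for cpu%d/cpu%d\n",
					  cpu, i);
				return -EINVAL;
			}
		}

		return 0;
	}

Whether such a check lives in sd_init() or in build_sched_domains() is
secondary; the point is that the core scheduler, not the arch, decides what
counts as a sane flags set-up.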