On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + *   node   0   1   2   3
> + *     0:  10  20  30  20
> + *     1:  20  10  20  30
> + *     2:  30  20  10  20
> + *     3:  20  30  20  10
> + *
> + * which represents a 4 node ring topology like:
> + *
> + *   0 ----- 1
> + *   |       |
> + *   |       |
> + *   |       |
> + *   3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2       0-3             0-3             0-3             0-3
> + *  groups:     {0-1,3},{1-3}   {0-2},{0,2-3}   {1-3},{0-1,3}   {0,2-3},{0-2}
> + *
> + * NUMA-1       0-1,3           0-2             1-3             0,2-3
> + *  groups:     {0},{1},{3}     {0},{1},{2}     {1},{2},{3}     {0},{2},{3}
> + *
> + * NUMA-0       0               1               2               3
> + *
> + *
> + * As can be seen, things don't line up as nicely as with the regular
> + * topology. When we iterate a domain in child domain chunks some nodes can
> + * be represented multiple times -- hence the "overlap" naming for this part
> + * of the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover
> + * the domain. For instance, Node-0 NUMA-2 would only get groups: 0-1,3 and
> + * 1-3.
> + *
> + * Because:
> + *
> + *  - the first group of each domain is its child domain; this
> + *    gets us the first 0-1,3
> + *  - the only uncovered node is 2, whose child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of Node-1 NUMA-2; both
> + * groups include the CPUs of Node-0, while those CPUs would not in fact
> + * ever end up in those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
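As an aside, the 'hops' construction above is easy to check by hand. Here is
a minimal user-space sketch (not the kernel code -- the table and names are
local to this example) that, per level, builds the mask of all nodes whose
distance does not exceed that level's distance:

#include <stdio.h>

#define NR_NODES 4

/* The 4-node ring table from the comment above. */
static const int node_distance[NR_NODES][NR_NODES] = {
        { 10, 20, 30, 20 },
        { 20, 10, 20, 30 },
        { 30, 20, 10, 20 },
        { 20, 30, 20, 10 },
};

/* Distinct distances in the table, ascending: NUMA-0, NUMA-1, NUMA-2. */
static const int level_distance[] = { 10, 20, 30 };

int main(void)
{
        for (int level = 0; level < 3; level++) {
                printf("NUMA-%d:", level);
                for (int node = 0; node < NR_NODES; node++) {
                        unsigned int mask = 0;

                        /* All nodes reachable within this level's distance. */
                        for (int j = 0; j < NR_NODES; j++)
                                if (node_distance[node][j] <= level_distance[level])
                                        mask |= 1u << j;
                        printf("  node%d=0x%x", node, mask);
                }
                printf("\n");
        }
        return 0;
}

This reproduces the spans quoted above, e.g. node 0 at NUMA-1 prints 0xb,
i.e. nodes {0,1,3}. If I read sched_init_numa() right, the in-kernel
construction makes the same comparison against sched_domains_numa_distance[]
and caches the result in sched_domains_numa_masks[][].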
I added the below to clarify some of the asymmetric comments we have.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
  *
  * XXX include words on how balance_cpu is unique and therefore can be
  * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  20  30
+ *     1:  20  10  20  20
+ *     2:  20  20  10  20
+ *     3:  30  20  20  10
+ *
+ * Which looks a little like:
+ *
+ *   0 ----- 1
+ *   |     / |
+ *   |   /   |
+ *   | /     |
+ *   2 ----- 3
+ *
+ * This topology is asymmetric: nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the number of
+ * sched_domains is not the same for each CPU. Consider:
+ *
+ * NUMA-2       0-3                             0-3
+ *  groups:     {0-2},{1-3}                     {1-3},{0-2}
+ *
+ * NUMA-1       0-2             0-3             0-3             1-3
+ *
+ * NUMA-0       0               1               2               3
+ *
  */
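Side note: the differing domain counts fall straight out of the distance
table. A similar user-space sketch (again not the kernel code; the kernel
builds one global set of levels and then degenerates the redundant per-CPU
domains, but for this table the outcome should be the same) just counts the
distinct distances in each node's row:

#include <stdio.h>

#define NR_NODES 4

/* The asymmetric table from the comment above. */
static const int node_distance[NR_NODES][NR_NODES] = {
        { 10, 20, 20, 30 },
        { 20, 10, 20, 20 },
        { 20, 20, 10, 20 },
        { 30, 20, 20, 10 },
};

int main(void)
{
        for (int node = 0; node < NR_NODES; node++) {
                int levels = 0;

                /* Count distinct distance values in this node's row. */
                for (int i = 0; i < NR_NODES; i++) {
                        int seen = 0;

                        for (int j = 0; j < i; j++)
                                if (node_distance[node][j] ==
                                    node_distance[node][i])
                                        seen = 1;
                        if (!seen)
                                levels++;
                }
                printf("node %d: %d NUMA levels\n", node, levels);
        }
        return 0;
}

Nodes 0,3 see distances {10,20,30} and get three levels; nodes 1,2 only ever
see {10,20} and get two, which matches the picture above where the NUMA-2
domain only exists for nodes 0 and 3.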