On 8/10/17 23:41, Peter Zijlstra wrote:
On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
up to 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With the full system configuration, the current scheduler creates 4 sched
domains:

  domain0 SMT       (spans a core)
  domain1 MC        (spans a last-level cache)

Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package we don't generate that, right?

That's correct.


  domain2 NUMA      (spans a socket: 4 nodes)
  domain3 NUMA      (spans the system: 8 nodes)

Note that there is no domain to represent cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per NUMA node. This is not the case for
the current scheduler.

Case2:
When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
a properly balanced system should schedule the 8 threads on 8 cpus
(e.g. T0 of C0-C7). However, the current scheduler schedules some
threads on the same cpu while other cpus remain idle.

Sure.. could you amend with a few actual performance numbers?

Sure.

[...]
@@ -1445,9 +1448,24 @@ void sched_init_numa(void)
                tl[i] = sched_domain_topology[i];

        /*
+        * Ignore the NUMA identity level if it has the same cpumask
+        * as the previous level. This is the case for:
+        *   - Systems where the last-level-cache (MC) sched domain spans a NUMA node.
+        *   - Systems where the DIE sched domain spans a NUMA node.
+        *
+        * Assume all NUMA nodes are identical, so only check node 0.
+        */
+       if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
+               tl[i++] = (struct sched_domain_topology_level){
+                       .mask = sd_numa_mask,
+                       .numa_level = 0,
+                       SD_INIT_NAME(NUMA_IDEN)

Shall we make that:

                        SD_INIT_NAME(NODE),

instead?

Sounds good.

+               };

This misses a set of '{}'. While C doesn't require it, our coding style
warrants blocks around any multi-line statement.

So what you've forgotten to mention is that for those systems where the
LLC == NODE this now superfluous level gets removed by the degenerate
code. Have you verified that does the right thing?

Let me check with that one and get back.

Thanks,
Suravee
