+ Matt and Mel.

On Wed, Jun 28, 2017 at 03:26:10AM +0700, Suravee Suthikulpanit wrote:
> So, from the definition above, we would like all those 16 threads to be in
> the same sched-domain, where threads from C0,1,2,3 are in the same
> sched-group, and threads in C4,5,6,7 are in another sched-group.
Figures, you want to have a sched group per L3.

> > Now that thing has a memory controller attached to it, correct?
>
> Yes
>
> > If so, why is this thing not a logical NUMA node, as described in
> > SRAT/SLIT?
>
> Yes, this thing is a logical NUMA node and represented correctly in the
> SRAT/SLIT.
>
> > Now, SRAT should contain the assignment which core belongs to which
> > node. Why is that not sufficient?
>
> Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell
> the scheduler which CPUs are within a NUMA node.
>
> However, looking at the current sched-domains below, notice that there is
> no sched-domain with 16 threads to represent a NUMA node:
>
>   cpu0
>   domain0 00000000,00000001,00000000,00000001 (SMT)
>   domain1 00000000,0000000f,00000000,0000000f (MC)
>   domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
>   domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
>
> sched-domain2 (which represents a sched-domain containing all CPUs within
> a socket) would have 8 sched-groups (based on the cpumasks from domain1).
> According to the documentation snippet above regarding balancing within a
> sched-domain, the scheduler will try to do (NUMA) load balancing between
> the 8 groups (spanning 4 NUMA nodes). Here, IINM, it would be more
> beneficial if the scheduler would try to load-balance between the two
> groups within the same NUMA node first, before going across NUMA nodes,
> in order to minimize memory latency. This would require another
> sched-domain between domains 1 and 2 which represents all 16 threads
> within a NUMA node (i.e. a die sched-domain); that would allow the
> scheduler to load-balance within the NUMA node first, before going across
> NUMA nodes.
>
> However, since the current code decides that x86_has_numa_in_package is
> true, it omits the die sched-domain. In order to avoid this, we are
> proposing to represent cpuinfo_x86.phys_proc_id using the NUMA node ID
> (i.e. die ID). And this is the main point of the patch series.
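As an aside, the hex cpumasks in domain dumps like the one quoted above can be decoded mechanically. A minimal, purely illustrative Python sketch (not kernel code) that expands a comma-separated 128-bit cpumask into the CPU list it spans:

```python
# Illustrative only: decode a comma-separated hex cpumask (as printed
# in sched-domain dumps) into an explicit list of CPU numbers.

def cpumask_to_cpus(mask):
    # Drop the comma separators and treat the whole string as one big
    # hex number; bit N set means CPU N is in the mask.
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if value >> cpu & 1]

# domain1 (MC) from the dump above: the threads of cores C0-C3 plus
# their SMT siblings (offset by 64 on this 128-thread system).
print(cpumask_to_cpus("00000000,0000000f,00000000,0000000f"))
# -> [0, 1, 2, 3, 64, 65, 66, 67]
```

This makes it easy to see that domain1 spans 8 threads (4 cores x 2 SMT) while domain2 spans a whole 64-thread socket, with nothing in between for the 16-thread node.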
Ok, so here's what I'm reading from all this: a 16-thread package *is*
properly represented as a logical NUMA node by SRAT/SLIT after all.
However, sched_init_numa() doesn't parse it properly, or something else
along that path keeps the scheduler from seeing the 4 NUMA nodes on the
physical socket. Because we do use the SLIT table to build the NUMA
domains.

So, considering that the 16T package really is represented properly as a
logical NUMA node, why doesn't it get detected as such?

This is what needs answering first, not some forceful fitting of the
topology into a DIE domain.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
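As background on the sched_init_numa() path mentioned above: it derives the NUMA domain levels from the set of unique distances in the SLIT, and each level spans the nodes reachable within that distance. A rough, hypothetical Python sketch of that idea (the distance values are invented for illustration; the real logic lives in kernel/sched/topology.c):

```python
# Hypothetical sketch of the idea behind sched_init_numa(): the set of
# unique SLIT distances becomes the list of NUMA domain levels, and each
# level spans the nodes within that distance of a given node.

def node_distance(a, b):
    # Made-up SLIT for a 2-socket part with 4 nodes per socket:
    # 10 = local node, 16 = other node in the same package, 32 = remote.
    if a == b:
        return 10
    if a // 4 == b // 4:  # same package
        return 16
    return 32

nodes = range(8)

# One NUMA level per unique distance in the table.
levels = sorted({node_distance(a, b) for a in nodes for b in nodes})
print(levels)  # -> [10, 16, 32]

# The span of node 0 at each level: all nodes within that distance.
for d in levels:
    print(d, [n for n in nodes if node_distance(0, n) <= d])
# 10 -> [0]                  (the node itself)
# 16 -> [0, 1, 2, 3]         (the in-package nodes -- the missing level)
# 32 -> [0, 1, 2, 3, 4, 5, 6, 7]
```

If the firmware's SLIT really distinguishes the in-package distance from the cross-socket one, this construction should already yield the intermediate 4-node level, which is why it is worth checking where along this path it gets lost.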