On Fri, 2018-08-17 at 14:54:39 UTC, Srikar Dronamraju wrote:
> On a shared LPAR, PHYP does not update the CPU associativity at boot
> time. Only after boot does the system recognize itself as a shared LPAR
> and trigger a request for the correct CPU associativity. But by then
> the scheduler has already created/destroyed its sched domains.
>
> This causes:
> - broken load balancing across nodes, leaving islands of cores;
> - performance degradation, especially if the system is lightly loaded;
> - dmesg wrongly reporting all CPUs to be in node 0;
> - dmesg messages complaining of "borken" topology;
> - with commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
>   node sched domain"), RCU stalls at boot.
>
> From a scheduler maintainer's perspective, moving CPUs from one node to
> another or creating more NUMA levels after boot is not appropriate
> without some notification to user space:
> https://lore.kernel.org/lkml/20150406214558.ga38...@linux.vnet.ibm.com/T/#u
>
> The sched_domains_numa_masks table, which is used to generate cpumasks,
> is created only once at boot, just before the sched domains are built,
> and is never updated afterwards. Hence it is better to get the topology
> right before the sched domains are created.
>
> For example, on a 64-core POWER8 shared LPAR, dmesg reports:
>
> [    2.088360] Brought up 512 CPUs
> [    2.088368] Node 0 CPUs: 0-511
> [    2.088371] Node 1 CPUs:
> [    2.088373] Node 2 CPUs:
> [    2.088375] Node 3 CPUs:
> [    2.088376] Node 4 CPUs:
> [    2.088378] Node 5 CPUs:
> [    2.088380] Node 6 CPUs:
> [    2.088382] Node 7 CPUs:
> [    2.088386] Node 8 CPUs:
> [    2.088388] Node 9 CPUs:
> [    2.088390] Node 10 CPUs:
> [    2.088392] Node 11 CPUs:
> ...
> [    3.916091] BUG: arch topology borken
> [    3.916103]      the DIE domain not a subset of the NUMA domain
> [    3.916105] BUG: arch topology borken
> [    3.916106]      the DIE domain not a subset of the NUMA domain
> ...
>
> numactl/lscpu output will still be correct, with cores spread across
> all nodes:
>
> Socket(s):             64
> NUMA node(s):          12
> Model:                 2.0 (pvr 004d 0200)
> Model name:            POWER8 (architected), altivec supported
> Hypervisor vendor:     pHyp
> Virtualization type:   para
> L1d cache:             64K
> L1i cache:             32K
> NUMA node0 CPU(s):     0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
> NUMA node1 CPU(s):     8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
> NUMA node2 CPU(s):     16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
> NUMA node3 CPU(s):     24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
> NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
> NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
> NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
> NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
> NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
> NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
> NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
> NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
>
> Currently on this LPAR, the scheduler detects 2 levels of NUMA and
> creates NUMA sched domains for all CPUs, but then finds a single DIE
> domain consisting of all CPUs. Hence it deletes all NUMA sched domains.
>
> To address this, detect a shared processor LPAR and update the topology
> soon after the CPUs are set up, so that the correct topology is in
> place just before the scheduler creates its sched domains.
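>
> Roughly, the change amounts to the following (a sketch, not the literal
> patch: the helper name shared_proc_topology_init() and its placement
> are illustrative, modelled on arch/powerpc/mm/numa.c, while
> lppaca_shared_proc() and numa_update_cpu_topology() are the existing
> powerpc helpers for detecting a shared processor LPAR and re-reading
> CPU associativity):
>
> /* Runs after all CPUs are up, but before sched domains are built. */
> void __init shared_proc_topology_init(void)
> {
> 	/* Only shared processor LPARs boot with stale associativity. */
> 	if (lppaca_shared_proc(get_lppaca())) {
> 		/*
> 		 * Treat every CPU's associativity as potentially changed
> 		 * and pull the correct topology from the hypervisor now.
> 		 */
> 		bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> 			    nr_cpumask_bits);
> 		numa_update_cpu_topology(false);
> 	}
> }
>
> Calling this from smp_cpus_done() means the node assignments are
> corrected before sched_init_smp() builds the sched domains.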
>
> With the fix, dmesg reports:
>
> [    0.491336] numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
> [    0.491351] numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
> [    0.491359] numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
> [    0.491366] numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
> [    0.491374] numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
> [    0.491379] numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
> [    0.491384] numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
> [    0.491389] numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
> [    0.491394] numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
> [    0.491399] numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
> [    0.491404] numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
> [    0.491409] numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
>
> and lscpu also reports:
>
> Socket(s):             64
> NUMA node(s):          12
> Model:                 2.0 (pvr 004d 0200)
> Model name:            POWER8 (architected), altivec supported
> Hypervisor vendor:     pHyp
> Virtualization type:   para
> L1d cache:             64K
> L1i cache:             32K
> NUMA node0 CPU(s):     0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
> NUMA node1 CPU(s):     8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
> NUMA node2 CPU(s):     16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
> NUMA node3 CPU(s):     24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
> NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
> NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
> NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
> NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
> NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
> NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
> NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
> NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
>
> Previous attempt to solve this problem:
> https://patchwork.ozlabs.org/patch/530090/
>
> Reported-by: Manjunatha H R <manju...@in.ibm.com>
> Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
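As a quick cross-check on any running system, the per-node CPU lists
that dmesg and lscpu show above can also be read directly from sysfs.
Below is a minimal userspace sketch that assumes only the standard
/sys/devices/system/node/node<N>/cpulist files:

	/*
	 * Print each NUMA node's CPU list straight from sysfs.
	 * Nodes that are not present are simply skipped.
	 */
	#include <stdio.h>

	int main(void)
	{
		char path[64];
		char buf[1024];
		int node;

		for (node = 0; node < 1024; node++) {
			FILE *f;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/cpulist",
				 node);
			f = fopen(path, "r");
			if (!f)
				continue;	/* node not present */
			if (fgets(buf, sizeof(buf), f))
				printf("NUMA node%d CPU(s): %s", node, buf);
			fclose(f);
		}
		return 0;
	}

On the LPAR above this should print twelve populated nodes, matching
the lscpu output once the topology is updated before sched domain
creation.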
Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2ea62630681027c455117aa471ea3a

cheers