Matt,

On 7/23/2019 5:48 AM, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
>   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
>
>   $ numactl -C 0-7,32-39 ./spinner 16
>
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
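To make the effect of that check concrete: with RECLAIM_DISTANCE at 30
(per commit 32e45ff43eaf), only the cross-socket, 2-hop pairs (distance
32) in the table above exceed it. Here is a quick user-space sketch of
mine (not kernel code; the real check lives in sd_init()) that applies
the same comparison to that table:

/*
 * User-space sketch, not kernel code: with RECLAIM_DISTANCE at 30,
 * every cross-socket (2-hop, distance 32) node pair in the table
 * exceeds it, so sd_init() strips SD_BALANCE_{FORK,EXEC} and
 * SD_WAKE_AFFINE for those domains.
 */
#include <stdio.h>

#define RECLAIM_DISTANCE	30	/* see commit 32e45ff43eaf */
#define NR_NODES		8

/* NUMA distance table quoted above for a 2-socket EPYC system */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 16, 32, 32, 32, 32 },
	{ 16, 10, 16, 16, 32, 32, 32, 32 },
	{ 16, 16, 10, 16, 32, 32, 32, 32 },
	{ 16, 16, 16, 10, 32, 32, 32, 32 },
	{ 32, 32, 32, 32, 10, 16, 16, 16 },
	{ 32, 32, 32, 32, 16, 10, 16, 16 },
	{ 32, 32, 32, 32, 16, 16, 10, 16 },
	{ 32, 32, 32, 32, 16, 16, 16, 10 },
};

int main(void)
{
	int i, j;

	for (i = 0; i < NR_NODES; i++)
		for (j = 0; j < NR_NODES; j++)
			if (dist[i][j] > RECLAIM_DISTANCE)
				printf("nodes %d-%d: distance %d > %d -> "
				       "fork/exec/wake-affine balancing stripped\n",
				       i, j, dist[i][j], RECLAIM_DISTANCE);
	return 0;
}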
I am testing this patch on Linux 5.2, and I actually do not notice any
difference pre vs. post patch.

Besides the case above, I have also run experiments with different
numbers of threads across the two sockets (note: I only focus on
thread 0 of each core; sXnY = socket X, node Y):

  * s0n0 + s0n1 + s1n0 + s1n1:
      numactl -C 0-15,32-47 ./spinner 32
  * s0n2 + s0n3 + s1n2 + s1n3:
      numactl -C 16-31,48-63 ./spinner 32
  * s0 + s1:
      numactl -C 0-63 ./spinner 64

My observations are:

  * I still notice improper load balancing of some of the tasks for the
    first few seconds before they are load balanced correctly.
  * It takes longer to load balance with a larger number of tasks.

I wonder if you have tried with a different kernel base?

Regards,
Suravee
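For reference, the spinner test itself is not part of this thread; I am
assuming it simply spawns N busy-loop threads, roughly along these lines
(my stand-in, not the actual test program):

/*
 * Stand-in for the "spinner" workload referenced above (my
 * approximation; the actual test program is not in this thread).
 *
 *   gcc -O2 -pthread -o spinner spinner.c
 *   numactl -C 0-7,32-39 ./spinner 16
 */
#include <pthread.h>
#include <stdlib.h>

static void *spin(void *arg)
{
	for (;;)
		;	/* burn CPU so the load balancer has something to move */
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	if (!tids)
		return 1;

	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, spin, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	return 0;
}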