Hi Mel,

On 2020/12/9 22:36, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
>
> I ran this patch with tbench on top of the schedstat patches that
> track SIS efficiency. The tracking adds overhead so it's not a perfect
> performance comparison but the expectation would be that the patch reduces
> the number of runqueues that are scanned.

Thanks for the measurement! I don't use tbench, so I may need a while to
digest the data.

>
> tbench4
>                           5.10.0-rc6             5.10.0-rc6
>                       schedstat-v1r1          idlemask-v7r1
> Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
> Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
> Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
> Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
> Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
> Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
> Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
> Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
> Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
> Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*
>
> Not very optimistic result. The schedstats indicate;

With how many client threads were the following schedstats collected?

>
>                                 5.10.0-rc6      5.10.0-rc6
>                             schedstat-v1r1   idlemask-v7r1
> Ops TTWU Count               5599714302.00   5589495123.00
> Ops TTWU Local               2687713250.00   2563662550.00
> Ops SIS Search               5596677950.00   5586381168.00
> Ops SIS Domain Search        3268344934.00   3229088045.00
> Ops SIS Scanned             15909069113.00  16568899405.00
> Ops SIS Domain Scanned      13580736097.00  14211606282.00
> Ops SIS Failures             2944874939.00   2843113421.00
> Ops SIS Core Search           262853975.00    311781774.00
> Ops SIS Core Hit              185189656.00    216097102.00
> Ops SIS Core Miss              77664319.00     95684672.00
> Ops SIS Recent Used Hit       124265515.00    146021086.00
> Ops SIS Recent Used Miss      338142547.00    403547579.00
> Ops SIS Recent Attempts       462408062.00    549568665.00
> Ops SIS Search Efficiency            35.18           33.72
> Ops SIS Domain Search Eff            24.07           22.72
> Ops SIS Fast Success Rate            41.60           42.20
> Ops SIS Success Rate                 47.38           49.11
> Ops SIS Recent Success Rate          26.87           26.57
>
> The field I would expect to decrease is SIS Domain Scanned -- the number
> of runqueues that were examined but it's actually worse and graphing over
> time shows it's worse for the client thread counts. select_idle_cpu()
> is definitely being called because "Domain Search" is 10 times higher than
> "Core Search" and there "Core Miss" is non-zero.
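
For readers following the numbers, the percentage rows in the schedstat table can be reproduced from the raw counters. The sketch below is only an illustration: the formulas are inferred from the reported values, not taken from the schedstat patches themselves, and the helper names (`pct`, `derived`, `scanned_per_domain_search`) are hypothetical.

```python
# Raw counters copied from the schedstat table quoted above.
BASELINE = {  # 5.10.0-rc6 schedstat-v1r1
    "SIS Search": 5596677950,
    "SIS Domain Search": 3268344934,
    "SIS Scanned": 15909069113,
    "SIS Domain Scanned": 13580736097,
    "SIS Failures": 2944874939,
    "SIS Recent Used Hit": 124265515,
    "SIS Recent Attempts": 462408062,
}
PATCHED = {  # 5.10.0-rc6 idlemask-v7r1
    "SIS Search": 5586381168,
    "SIS Domain Search": 3229088045,
    "SIS Scanned": 16568899405,
    "SIS Domain Scanned": 14211606282,
    "SIS Failures": 2843113421,
    "SIS Recent Used Hit": 146021086,
    "SIS Recent Attempts": 549568665,
}


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to 2 places."""
    return round(100.0 * numerator / denominator, 2)


def derived(ops):
    """Recompute the ratio rows. ASSUMPTION: formulas inferred from the data."""
    search = ops["SIS Search"]
    return {
        "SIS Search Efficiency": pct(search, ops["SIS Scanned"]),
        "SIS Domain Search Eff": pct(ops["SIS Domain Search"],
                                     ops["SIS Domain Scanned"]),
        "SIS Fast Success Rate": pct(search - ops["SIS Domain Search"], search),
        "SIS Success Rate": pct(search - ops["SIS Failures"], search),
        "SIS Recent Success Rate": pct(ops["SIS Recent Used Hit"],
                                       ops["SIS Recent Attempts"]),
    }


def scanned_per_domain_search(ops):
    """Average number of runqueues examined per domain search."""
    return ops["SIS Domain Scanned"] / ops["SIS Domain Search"]


if __name__ == "__main__":
    for name, ops in (("schedstat-v1r1", BASELINE), ("idlemask-v7r1", PATCHED)):
        print(name, derived(ops), round(scanned_per_domain_search(ops), 3))
```

Under these assumed formulas, the average number of runqueues examined per domain search rises from roughly 4.16 to roughly 4.40 with the patch, which is the regression Mel points at.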

Why would SIS Domain Scanned decrease? I thought SIS Scanned was the
field that was supposed to decrease, but that does not seem to be the
case on your side.

I printed some trace logs on my side with a uperf workload, and the
behavior looks correct. To make the log easy to read, I started a
4-vCPU VM and ran uperf with 8 threads for 2 seconds.

stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
========================================================================================
          <idle>-0  [002] d..1 137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0  [000] d..1 137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0  [003] d..1 137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
          <idle>-0  [001] d..1 137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 2: uperf ramp up, cpumask changes back and forth
======================================================
         uperf-561  [003] d..3 137.410620: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..5 137.411384: select_task_rq_fair: scanning: 0-3
  kworker/u8:3-110  [000] d..4 137.411436: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d.h1 137.412562: update_idle_cpumask: set_idle-0, cpumask: 1-3
         uperf-570  [002] d.h2 137.412580: update_idle_cpumask: set_idle-0, cpumask: 1,3
          <idle>-0  [002] d..1 137.412917: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0  [000] d..1 137.413004: update_idle_cpumask: set_idle-1, cpumask: 0-3
         uperf-560  [000] d..5 137.415856: select_task_rq_fair: scanning: 0-3
  kworker/u8:3-110  [001] d..4 137.415956: select_task_rq_fair: scanning: 0-3
          sshd-504  [003] d.h1 137.416562: update_idle_cpumask: set_idle-0, cpumask: 0-2
         uperf-560  [000] d.h1 137.416598: update_idle_cpumask: set_idle-0, cpumask: 1-2
          <idle>-0  [003] d..1 137.416638: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0  [000] d..1 137.417076: update_idle_cpumask: set_idle-1, cpumask: 0-3
  tmux: server-528  [001] d.h. 137.448566: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0  [001] d..1 137.448980: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 3: uperf running, select_idle_cpu scans all the CPUs in the scheduler domain at the beginning
===================================================================================================
         uperf-560  [000] d..3 138.418494: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..3 138.418506: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..3 138.418514: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] dN.3 138.418534: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] dN.3 138.418543: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] dN.3 138.418551: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418577: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418600: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418617: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418640: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418652: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418662: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418672: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..5 138.418676: select_task_rq_fair: scanning: 0-3
         uperf-561  [003] d..3 138.418693: select_task_rq_fair: scanning: 0-3
  kworker/u8:3-110  [002] d..4 138.418746: select_task_rq_fair: scanning: 0-3

stage 4: scheduler tick comes, update idle cpumask to EMPTY
===========================================================
         uperf-572  [002] d.h. 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
         uperf-574  [000] d.H2 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
         uperf-565  [003] d.H6 139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
  tmux: server-528  [001] d.h2 139.420572: update_idle_cpumask: set_idle-0, cpumask:

stage 5: uperf continues running, select_idle_cpu does not scan any idle cpu
============================================================================
<I only ran uperf for 2 seconds; during these two seconds there was no idle cpu in the cpumask to scan>
         uperf-565  [003] d.sa 139.420587: select_task_rq_fair: scanning:
         uperf-572  [002] d.sa 139.420670: select_task_rq_fair: scanning:
         ............
         uperf-561  [003] d..5 141.421620: select_task_rq_fair: scanning:
         uperf-571  [001] d.sa 141.421630: select_task_rq_fair: scanning:

stage 6: uperf benchmark finished, idle thread switches in
==========================================================
          <idle>-0  [002] d..1 141.421631: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0  [000] d..1 141.421654: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0  [001] d..1 141.421665: update_idle_cpumask: set_idle-1, cpumask: 0-2
         uperf-561  [003] d..5 141.421712: select_task_rq_fair: scanning: 0-2
          <idle>-0  [003] d..1 141.421807: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 7: uperf ramp down
========================
         uperf-560  [000] d..5 141.423075: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..5 141.423107: select_task_rq_fair: scanning: 0-3
         uperf-560  [000] d..5 141.423259: select_task_rq_fair: scanning: 0-3
  tmux: server-528  [002] d..5 141.730489: select_task_rq_fair: scanning: 1-3
   kworker/u8:1-96  [003] d..4 141.731924: select_task_rq_fair: scanning: 1-3
          <idle>-0  [000] d..1 141.734560: update_idle_cpumask: set_idle-1, cpumask: 0-3
  tmux: server-528  [002] d.h1 141.736568: update_idle_cpumask: set_idle-0, cpumask: 0-1
      uperf.sh-558  [003] d.h. 141.736570: update_idle_cpumask: set_idle-0, cpumask: 0-1
          <idle>-0  [002] d..1 141.736718: update_idle_cpumask: set_idle-1, cpumask: 0-2
          <idle>-0  [003] d..1 141.738179: update_idle_cpumask: set_idle-1, cpumask: 0-3
         pkill-578  [001] d.h1 141.740569: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0  [001] d..1 141.741875: update_idle_cpumask: set_idle-1, cpumask: 0-3
         pkill-578  [001] d.h. 141.744570: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
         pkill-578  [001] d..6 141.770012: select_task_rq_fair: scanning: 0,2-3
          <idle>-0  [001] d..1 141.770938: update_idle_cpumask: set_idle-1, cpumask: 0-3

In this case, SIS Scanned should decrease. I'll apply your schedstat
patch to see if the data matches.

>
> I suspect the issue is that the mask is only marked busy from the tick
> context which is a very wide window. If select_idle_cpu() picks an idle
> CPU from the mask, it's still marked as idle in the mask.
>

That should be fine because we still check available_idle_cpu() and
sched_idle_cpu() for the selected CPU. And even if that CPU is still
marked as idle in the mask, the mask should not be worse (bigger) than
the default sched_domain_span(sd).

Thanks,
-Aubrey
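
The behavior discussed in this thread can be sketched as a small user-space model. This is only a toy illustration under stated assumptions, not the kernel code: the class `IdleMaskDomain`, its methods, and `fmt_cpulist` are hypothetical names, the `really_idle` set merely stands in for the available_idle_cpu()/sched_idle_cpu() checks, and the scan ignores details such as scan limits and the starting CPU.

```python
def fmt_cpulist(cpus):
    """Format a set of CPU ids as a cpulist string, e.g. {0, 2, 3} -> '0,2-3'."""
    cpus = sorted(cpus)
    parts = []
    i = 0
    while i < len(cpus):
        j = i
        # Extend the run while the next CPU id is consecutive.
        while j + 1 < len(cpus) and cpus[j + 1] == cpus[j] + 1:
            j += 1
        parts.append(str(cpus[i]) if i == j else f"{cpus[i]}-{cpus[j]}")
        i = j + 1
    return ",".join(parts)


class IdleMaskDomain:
    """Toy model of one sched domain with a lazily cleared idle mask."""

    def __init__(self, span):
        self.span = frozenset(span)   # stands in for sched_domain_span(sd)
        self.idle_mask = set()        # the tracked idle cpumask
        self.really_idle = set()      # ground truth (hypothetical stand-in)

    def enter_idle(self, cpu):
        # Entering idle sets the CPU in the mask immediately.
        self.really_idle.add(cpu)
        self.idle_mask.add(cpu)

    def run_task(self, cpu):
        # Becoming busy does NOT touch the mask; only the tick clears it.
        self.really_idle.discard(cpu)

    def tick(self, cpu):
        # The scheduler tick clears a busy CPU, which rate-limits updates.
        if cpu not in self.really_idle:
            self.idle_mask.discard(cpu)

    def select_idle_cpu(self):
        """Scan the idle mask; return (chosen cpu or None, runqueues scanned)."""
        scanned = 0
        for cpu in sorted(self.idle_mask & self.span):
            scanned += 1
            if cpu in self.really_idle:   # stale entries are rejected here
                return cpu, scanned
        return None, scanned


if __name__ == "__main__":
    dom = IdleMaskDomain(range(4))
    for cpu in (2, 0, 3, 1):              # same order as stage 1 of the trace
        dom.enter_idle(cpu)
        print("set_idle-1, cpumask:", fmt_cpulist(dom.idle_mask))
    for cpu in range(4):
        dom.run_task(cpu)
    print("stale mask scan:", dom.select_idle_cpu())
    for cpu in range(4):
        dom.tick(cpu)
    print("after tick scan:", dom.select_idle_cpu())
```

In this model a stale mask costs at most as many scanned runqueues as the full domain span, and once the tick has cleared the busy CPUs the scan cost drops to zero, which matches the argument above. Whether real workloads such as tbench hit the stale window often enough to matter is exactly what the schedstat data is meant to show.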