Hi Rong,

On 11/02/2018 03:14 AM, kernel test robot wrote:
> Greetings,
> 
> FYI, we noticed a -4.0% regression of will-it-scale.per_process_ops due to 
> commit:
> 
> 
> commit: fd978bf7fd312581a7ca454a991f0ffb34c4204b ("bpf: Add reference 
> tracking to verifier")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: will-it-scale
> on test machine: 80 threads Skylake with 64G memory
> with following parameters:
> 
>       nr_task: 100%
>       mode: process
>       test: mmap1
>       cpufreq_governor: performance

Hmm, so the test cases you are running are these ones:

  https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap1.c
  https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap2.c
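
For reference, the hot loop in mmap1 boils down to roughly the following
(a paraphrased sketch from memory, not the verbatim upstream source; the
exact MEMSIZE and harness glue may differ):

  #include <assert.h>
  #include <sys/mman.h>

  #define MEMSIZE (128 * 1024 * 1024)

  static void testcase(unsigned long long *iterations)
  {
          while (1) {
                  /* Map and immediately unmap anonymous memory, so each
                   * iteration stresses mmap()/munmap() and the mm
                   * accounting that dominates the profiles below.
                   */
                  char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  assert(c != MAP_FAILED);
                  munmap(c, MEMSIZE);

                  (*iterations)++;
          }
  }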

The commit from Joe referenced above only adds a feature to the (eBPF)
verifier. Looking through the will-it-scale test suite, it seems neither
cBPF nor eBPF is in use there, and even if cBPF were involved (e.g. via
seccomp BPF), the commit would have no effect on it since seccomp filters
are not loaded through bpf(2). So if the bisection is correct, something
must be using eBPF here, but then it's also unclear right now how this
would even remotely affect mmap() test performance by -4%. Hm, are you
certain it's not a false bisection? If it isn't, what else is loading
eBPF on your machine in parallel while you run the tests?
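
Just to spell out the load-path distinction I mean above, a minimal
sketch (program setup and error handling omitted, function names are
mine):

  #include <linux/bpf.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* cBPF, e.g. seccomp: a classic filter attached via prctl(2) (or
   * seccomp(2)); it never passes through the eBPF verifier.
   */
  static int load_cbpf_seccomp(struct sock_fprog *prog)
  {
          return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
  }

  /* eBPF: loaded via bpf(2) with BPF_PROG_LOAD, the only path where
   * the verifier, and thus the reference tracking added by the commit
   * above, runs at all.
   */
  static int load_ebpf(union bpf_attr *attr)
  {
          return syscall(__NR_bpf, BPF_PROG_LOAD, attr, sizeof(*attr));
  }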

Thanks,
Daniel

> test-description: Will It Scale takes a testcase and runs it from 1 through 
> to n parallel copies to see if the testcase will scale. It builds both a 
> process and threads based test in order to see any differences between the 
> two.
> test-url: https://github.com/antonblanchard/will-it-scale
> 
> In addition to that, the commit also has significant impact on the following 
> tests:
> 
> +------------------+---------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.8% regression |
> | test machine     | 80 threads Skylake with 64G memory                            |
> | test parameters  | cpufreq_governor=performance                                  |
> |                  | mode=process                                                  |
> |                  | nr_task=100%                                                  |
> |                  | test=mmap2                                                    |
> +------------------+---------------------------------------------------------------+
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap1/will-it-scale
> 
> commit: 
>   84dbf35073 ("bpf: Macrofy stack state copy")
>   fd978bf7fd ("bpf: Add reference tracking to verifier")
> 
> 84dbf3507349696b fd978bf7fd312581a7ca454a99 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>      16811            -4.0%      16140        will-it-scale.per_process_ops
>    1344946            -4.0%    1291230        will-it-scale.workload
>     107.75 ± 38%    +252.4%     379.75 ± 93%  cpuidle.POLL.usage
>     121.70 ± 18%     +18.9%     144.70 ±  4%  sched_debug.cfs_rq:/.exec_clock.stddev
>       4933            +2.0%       5031        proc-vmstat.nr_inactive_anon
>       4933            +2.0%       5031        proc-vmstat.nr_zone_inactive_anon
>       9874            +9.0%      10765 ±  7%  slabinfo.proc_inode_cache.active_objs
>       9874            +9.0%      10765 ±  7%  slabinfo.proc_inode_cache.num_objs
>      12248 ± 50%     +52.2%      18640 ±  2%  numa-meminfo.node0.Inactive
>      33943 ±  8%     +16.2%      39453 ±  6%  numa-meminfo.node0.SReclaimable
>      91549 ±  9%      -9.9%      82530 ±  7%  numa-meminfo.node1.Slab
>      18091 ± 15%     +29.2%      23382 ± 17%  numa-vmstat.node0
>       3027 ± 52%     +52.6%       4620 ±  3%  numa-vmstat.node0.nr_inactive_anon
>       8485 ±  8%     +16.2%       9862 ±  6%  numa-vmstat.node0.nr_slab_reclaimable
>       3027 ± 52%     +52.6%       4620 ±  3%  numa-vmstat.node0.nr_zone_inactive_anon
>    1.4e+12            -2.5%  1.364e+12        perf-stat.branch-instructions
>      41.42            +0.7       42.15        perf-stat.cache-miss-rate%
>  2.166e+10            -2.1%   2.12e+10        perf-stat.cache-references
>      12.16            +2.7%      12.49        perf-stat.cpi
>  1.741e+12            -2.8%  1.692e+12        perf-stat.dTLB-loads
>       0.00 ±  3%      +0.0        0.00 ±  9%  perf-stat.dTLB-store-miss-rate%
>  5.713e+11            -3.9%   5.49e+11        perf-stat.dTLB-stores
>  6.103e+12            -2.6%  5.943e+12        perf-stat.instructions
>       0.08            -2.6%       0.08        perf-stat.ipc
>  1.954e+09            -1.8%  1.919e+09        perf-stat.node-load-misses
>    4538060            +1.4%    4602862        perf-stat.path-length
>      49.62            -0.5       49.14        perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      47.64            -0.5       47.17        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
>      47.49            -0.5       47.02        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
>      49.99            -0.5       49.53        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      49.96            -0.5       49.51        perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      48.02            -0.4       47.58        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>       1.41            -0.0        1.37        perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>      47.73            +0.4       48.11        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
>      47.85            +0.4       48.25        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
>      48.28            +0.4       48.68        perf-profile.calltrace.cycles-pp.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
>      48.23            +0.4       48.63        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
>      48.96            +0.4       49.41        perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
>      49.11            +0.5       49.56        perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      49.24            +0.5       49.70        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      49.25            +0.5       49.72        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      49.62            -0.5       49.15        perf-profile.children.cycles-pp.do_munmap
>      49.99            -0.5       49.53        perf-profile.children.cycles-pp.__x64_sys_munmap
>      49.97            -0.5       49.51        perf-profile.children.cycles-pp.vm_munmap
>       0.51 ±  2%      -0.0        0.46        perf-profile.children.cycles-pp.___might_sleep
>       1.16            -0.0        1.11        perf-profile.children.cycles-pp.unmap_vmas
>       1.15            -0.0        1.10        perf-profile.children.cycles-pp.unmap_page_range
>       1.41            -0.0        1.37        perf-profile.children.cycles-pp.unmap_region
>       0.32 ±  2%      +0.0        0.34 ±  2%  perf-profile.children.cycles-pp.up_write
>       0.32 ±  2%      +0.0        0.34        perf-profile.children.cycles-pp.vm_area_alloc
>       0.29            +0.0        0.32        perf-profile.children.cycles-pp.kmem_cache_alloc
>      48.28            +0.4       48.68        perf-profile.children.cycles-pp.__vm_enough_memory
>      48.96            +0.4       49.41        perf-profile.children.cycles-pp.mmap_region
>      49.11            +0.5       49.56        perf-profile.children.cycles-pp.do_mmap
>      49.25            +0.5       49.71        perf-profile.children.cycles-pp.vm_mmap_pgoff
>      49.25            +0.5       49.72        perf-profile.children.cycles-pp.ksys_mmap_pgoff
>       0.47 ±  3%      -0.0        0.43        perf-profile.self.cycles-pp.___might_sleep
>       0.32 ±  3%      +0.0        0.34 ±  2%  perf-profile.self.cycles-pp.up_write
>       0.27            +0.0        0.30        perf-profile.self.cycles-pp.kmem_cache_alloc
>       0.49            +0.0        0.53        perf-profile.self.cycles-pp.percpu_counter_add_batch
> 
> 
> 
>                             will-it-scale.per_process_ops
> 
>   18000 +-+-----------------------------------------------------------------+
>         |                                                                   |
>   17500 +-+   +.+                                                           |
>         |+.+++  :            +.++++.+++           ++++.++++.++              |
>         |        :++. +      :        :          :            :             |
>   17000 +-+      +   + ++.++:          :  ++.+++ :            ++.+ ++.   +. |
>         |                   +          +.+      +                 +   +++  +|
>   16500 +-+                                                                 |
>         |              O  OOOO OOOO O O                                     |
>   16000 +-+           O O            O O O                                  |
>         |                                                                   |
>         O     O  OOO O                                                      |
>   15500 +O+OOO  O                                                           |
>         |                                                                   |
>   15000 +-+-----------------------------------------------------------------+
> 
> 
>                                 will-it-scale.workload
> 
>   1.42e+06 +-+--------------------------------------------------------------+
>    1.4e+06 +-+   ++                                                         |
>            |++.++ :                 ++.                   +++.+             |
>   1.38e+06 +-+     :           +.+++   ++          ++++.++    :             |
>   1.36e+06 +-+     +.+++++.    :         :         :           :+           |
>            |               ++++          ++.+++++.+            + ++.+++++.++|
>   1.34e+06 +-+                                                              |
>   1.32e+06 +-+                                                              |
>    1.3e+06 +-+               O                                              |
>            |            OO OO OO OOOOO  OOO                                 |
>   1.28e+06 +-+                         O                                    |
>   1.26e+06 +-+       O                                                      |
>            O     O O  OO                                                    |
>   1.24e+06 +OO OO O                                                         |
>   1.22e+06 +-+--------------------------------------------------------------+
>   
> [*] bisect-good sample
> [O] bisect-bad  sample
> 
> ***************************************************************************************************
> lkp-skl-2sp2: 80 threads Skylake with 64G memory
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap2/will-it-scale
> 
> commit: 
>   84dbf35073 ("bpf: Macrofy stack state copy")
>   fd978bf7fd ("bpf: Add reference tracking to verifier")
> 
> 84dbf3507349696b fd978bf7fd312581a7ca454a99 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>      16832            -3.8%      16186        will-it-scale.per_process_ops
>    1346634            -3.8%    1294984        will-it-scale.workload
>     390809 ± 21%     +51.6%     592424 ± 27%  cpuidle.C1.time
>       6897            +2.7%       7085        proc-vmstat.nr_mapped
>     936.00 ±  7%     +15.6%       1082 ±  5%  slabinfo.Acpi-ParseExt.active_objs
>     936.00 ±  7%     +15.6%       1082 ±  5%  slabinfo.Acpi-ParseExt.num_objs
>     968.00 ±  9%     +27.5%       1233 ± 16%  slabinfo.pool_workqueue.active_objs
>     968.00 ±  9%     +29.7%       1255 ± 16%  slabinfo.pool_workqueue.num_objs
>       8430           -14.1%       7244 ±  2%  numa-meminfo.node0.KernelStack
>       4283 ± 14%     -22.4%       3325 ± 10%  numa-meminfo.node0.PageTables
>      73929 ±  3%     -10.6%      66061 ±  6%  numa-meminfo.node0.SUnreclaim
>       5569 ±  2%     +21.0%       6738 ±  3%  numa-meminfo.node1.KernelStack
>      55085 ±  5%     +17.5%      64739 ±  5%  numa-meminfo.node1.SUnreclaim
>      81155 ±  6%     +16.2%      94292 ±  7%  numa-meminfo.node1.Slab
>     230.00          -100.0%       0.00        numa-vmstat.node0.nr_active_file
>     100.25 ±  3%     -88.8%      11.25 ±173%  numa-vmstat.node0.nr_inactive_file
>       8431           -14.1%       7245 ±  2%  numa-vmstat.node0.nr_kernel_stack
>       1071 ± 14%     -22.4%     831.25 ± 10%  numa-vmstat.node0.nr_page_table_pages
>      18482 ±  3%     -10.6%      16515 ±  6%  numa-vmstat.node0.nr_slab_unreclaimable
>     230.00          -100.0%       0.00        numa-vmstat.node0.nr_zone_active_file
>     100.25 ±  3%     -88.8%      11.25 ±173%  numa-vmstat.node0.nr_zone_inactive_file
>       5569 ±  2%     +21.0%       6738 ±  3%  numa-vmstat.node1.nr_kernel_stack
>       2778 ±  3%     +28.4%       3567 ± 16%  numa-vmstat.node1.nr_mapped
>      13771 ±  5%     +17.5%      16184 ±  5%  numa-vmstat.node1.nr_slab_unreclaimable
>  1.506e+12            -2.5%  1.468e+12        perf-stat.branch-instructions
>      41.41            +0.8       42.20        perf-stat.cache-miss-rate%
>  2.165e+10            -1.7%  2.129e+10        perf-stat.cache-references
>      11.25            +2.8%      11.57        perf-stat.cpi
>  1.891e+12            -2.8%  1.838e+12        perf-stat.dTLB-loads
>  6.543e+11            -3.7%    6.3e+11        perf-stat.dTLB-stores
>  6.591e+12            -2.6%  6.419e+12        perf-stat.instructions
>       0.09            -2.7%       0.09        perf-stat.ipc
>  1.967e+09            -1.3%  1.941e+09        perf-stat.node-load-misses
>    4894750            +1.3%    4956596        perf-stat.path-length
>      40.37 ± 12%     -16.2%      33.81 ±  7%  sched_debug.cfs_rq:/.load_avg.stddev
>       0.05 ±  2%     +18.7%       0.06 ±  3%  sched_debug.cfs_rq:/.nr_running.stddev
>       6.37 ± 40%     -50.2%       3.17 ± 32%  sched_debug.cfs_rq:/.removed.load_avg.avg
>      31.64 ± 18%     -28.5%      22.63 ± 16%  sched_debug.cfs_rq:/.removed.load_avg.stddev
>     293.89 ± 40%     -50.1%     146.61 ± 32%  sched_debug.cfs_rq:/.removed.runnable_sum.avg
>       1459 ± 18%     -28.3%       1045 ± 16%  sched_debug.cfs_rq:/.removed.runnable_sum.stddev
>       2.46 ± 43%     -60.9%       0.96 ± 66%  sched_debug.cfs_rq:/.removed.util_avg.avg
>      12.42 ± 26%     -46.5%       6.64 ± 59%  sched_debug.cfs_rq:/.removed.util_avg.stddev
>     385.92 ±  6%     +12.8%     435.46 ±  2%  sched_debug.cpu.nr_switches.min
>     -14.21           -31.4%      -9.75        sched_debug.cpu.nr_uninterruptible.min
>      47.54            -0.2       47.31        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
>      47.67            -0.2       47.45        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
>      48.04            -0.2       47.86        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>      99.36            -0.0       99.34        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>       1.47            +0.0        1.51        perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>      94.77            -0.3       94.52        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>      95.04            -0.2       94.81        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>      95.77            -0.2       95.60        perf-profile.children.cycles-pp.percpu_counter_add_batch
>      49.72            -0.1       49.58        perf-profile.children.cycles-pp.do_munmap
>       0.53 ±  2%      -0.1        0.47        perf-profile.children.cycles-pp.___might_sleep
>       0.30 ±  2%      +0.0        0.33        perf-profile.children.cycles-pp.perf_event_mmap
>       0.30 ±  3%      +0.0        0.33 ±  2%  perf-profile.children.cycles-pp.vm_area_alloc
>       0.33 ±  2%      +0.0        0.36 ±  2%  perf-profile.children.cycles-pp.up_write
>       1.48            +0.0        1.51        perf-profile.children.cycles-pp.unmap_region
>      94.77            -0.3       94.52        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>       0.48 ±  2%      -0.0        0.44        perf-profile.self.cycles-pp.___might_sleep
>       0.33 ±  2%      +0.0        0.36 ±  2%  perf-profile.self.cycles-pp.up_write
>       0.53            +0.0        0.57        perf-profile.self.cycles-pp.unmap_page_range
>       0.47            +0.0        0.52 ±  2%  perf-profile.self.cycles-pp.percpu_counter_add_batch
> 
> 
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 
> Thanks,
> Rong Chen
> 
