Hi Jiri, Arnaldo, On 22.01.2019 20:45, Alexey Budankov wrote: > > It has been observed that trace reading thread runs on the same hw thread > most of the time during perf record sampling collection. This scheduling > layout leads up to 30% profiling overhead in case when some cpu intensive > workload fully utilizes a large server system with NUMA. Overhead usually > arises from remote (cross node) HW and memory references that have much > longer latencies than local ones [1]. > > This patch set implements --affinity option that lowers 30% overhead > completely for serial trace streaming (--affinity=cpu) and from 30% to > 10% for AIO1 (--aio=1) trace streaming (--affinity=node|cpu). > See OVERHEAD section below for more details. > > Implemented extension provides users with capability to instruct Perf > tool to bounce trace reading thread's affinity mask between NUMA nodes > (--affinity=node) or assign the thread to the exact cpu (--affinity=cpu) > that trace buffer being processed belongs to. > > The extension brings improvement in case of full system utilization when > Perf tool process contends with workload process on cpu cores. In case a > system has free cores to execute Perf tool process during profiling the > default system scheduling layout induces the lowest overhead. > > The patch set has been validated on BT benchmark from NAS Parallel > Benchmarks [2] running on dual socket, 44 cores, 88 hw threads Broadwell > system with kernels v4.4-21-generic (Ubuntu 16.04) and v4.20.0-rc5 > (tip perf/core). > > The patch set is for Arnaldo's perf/core repository. > > OVERHEAD: > BENCH REPORT BASED ELAPSED TIME BASED > v4.20.0-rc5 > (tip perf/core): > > (current) SERIAL-SYS / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69) > SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69) > SERIAL-CPU / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69) > > AIO1-SYS / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69) > AIO1-NODE / BASE : 1.08x (12.23/11.31), 1,11x (13.01/11.69) > AIO1-CPU / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69) > > v4.4.0-21-generic > (Ubuntu 16.04 LTS): > > (current) SERIAL-SYS / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32) > SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32) > SERIAL-CPU / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32) > > AIO1-SYS / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32) > AIO1-NODE / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32) > AIO1-CPU / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32) > > --- > Alexey Budankov (4): > perf record: allocate affinity masks > perf record: bind the AIO user space buffers to nodes > perf record: apply affinity masks when reading mmap buffers > perf record: implement --affinity=node|cpu option > > tools/perf/Documentation/perf-record.txt | 5 ++ > tools/perf/builtin-record.c | 45 +++++++++- > tools/perf/perf.h | 8 ++ > tools/perf/util/cpumap.c | 10 +++ > tools/perf/util/cpumap.h | 1 + > tools/perf/util/evlist.c | 6 +- > tools/perf/util/evlist.h | 2 +- > tools/perf/util/mmap.c | 105 ++++++++++++++++++++++- > tools/perf/util/mmap.h | 3 +- > 9 files changed, 175 insertions(+), 10 deletions(-) > > --- > Changes in v5: > - avoided multiple allocations of online cpu maps by > implementing it once in cpu_map__online() > - reduced indentation at record__parse_affinity()
Are there any more comments on this patch set? Thanks, Alexey > > Changes in v4: > - fixed compilation issue converting pr_warn() to pr_warning() > - implemented stop if mbind() fails > - corrected mmap_params->cpu_map initialization to be based on > /sys/devices/system/cpu/online > - separated node cpu map generation into build_node_mask() > > Changes in v3: > - converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX > - corrected code style issues > - adjusted __aio_alloc,__aio_bind,__aio_free() implementation > - separated mask manipulations into __adjust_affinity() and > __setup_affinity_mask() > - implemented mapping of c index into online cpu index > - adjusted indentation at record__parse_affinity() > > Changes in v2: > - made debug affinity mode message user friendly > - converted affinity mode defines to enum values > - implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind > and put HAVE_LIBNUMA_SUPPORT #ifdefs in there > - separated AIO buffers binding to patch 2/4 > > --- > [1] https://en.wikipedia.org/wiki/Non-uniform_memory_access > [2] https://www.nas.nasa.gov/publications/npb.html > [3] http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html > [4] http://man7.org/linux/man-pages/man2/mbind.2.html > > --- > ENVIRONMENT AND MEASUREMENTS: > > MACHINE: > > broadwell, dual socket, 44 core, 88 threads > > /proc/cpuinfo > > processor : 87 > vendor_id : GenuineIntel > cpu family : 6 > model : 79 > model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz > stepping : 1 > microcode : 0xb000019 > cpu MHz : 1200.117 > cache size : 56320 KB > physical id : 1 > siblings : 44 > core id : 28 > cpu cores : 22 > apicid : 121 > initial apicid : 121 > fpu : yes > fpu_exception : yes > cpuid level : 20 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge > mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx > pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology > nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx > est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch > epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 > hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc > cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts > bugs : > bogomips : 4391.42 > clflush size : 64 > cache_alignment : 64 > address sizes : 46 bits physical, 48 bits virtual > power management: > > BASE: > > /usr/bin/time ./bt.B.x > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 10.87 > Total threads = 88 > Avail threads = 88 > Mop/s total = 64608.74 > Mop/s/thread = 734.19 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > 956.25user 19.14system 0:11.32elapsed 8616%CPU (0avgtext+0avgdata > 210496maxresident)k > 0inputs+0outputs (0major+57939minor)pagefaults 0swaps > > SERIAL-SYS: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a > -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 0 > affinity (UNSET:0, NODE:1, CPU:2) = 0 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 13.73 > Total threads = 88 > Avail threads = 88 > Mop/s total = 51136.52 > Mop/s/thread = 581.10 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1661,120 MB perf.data ] > > 1184.84user 40.70system 0:14.69elapsed 8341%CPU (0avgtext+0avgdata > 208612maxresident)k > 0inputs+3402072outputs (0major+137077minor)pagefaults 0swaps > > SERIAL-NODE: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 > --affinity=node -a -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 0 > affinity (UNSET:0, NODE:1, CPU:2) = 1 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 13.02 > Total threads = 88 > Avail threads = 88 > Mop/s total = 53924.69 > Mop/s/thread = 612.78 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1557,152 MB perf.data ] > > 1120.42user 29.92system 0:14.03elapsed 8198%CPU (0avgtext+0avgdata > 206388maxresident)k > 0inputs+3189128outputs (0major+149207minor)pagefaults 0swaps > > SERIAL-CPU: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 > --affinity=cpu -a -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 0 > affinity (UNSET:0, NODE:1, CPU:2) = 2 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 11.21 > Total threads = 88 > Avail threads = 88 > Mop/s total = 62642.24 > Mop/s/thread = 711.84 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1365,043 MB perf.data ] > > 976.06user 31.35system 0:12.18elapsed 8264%CPU (0avgtext+0avgdata > 208488maxresident)k > 0inputs+2795704outputs (0major+126032minor)pagefaults 0swaps > > AIO1-SYS: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 > --aio=1 -a -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 1 > affinity (UNSET:0, NODE:1, CPU:2) = 0 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 14.23 > Total threads = 88 > Avail threads = 88 > Mop/s total = 49338.27 > Mop/s/thread = 560.66 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1720,590 MB perf.data ] > > 1229.19user 41.99system 0:15.22elapsed 8350%CPU (0avgtext+0avgdata > 208604maxresident)k > 0inputs+3523880outputs (0major+124670minor)pagefaults 0swaps > > AIO1-NODE: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 > --aio=1 --affinity=node -a -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 1 > affinity (UNSET:0, NODE:1, CPU:2) = 1 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 12.04 > Total threads = 88 > Avail threads = 88 > Mop/s total = 58313.58 > Mop/s/thread = 662.65 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1471,279 MB perf.data ] > > 1055.62user 30.43system 0:13.03elapsed 8333%CPU (0avgtext+0avgdata > 208424maxresident)k > 0inputs+3013288outputs (0major+79088minor)pagefaults 0swaps > > AIO1-CPU: > > /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 > --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x > Using CPUID GenuineIntel-6-4F-1 > nr_cblocks: 1 > affinity (UNSET:0, NODE:1, CPU:2) = 2 > mmap size 528384B > > NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark > > No input file inputbt.data. Using compiled defaults > Size: 102x 102x 102 > Iterations: 200 dt: 0.0003000 > Number of available threads: 88 > > BT Benchmark Completed. > Class = B > Size = 102x 102x 102 > Iterations = 200 > Time in seconds = 12.20 > Total threads = 88 > Avail threads = 88 > Mop/s total = 57538.84 > Mop/s/thread = 653.85 > Operation type = floating point > Verification = SUCCESSFUL > Version = 3.3.1 > Compile date = 20 Sep 2018 > > [ perf record: Captured and wrote 1486,859 MB perf.data ] > > 1051.97user 42.07system 0:13.09elapsed 8352%CPU (0avgtext+0avgdata > 206388maxresident)k > 0inputs+3045168outputs (0major+174612minor)pagefaults 0swaps >