Hi,

On 21.09.2018 15:15, Alexey Budankov wrote:
> Hello Jiri,
>
> On 21.09.2018 9:13, Alexey Budankov wrote:
>> Hello Jiri,
>>
>> On 14.09.2018 12:37, Alexey Budankov wrote:
>>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>>
>>>> SNIP
>>>>
>>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>>> and there are probably more limitations which I haven't spotted
>>>>>>> yet.
>>>>>>>
>>>>>>> So far I tested on a laptop:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>>
>>>>>>> and one bigger server:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>>
>>>>>>> I can see a decrease in recorded LOST events, but both the benchmark
>>>>>>> and the monitoring must be carefully configured wrt:
>>>>>>> - number of events (frequency)
>>>>>>> - size of the memory maps
>>>>>>> - size of events (callchains)
>>>>>>> - final perf.data size
>>>>>>>
>>>>>>> It's also available in:
>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>>> perf/record_threads
>>>>>>>
>>>>>>> thoughts? ;-) thanks
>>>>>>> jirka
>>>>>>
>>>>>> It is preferable to split into smaller pieces that bring
>>>>>> some improvement proved by metrics numbers and are ready for
>>>>>> merging and upstream. Do we have more metrics than the
>>>>>> data loss from the trace AIO patches?
>>>>>
>>>>> well the primary focus is to get more events in,
>>>>> so the LOST metric is the main one
>>>>
>>>> actually I was hoping you could please run it through the same
>>>> tests as you do for the AIO code on some huge server?
>>>
>>> Yeah, I will, but it takes some time.
>>
>> Here it is:
>>
>> Hardware:
>> cat /proc/cpuinfo
>> processor : 271
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 133
>> model name : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
>> stepping : 0
>> microcode : 0xe
>> cpu MHz : 1064.235
>> cache size : 1024 KB
>> physical id : 0
>> siblings : 272
>> core id : 73
>> cpu cores : 68
>> apicid : 295
>> initial apicid : 295
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 13
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
>> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
>> vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
>> ring3mwait cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid
>> fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf
>> avx512er avx512cd xsaveopt dtherm ida arat pln pts avx512_vpopcntdq
>> avx512_4vnniw avx512_4fmaps
>> bugs : cpu_meltdown spectre_v1 spectre_v2
>> bogomips : 2594.07
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 46 bits physical, 48 bits virtual
>> power management:
>>
>> uname -a
>> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> cat /proc/sys/kernel/perf_event_paranoid
>> 0
>>
>> cat /proc/sys/kernel/perf_event_mlock_kb
>> 516
>>
>> cat /proc/sys/kernel/perf_event_max_sample_rate
>> 3000
>>
>> cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>>
>> Metrics:
>> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
>> data loss (%)        : paused_time / elapsed_time_under_profiling
>> LOST events          : stat from perf report --stats
>> SAMPLE events        : stat from perf report --stats
>> perf.data size (B)   : size of trace file on disk
>>
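To make the two derived metrics above concrete, here is a minimal shell sketch of the arithmetic. The 6.81 s / 12.58 s figures are the serial matrix.gcc numbers reported later in this message; the 1.13 s paused time is an assumed value chosen only to illustrate the formula:

```shell
# Sketch of the two derived metrics; inputs come from /usr/bin/time
# and the ring-buffer paused time. The paused value here is assumed.
elapsed=6.81      # wall time without profiling (s)
profiled=12.58    # wall time under profiling (s)
paused=1.13       # time events were paused (s), assumed for illustration

awk -v p="$profiled" -v e="$elapsed" \
    'BEGIN { printf "runtime overhead : %.1fx ~ %s / %s\n", p / e, p, e }'
awk -v x="$paused" -v p="$profiled" \
    'BEGIN { printf "data loss (%%)    : %.0f\n", 100 * x / p }'
```

With these inputs the sketch reproduces the "1.8x ~ 12.58 / 6.81" and "9" entries from the serial table.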
>> Events:
>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>
>> =================================================
>>
>> Command:
>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>> -e cpu/period=P,event=0x3c/Duk,\
>> cpu/period=P,umask=0x3/Duk,\
>> cpu/period=P,event=0xc0/Duk,\
>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>
>> Workload: matrix multiplication in 256 threads
>>
>> /usr/bin/time ./matrix.icc
>> Addr of buf1 = 0x7ff9faa73010
>> Offs of buf1 = 0x7ff9faa73180
>> Addr of buf2 = 0x7ff9f8a72010
>> Offs of buf2 = 0x7ff9f8a721c0
>> Addr of buf3 = 0x7ff9f6a71010
>> Offs of buf3 = 0x7ff9f6a71100
>> Addr of buf4 = 0x7ff9f4a70010
>> Offs of buf4 = 0x7ff9f4a70140
>> Threads #: 256 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Freq = 0.997720 GHz
>> Execution time = 9.061 seconds
>> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata
>> 100448maxresident)k
>> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
>>
>> T                    : 272
>> P (period, ms)       : 0.1
>> runtime overhead (%) : 45x ~ 323.54 / 7.12
>> data loss (%)        : 96
>> LOST events          : 323662
>> SAMPLE events        : 31885479
>> perf.data size (GiB) : 42
>>
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 25x ~ 180.76 / 7.12
>> data loss (%)        : 69
>> LOST events          : 10636
>> SAMPLE events        : 18692998
>> perf.data size (GiB) : 23.5
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 16x ~ 119.49 / 7.12
>> data loss (%)        : 1
>> LOST events          : 6
>> SAMPLE events        : 11178524
>> perf.data size (GiB) : 14
>>
>> T                    : 128
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 15x ~ 111.98 / 7.12
>> data loss (%)        : 62
>> LOST events          : 2825
>> SAMPLE events        : 11267247
>> perf.data size (GiB) : 15
>>
>> T                    : 64
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 14x ~ 101.55 / 7.12
>> data loss (%)        : 67
>> LOST events          : 5155
>> SAMPLE events        : 10966297
>> perf.data size (GiB) : 13.7
>>
>> Workload: matrix multiplication in 128 threads
>>
>> /usr/bin/time ./matrix.gcc
>> Addr of buf1 = 0x7f072e630010
>> Offs of buf1 = 0x7f072e630180
>> Addr of buf2 = 0x7f072c62f010
>> Offs of buf2 = 0x7f072c62f1c0
>> Addr of buf3 = 0x7f072a62e010
>> Offs of buf3 = 0x7f072a62e100
>> Addr of buf4 = 0x7f072862d010
>> Offs of buf4 = 0x7f072862d140
>> Threads #: 128 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Execution time = 6.639 seconds
>> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata
>> 100756maxresident)k
>> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
>>
>> T                    : 272
>> P (period, ms)       : 0.1
>> runtime overhead (%) : 29x ~ 198.81 / 6.81
>> data loss (%)        : 21
>> LOST events          : 2502
>> SAMPLE events        : 22481062
>> perf.data size (GiB) : 27.6
>>
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 13x ~ 88.47 / 6.81
>> data loss (%)        : 0
>> LOST events          : 0
>> SAMPLE events        : 9572787
>> perf.data size (GiB) : 11.3
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 10x ~ 67.11 / 6.81
>> data loss (%)        : 1
>> LOST events          : 137
>> SAMPLE events        : 6985930
>> perf.data size (GiB) : 8
>>
>> T                    : 128
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 9.5x ~ 64.33 / 6.81
>> data loss (%)        : 1
>> LOST events          : 3
>> SAMPLE events        : 6666903
>> perf.data size (GiB) : 7.8
>>
>> T                    : 64
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 17x ~ 114.27 / 6.81
>> data loss (%)        : 2
>> LOST events          : 52
>> SAMPLE events        : 12643645
>> perf.data size (GiB) : 15.5
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 10x ~ 68.60 / 6.81
>> data loss (%)        : 1
>> LOST events          : 93
>> SAMPLE events        : 7164368
>> perf.data size (GiB) : 8.5
>
> and this is for AIO and serial:
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> N                    : 512
> P (period, ms)       : 2.5
> runtime overhead (%) : 2.7x ~ 19.21 / 7.12
> data loss (%)        : 42
> LOST events          : 1600
> SAMPLE events        : 1235928
> perf.data size (GiB) : 1.5
>
> N                    : 272
> P (period, ms)       : 1.5
> runtime overhead (%) : 2.5x ~ 18.09 / 7.12
> data loss (%)        : 89
> LOST events          : 3457
> SAMPLE events        : 1222143
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2
> runtime overhead (%) : 2.5x ~ 17.93 / 7.12
> data loss (%)        : 65
> LOST events          : 2496
> SAMPLE events        : 1240754
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2.5
> runtime overhead (%) : 2.5x ~ 17.87 / 7.12
> data loss (%)        : 44
> LOST events          : 1621
> SAMPLE events        : 1221949
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 3
> runtime overhead (%) : 2.5x ~ 18.43 / 7.12
> data loss (%)        : 12
> LOST events          : 350
> SAMPLE events        : 1117972
> perf.data size (GiB) : 1.3
>
> N                    : 128
> P (period, ms)       : 3
> runtime overhead (%) : 2.4x ~ 17.08 / 7.12
> data loss (%)        : 11
> LOST events          : 335
> SAMPLE events        : 1116832
> perf.data size (GiB) : 1.3
>
> N                    : 64
> P (period, ms)       : 3
> runtime overhead (%) : 2.2x ~ 16.03 / 7.12
> data loss (%)        : 11
> LOST events          : 329
> SAMPLE events        : 1108205
> perf.data size (GiB) : 1.3
>
> Workload: matrix multiplication in 128 threads
>
> N                    : 512
> P (period, ms)       : 1
> runtime overhead (%) : 3.5x ~ 23.72 / 6.81
> data loss (%)        : 18
> LOST events          : 1043
> SAMPLE events        : 2015306
> perf.data size (GiB) : 2.3
>
> N                    : 272
> P (period, ms)       : 0.5
> runtime overhead (%) : 3x ~ 22.72 / 6.81
> data loss (%)        : 90
> LOST events          : 5842
> SAMPLE events        : 2205937
> perf.data size (GiB) : 2.5
>
> P (period, ms)       : 1
> runtime overhead (%) : 3x ~ 22.79 / 6.81
> data loss (%)        : 11
> LOST events          : 481
> SAMPLE events        : 2017099
> perf.data size (GiB) : 2.5
>
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.93 / 6.81
> data loss (%)        : 5
> LOST events          : 190
> SAMPLE events        : 1308692
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2
> runtime overhead (%) : 3x ~ 18.95 / 6.81
> data loss (%)        : 0
> LOST events          : 0
> SAMPLE events        : 1010769
> perf.data size (GiB) : 1.2
>
> N                    : 128
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.08 / 6.81
> data loss (%)        : 6
> LOST events          : 220
> SAMPLE events        : 1322240
> perf.data size (GiB) : 1.5
>
> N                    : 64
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.43 / 6.81
> data loss (%)        : 3
> LOST events          : 130
> SAMPLE events        : 1386521
> perf.data size (GiB) : 1.6
>
> =================================================
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> P (period, ms)       : 7.5
> runtime overhead (%) : 1.6x ~ 11.6 / 7.12
> data loss (%)        : 1
> LOST events          : 1
> SAMPLE events        : 451062
> perf.data size (GiB) : 0.5
>
> Workload: matrix multiplication in 128 threads
>
> P (period, ms)       : 3
> runtime overhead (%) : 1.8x ~ 12.58 / 6.81
> data loss (%)        : 9
> LOST events          : 147
> SAMPLE events        : 673299
> perf.data size (GiB) : 0.8
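For reference, the LOST and SAMPLE counters in the tables above come from `perf report --stats`. A filter along these lines can pull them out; the heredoc below is a made-up stats excerpt standing in for a real report, and only the "NAME events: COUNT" line shape is assumed:

```shell
# Sketch: extract the LOST/SAMPLE counters used in the tables above.
# A real run would pipe `perf report --stats -i perf.data` into the
# awk filter instead of the made-up heredoc.
awk '/SAMPLE events:/ { sample = $NF }
     /LOST events:/   { lost   = $NF }
     END { printf "LOST=%s SAMPLE=%s\n", lost, sample }' <<'EOF'
TOTAL events:  11178530
SAMPLE events: 11178524
LOST events:          6
EOF
```

For this excerpt the filter prints `LOST=6 SAMPLE=11178524`.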
Please see below data that is easier to compare across the P (period, ms), runtime overhead and data loss metrics at the same time. It starts from the serial implementation as the baseline and then demonstrates the possible improvement from the configurable --aio(=N) and --threads(=T) implementations. Smaller P values, with data loss and runtime overhead equal to or close to the serial baseline, indicate a possible gain.

Workload: matrix multiplication in 128 threads

Serial:

P (period, ms)       : 3
runtime overhead (%) : 1.8x ~ 12.58 / 6.81
data loss (%)        : 9
LOST events          : 147
SAMPLE events        : 673299
perf.data size (GiB) : 0.8

AIO:

N                    : 1
P (period, ms)       : 3
runtime overhead (%) : 1.8x ~ 12.42 / 6.81
data loss (%)        : 2
LOST events          : 19
SAMPLE events        : 664749
perf.data size (GiB) : 0.75

N                    : 4
P (period, ms)       : 1.8
runtime overhead (%) : 1.8x ~ 12.74 / 6.81
data loss (%)        : 10
LOST events          : 257
SAMPLE events        : 1079250
perf.data size (GiB) : 1.25

Threads:

T                    : 1
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 17.73 / 6.81
data loss (%)        : 6
LOST events          : 95
SAMPLE events        : 665844
perf.data size (GiB) : 0.78

T                    : 2
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 18.04 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 662075
perf.data size (GiB) : 0.8

P (period, ms)       : 1.8
runtime overhead (%) : 3x ~ 20.83 / 6.81
data loss (%)        : 4
LOST events          : 76
SAMPLE events        : 1085826
perf.data size (GiB) : 1.25

T                    : 4
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 17.85 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 665262
perf.data size (GiB) : 0.78

P (period, ms)       : 1.8
runtime overhead (%) : 3x ~ 21.15 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 1126563
perf.data size (GiB) : 1.3

P (period, ms)       : 1
runtime overhead (%) : 4.35x ~ 29.6 / 6.81
data loss (%)        : 0
LOST events          : 6
SAMPLE events        : 2124837
perf.data size (GiB) : 2.5

P (period, ms)       : 0.8
runtime overhead (%) : 4.8x ~ 32.62 / 6.81
data loss (%)        : 12
LOST events          : 536
SAMPLE events        : 2620345
perf.data size (GiB) : 3

Thanks,
Alexey

>
> Thanks,
> Alexey
>
>>
>> Thanks,
>> Alexey
>>
>>>
>>>>
>>>> thanks,
>>>> jirka
>>>>
>>>
>>
>