Hi,

On 21.09.2018 15:15, Alexey Budankov wrote:
> Hello Jiri,
>
> On 21.09.2018 9:13, Alexey Budankov wrote:
>> Hello Jiri,
>>
>> On 14.09.2018 12:37, Alexey Budankov wrote:
>>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>>
>>>> SNIP
>>>>
>>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>>> and there are probably more limitations which I haven't spotted
>>>>>>> yet.
>>>>>>>
>>>>>>> So far I tested on a laptop:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>>
>>>>>>> and one bigger server:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>>
>>>>>>> I can see a decrease in recorded LOST events, but both the benchmark
>>>>>>> and the monitoring must be carefully configured wrt:
>>>>>>> - number of events (frequency)
>>>>>>> - size of the memory maps
>>>>>>> - size of events (callchains)
>>>>>>> - final perf.data size
>>>>>>>
>>>>>>> It's also available in:
>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>>> perf/record_threads
>>>>>>>
>>>>>>> thoughts? ;-) thanks
>>>>>>> jirka
>>>>>>
>>>>>> It is preferable to split into smaller pieces that bring
>>>>>> some improvement proved by metrics numbers and are ready for
>>>>>> merging and upstream. Do we have more metrics than the
>>>>>> data loss from the trace AIO patches?
>>>>>
>>>>> well the primary focus is to get more events in,
>>>>> so the LOST metric is the main one
>>>>
>>>> actually I was hoping you could please run it through the same
>>>> tests as you do for the AIO code on some huge server?
>>>
>>> Yeah, I will, but it takes some time.
>>
>> Here it is:
>>
>> Hardware:
>> cat /proc/cpuinfo
>> processor : 271
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 133
>> model name : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
>> stepping : 0
>> microcode : 0xe
>> cpu MHz : 1064.235
>> cache size : 1024 KB
>> physical id : 0
>> siblings : 272
>> core id : 73
>> cpu cores : 68
>> apicid : 295
>> initial apicid : 295
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 13
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
>> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
>> vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
>> ring3mwait cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid
>> fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf
>> avx512er avx512cd xsaveopt dtherm ida arat pln pts avx512_vpopcntdq
>> avx512_4vnniw avx512_4fmaps
>> bugs : cpu_meltdown spectre_v1 spectre_v2
>> bogomips : 2594.07
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 46 bits physical, 48 bits virtual
>> power management:
>>
>> uname -a
>> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> cat /proc/sys/kernel/perf_event_paranoid
>> 0
>>
>> cat /proc/sys/kernel/perf_event_mlock_kb
>> 516
>>
>> cat /proc/sys/kernel/perf_event_max_sample_rate
>> 3000
>>
>> cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>>
>> Metrics:
>> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
>> data loss (%)        : paused_time / elapsed_time_under_profiling
>> LOST events          : stat from perf report --stats
>> SAMPLE events        : stat from perf report --stats
>> perf.data size (B)   : size of trace file on disk
>>
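To make the two derived metrics above concrete, here is a minimal shell sketch of the arithmetic. The 6.81 s / 12.58 s figures are the serial matrix.gcc numbers reported later in this message; the 1.13 s paused time is an assumed value chosen only to illustrate the formula:

```shell
# Sketch of the two derived metrics; inputs come from /usr/bin/time
# and the ring-buffer paused time. The paused value here is assumed.
elapsed=6.81      # wall time without profiling (s)
profiled=12.58    # wall time under profiling (s)
paused=1.13       # time events were paused (s), assumed for illustration

awk -v p="$profiled" -v e="$elapsed" \
    'BEGIN { printf "runtime overhead : %.1fx ~ %s / %s\n", p / e, p, e }'
awk -v x="$paused" -v p="$profiled" \
    'BEGIN { printf "data loss (%%)    : %.0f\n", 100 * x / p }'
```

With these inputs the sketch reproduces the "1.8x ~ 12.58 / 6.81" and "9" entries from the serial table.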
>> Events:
>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>
>> =================================================
>>
>> Command:
>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>> -e cpu/period=P,event=0x3c/Duk,\
>> cpu/period=P,umask=0x3/Duk,\
>> cpu/period=P,event=0xc0/Duk,\
>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>
>> Workload: matrix multiplication in 256 threads
>>
>> /usr/bin/time ./matrix.icc
>> Addr of buf1 = 0x7ff9faa73010
>> Offs of buf1 = 0x7ff9faa73180
>> Addr of buf2 = 0x7ff9f8a72010
>> Offs of buf2 = 0x7ff9f8a721c0
>> Addr of buf3 = 0x7ff9f6a71010
>> Offs of buf3 = 0x7ff9f6a71100
>> Addr of buf4 = 0x7ff9f4a70010
>> Offs of buf4 = 0x7ff9f4a70140
>> Threads #: 256 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Freq = 0.997720 GHz
>> Execution time = 9.061 seconds
>> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata
>> 100448maxresident)k
>> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
>>
>> T                    : 272
>> P (period, ms)       : 0.1
>> runtime overhead (%) : 45x ~ 323.54 / 7.12
>> data loss (%)        : 96
>> LOST events          : 323662
>> SAMPLE events        : 31885479
>> perf.data size (GiB) : 42
>>
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 25x ~ 180.76 / 7.12
>> data loss (%)        : 69
>> LOST events          : 10636
>> SAMPLE events        : 18692998
>> perf.data size (GiB) : 23.5
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 16x ~ 119.49 / 7.12
>> data loss (%)        : 1
>> LOST events          : 6
>> SAMPLE events        : 11178524
>> perf.data size (GiB) : 14
>>
>> T                    : 128
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 15x ~ 111.98 / 7.12
>> data loss (%)        : 62
>> LOST events          : 2825
>> SAMPLE events        : 11267247
>> perf.data size (GiB) : 15
>>
>> T                    : 64
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 14x ~ 101.55 / 7.12
>> data loss (%)        : 67
>> LOST events          : 5155
>> SAMPLE events        : 10966297
>> perf.data size (GiB) : 13.7
>>
>> Workload: matrix multiplication in 128 threads
>>
>> /usr/bin/time ./matrix.gcc
>> Addr of buf1 = 0x7f072e630010
>> Offs of buf1 = 0x7f072e630180
>> Addr of buf2 = 0x7f072c62f010
>> Offs of buf2 = 0x7f072c62f1c0
>> Addr of buf3 = 0x7f072a62e010
>> Offs of buf3 = 0x7f072a62e100
>> Addr of buf4 = 0x7f072862d010
>> Offs of buf4 = 0x7f072862d140
>> Threads #: 128 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Execution time = 6.639 seconds
>> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata
>> 100756maxresident)k
>> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
>>
>> T                    : 272
>> P (period, ms)       : 0.1
>> runtime overhead (%) : 29x ~ 198.81 / 6.81
>> data loss (%)        : 21
>> LOST events          : 2502
>> SAMPLE events        : 22481062
>> perf.data size (GiB) : 27.6
>>
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 13x ~ 88.47 / 6.81
>> data loss (%)        : 0
>> LOST events          : 0
>> SAMPLE events        : 9572787
>> perf.data size (GiB) : 11.3
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 10x ~ 67.11 / 6.81
>> data loss (%)        : 1
>> LOST events          : 137
>> SAMPLE events        : 6985930
>> perf.data size (GiB) : 8
>>
>> T                    : 128
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 9.5x ~ 64.33 / 6.81
>> data loss (%)        : 1
>> LOST events          : 3
>> SAMPLE events        : 6666903
>> perf.data size (GiB) : 7.8
>>
>> T                    : 64
>> P (period, ms)       : 0.25
>> runtime overhead (%) : 17x ~ 114.27 / 6.81
>> data loss (%)        : 2
>> LOST events          : 52
>> SAMPLE events        : 12643645
>> perf.data size (GiB) : 15.5
>>
>> P (period, ms)       : 0.35
>> runtime overhead (%) : 10x ~ 68.60 / 6.81
>> data loss (%)        : 1
>> LOST events          : 93
>> SAMPLE events        : 7164368
>> perf.data size (GiB) : 8.5
>
> and this is for AIO and serial:
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> N                    : 512
> P (period, ms)       : 2.5
> runtime overhead (%) : 2.7x ~ 19.21 / 7.12
> data loss (%)        : 42
> LOST events          : 1600
> SAMPLE events        : 1235928
> perf.data size (GiB) : 1.5
>
> N                    : 272
> P (period, ms)       : 1.5
> runtime overhead (%) : 2.5x ~ 18.09 / 7.12
> data loss (%)        : 89
> LOST events          : 3457
> SAMPLE events        : 1222143
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2
> runtime overhead (%) : 2.5x ~ 17.93 / 7.12
> data loss (%)        : 65
> LOST events          : 2496
> SAMPLE events        : 1240754
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2.5
> runtime overhead (%) : 2.5x ~ 17.87 / 7.12
> data loss (%)        : 44
> LOST events          : 1621
> SAMPLE events        : 1221949
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 3
> runtime overhead (%) : 2.5x ~ 18.43 / 7.12
> data loss (%)        : 12
> LOST events          : 350
> SAMPLE events        : 1117972
> perf.data size (GiB) : 1.3
>
> N                    : 128
> P (period, ms)       : 3
> runtime overhead (%) : 2.4x ~ 17.08 / 7.12
> data loss (%)        : 11
> LOST events          : 335
> SAMPLE events        : 1116832
> perf.data size (GiB) : 1.3
>
> N                    : 64
> P (period, ms)       : 3
> runtime overhead (%) : 2.2x ~ 16.03 / 7.12
> data loss (%)        : 11
> LOST events          : 329
> SAMPLE events        : 1108205
> perf.data size (GiB) : 1.3
>
> Workload: matrix multiplication in 128 threads
>
> N                    : 512
> P (period, ms)       : 1
> runtime overhead (%) : 3.5x ~ 23.72 / 6.81
> data loss (%)        : 18
> LOST events          : 1043
> SAMPLE events        : 2015306
> perf.data size (GiB) : 2.3
>
> N                    : 272
> P (period, ms)       : 0.5
> runtime overhead (%) : 3x ~ 22.72 / 6.81
> data loss (%)        : 90
> LOST events          : 5842
> SAMPLE events        : 2205937
> perf.data size (GiB) : 2.5
>
> P (period, ms)       : 1
> runtime overhead (%) : 3x ~ 22.79 / 6.81
> data loss (%)        : 11
> LOST events          : 481
> SAMPLE events        : 2017099
> perf.data size (GiB) : 2.5
>
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.93 / 6.81
> data loss (%)        : 5
> LOST events          : 190
> SAMPLE events        : 1308692
> perf.data size (GiB) : 1.5
>
> P (period, ms)       : 2
> runtime overhead (%) : 3x ~ 18.95 / 6.81
> data loss (%)        : 0
> LOST events          : 0
> SAMPLE events        : 1010769
> perf.data size (GiB) : 1.2
>
> N                    : 128
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.08 / 6.81
> data loss (%)        : 6
> LOST events          : 220
> SAMPLE events        : 1322240
> perf.data size (GiB) : 1.5
>
> N                    : 64
> P (period, ms)       : 1.5
> runtime overhead (%) : 3x ~ 19.43 / 6.81
> data loss (%)        : 3
> LOST events          : 130
> SAMPLE events        : 1386521
> perf.data size (GiB) : 1.6
>
> =================================================
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> P (period, ms)       : 7.5
> runtime overhead (%) : 1.6x ~ 11.6 / 7.12
> data loss (%)        : 1
> LOST events          : 1
> SAMPLE events        : 451062
> perf.data size (GiB) : 0.5
>
> Workload: matrix multiplication in 128 threads
>
> P (period, ms)       : 3
> runtime overhead (%) : 1.8x ~ 12.58 / 6.81
> data loss (%)        : 9
> LOST events          : 147
> SAMPLE events        : 673299
> perf.data size (GiB) : 0.8
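For reference, the LOST and SAMPLE counters in the tables above come from `perf report --stats`. A filter along these lines can pull them out; the heredoc below is a made-up stats excerpt standing in for a real report, and only the "NAME events: COUNT" line shape is assumed:

```shell
# Sketch: extract the LOST/SAMPLE counters used in the tables above.
# A real run would pipe `perf report --stats -i perf.data` into the
# awk filter instead of the made-up heredoc.
awk '/SAMPLE events:/ { sample = $NF }
     /LOST events:/   { lost   = $NF }
     END { printf "LOST=%s SAMPLE=%s\n", lost, sample }' <<'EOF'
TOTAL events:  11178530
SAMPLE events: 11178524
LOST events:          6
EOF
```

For this excerpt the filter prints `LOST=6 SAMPLE=11178524`.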
Please see below data that is easier to compare across the P (period, ms), runtime overhead and data loss metrics at the same time. It starts from the serial implementation as the baseline and then demonstrates the possible improvement from the configurable --aio(=N) and --threads(=T) implementations. Smaller P values, with data loss and runtime overhead equal to or close to the serial baseline, indicate a possible gain.

Workload: matrix multiplication in 128 threads

Serial:

P (period, ms)       : 3
runtime overhead (%) : 1.8x ~ 12.58 / 6.81
data loss (%)        : 9
LOST events          : 147
SAMPLE events        : 673299
perf.data size (GiB) : 0.8

AIO:

N                    : 1
P (period, ms)       : 3
runtime overhead (%) : 1.8x ~ 12.42 / 6.81
data loss (%)        : 2
LOST events          : 19
SAMPLE events        : 664749
perf.data size (GiB) : 0.75

N                    : 4
P (period, ms)       : 1.8
runtime overhead (%) : 1.8x ~ 12.74 / 6.81
data loss (%)        : 10
LOST events          : 257
SAMPLE events        : 1079250
perf.data size (GiB) : 1.25

Threads:

T                    : 1
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 17.73 / 6.81
data loss (%)        : 6
LOST events          : 95
SAMPLE events        : 665844
perf.data size (GiB) : 0.78

T                    : 2
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 18.04 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 662075
perf.data size (GiB) : 0.8

P (period, ms)       : 1.8
runtime overhead (%) : 3x ~ 20.83 / 6.81
data loss (%)        : 4
LOST events          : 76
SAMPLE events        : 1085826
perf.data size (GiB) : 1.25

T                    : 4
P (period, ms)       : 3
runtime overhead (%) : 2.6x ~ 17.85 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 665262
perf.data size (GiB) : 0.78

P (period, ms)       : 1.8
runtime overhead (%) : 3x ~ 21.15 / 6.81
data loss (%)        : 0
LOST events          : 0
SAMPLE events        : 1126563
perf.data size (GiB) : 1.3

P (period, ms)       : 1
runtime overhead (%) : 4.35x ~ 29.6 / 6.81
data loss (%)        : 0
LOST events          : 6
SAMPLE events        : 2124837
perf.data size (GiB) : 2.5

P (period, ms)       : 0.8
runtime overhead (%) : 4.8x ~ 32.62 / 6.81
data loss (%)        : 12
LOST events          : 536
SAMPLE events        : 2620345
perf.data size (GiB) : 3

Thanks,
Alexey

>
> Thanks,
> Alexey
>
>>
>> Thanks,
>> Alexey
>>
>>>
>>>>
>>>> thanks,
>>>> jirka
>>>>
>>>
>>
>