Hi Morten,

> -----Original Message-----
> From: Morten Brørup <m...@smartsharesystems.com>
> Sent: Tuesday, November 29, 2022 11:43 AM
> To: Tomasz Duszynski <tduszyn...@marvell.com>; dev@dpdk.org
> Cc: tho...@monjalon.net; Jerin Jacob Kollanukkaran <jer...@marvell.com>
> Subject: [EXT] RE: [PATCH v3 0/4] add support for self monitoring
> 
> > From: Tomasz Duszynski [mailto:tduszyn...@marvell.com]
> > Sent: Tuesday, 29 November 2022 10.28
> >
> > This series adds self monitoring support, i.e. it allows configuring
> > and reading performance measurement unit (PMU) counters at runtime
> > without using the perf utility. This has certain advantages when an
> > application runs on isolated cores with the nohz_full kernel parameter.
> >
> > Events can be read directly using rte_pmu_read() or using dedicated
> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
> > be stored inside CTF file.
> >
> > By design, all enabled events are grouped together and the same group
> > is attached to lcores that use self monitoring functionality.
> >
> > Events are enabled by name. Names need to be read from the standard
> > location under sysfs, i.e.
> >
> > /sys/bus/event_source/devices/PMU/events
> >
> > where PMU is a core PMU, i.e. one measuring CPU events. As of today,
> > raw events are not supported.
> 
> Hi Tomasz,
> 
> I am very interested in this patch series for fast path profiling purposes.
> (Not using EAL trace, but our proprietary profiler.)
> 
> However, it seems that rte_pmu_read() is quite long-winded compared to
> rte_pmu_pmc_read().
> 

We need a bit of extra logic to set things up before reading the actual
counter, but in reality the cycles are mostly consumed by
rte_pmu_pmc_read(). This obviously differs among platforms, so if you want
precise measurements you need to get your hands dirty.

That said, below are results from dpdk-test after running
trace_perf_autotest, just to give you some idea.

X86-64

RTE>>trace_perf_autotest
Timer running at 3000.00MHz
            void: cycles=17.739375 ns=5.913125
             u64: cycles=17.348296 ns=5.782765
             int: cycles=17.098724 ns=5.699575
           float: cycles=17.099946 ns=5.699982
          double: cycles=17.229702 ns=5.743234
          string: cycles=31.159907 ns=10.386636
         void_fp: cycles=0.679842 ns=0.226614
        read_pmu: cycles=49.325117 ns=16.441706

ARM64 with RTE_ARM_EAL_RDTSC_USE_PMU

RTE>>trace_perf_autotest
Timer running at 2480.00MHz
            void: cycles=9.413568 ns=3.795793
             u64: cycles=9.386003 ns=3.784678
             int: cycles=9.438701 ns=3.805928
           float: cycles=9.359377 ns=3.773942
          double: cycles=9.372279 ns=3.779145
          string: cycles=24.474899 ns=9.868911
         void_fp: cycles=0.505513 ns=0.203836
        read_pmu: cycles=17.442853 ns=7.033409
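
To read the tables above: the ns column is the cycles column divided by the
timer frequency. The two helpers below are a minimal sketch of that
arithmetic. The function names are hypothetical and are not part of DPDK.
Treating the gap between the read_pmu and void tracepoints as the extra cost
of a PMU read is an assumption made for illustration, not a measurement
method taken from the series.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a timer running at timer_mhz.
 * One cycle lasts 1000 / timer_mhz nanoseconds. */
double cycles_to_ns(double cycles, double timer_mhz)
{
    assert(timer_mhz > 0.0);
    return cycles * 1000.0 / timer_mhz;
}

/* Rough extra cost, in cycles, of one tracepoint over the empty "void"
 * tracepoint (assumption: the baseline overhead is the same for both). */
double extra_cycles(double tracepoint_cycles, double void_cycles)
{
    return tracepoint_cycles - void_cycles;
}
```

For example, cycles_to_ns(17.739375, 3000.0) reproduces the 5.913125 ns shown
for void on x86-64. Under the assumption above, extra_cycles() gives about
31.6 cycles for read_pmu on the x86-64 run and about 8.0 cycles on the ARM64
run.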

> But perhaps I am just worrying too much, so I will ask: What is the
> performance cost of using rte_pmu_read() - compared to
> rte_pmu_pmc_read() - in the fast path?
> 
> If there is a non-negligible difference, could you please provide an
> example of how to configure PMU events and use rte_pmu_pmc_read() in an
> application?
> 

The series comes with some documentation, so you can check there how to run
it.

> I would primarily be interested in data cache misses and branch
> mispredictions. But feel free to make your own choices for the example.

Raw events are not supported right now, which means you don't have fine
control over all events. You can use only events from the CPU PMU
(/sys/bus/event_source/devices/<PMU>/events).
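
To see which event names a given PMU exposes, one can simply list that sysfs
directory. The helper below is a minimal sketch using plain POSIX calls. The
name list_pmu_events() is hypothetical and not part of the series, and the
caller has to supply the directory, since the PMU name differs between
platforms (it is "cpu" on many x86-64 systems).

```c
#include <dirent.h>
#include <stdio.h>

/* Print the entries found in events_dir, one per line.
 * Returns the number of entries printed, or -1 if the directory cannot
 * be opened. Some PMUs also place auxiliary files such as NAME.scale or
 * NAME.unit here; this sketch prints them as well. */
int list_pmu_events(const char *events_dir)
{
    DIR *dir = opendir(events_dir);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        /* Skip ".", ".." and other hidden entries. */
        if (entry->d_name[0] == '.')
            continue;
        printf("%s\n", entry->d_name);
        count++;
    }

    closedir(dir);
    return count;
}
```

A call such as list_pmu_events("/sys/bus/event_source/devices/cpu/events")
would then print the names that can be checked for cache-miss or
branch-misprediction events on that machine.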

