On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <m...@smartsharesystems.com> wrote: > > > From: Jerin Jacob [mailto:jerinjac...@gmail.com] > > Sent: Friday, 13 September 2024 13.24 > > > > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se> > > wrote: > > > > > > On 2024-09-12 17:11, Jerin Jacob wrote: > > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hof...@lysator.liu.se> > > wrote: > > > >> > > > >> On 2024-09-12 15:09, Jerin Jacob wrote: > > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom > > > >>> <mattias.ronnb...@ericsson.com> wrote: > > > >>>> +static double > > > >>>> +benchmark_access_method(void (*init_fun)(void), void > > (*update_fun)(void)) > > > >>>> +{ > > > >>>> + uint64_t i; > > > >>>> + uint64_t start; > > > >>>> + uint64_t end; > > > >>>> + double latency; > > > >>>> + > > > >>>> + init_fun(); > > > >>>> + > > > >>>> + start = rte_get_timer_cycles(); > > > >>>> + > > > >>>> + for (i = 0; i < ITERATIONS; i++) > > > >>>> + update_fun(); > > > >>>> + > > > >>>> + end = rte_get_timer_cycles(); > > > >>> > > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate > > > >> > > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not. > > > > > > > > I was thinking in another way, with 1e7 iteration, the additional > > > > barrier on precise will be amortized, and we get more _deterministic_ > > > > behavior e.s.p in case if we print cycles and if we need to catch > > > > regressions. > > > > > > If you time a section of code which spends ~40000000 cycles, it doesn't > > > matter if you add or remove a few cycles at the beginning and the end. > > > > > > The rte_rdtsc_precise() is both better (more precise in the sense of > > > more serialization), and worse (because it's more costly, and thus more > > > intrusive). > > > > We can calibrate the overhead to remove the cost. > > > > > > > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It > > > doesn't matter. > > > > Yes. In this setup and it is pretty inaccurate PER iteration. Please > > refer to the below patch to see the difference. > > No, Mattias is right. The time is sampled once before the loop, then the > function is executed 10 million (ITERATIONS) times in the loop, and then the > time is sampled once again.
No. I am not disagreeing. That why I said, “Yes. In this setup”. All I am saying, there is a more accurate way of doing measurement for this test along with “data” at https://mails.dpdk.org/archives/dev/2024-September/301227.html > > So the overhead and accuracy of the timing function is amortized across the > 10 million calls to the function being measured, and becomes insignificant. > > Other perf tests also do it this way, and also use rte_get_timer_cycles(). > E.g. the mempool_perf test. > > Another detail: The for loop itself may cost a few cycles, which may not be > irrelevant when measuring a function using very few cycles. If the compiler > doesn't unroll the loop, it should be done manually: > > for (i = 0; i < ITERATIONS / 100; i++) { > update_fun(); > update_fun(); > ... repeated 100 times I have done a similar scheme for trace perf for inline function test at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30 Either the above scheme or the below scheme needs to be used as mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html + for (i = 0; i < ITERATIONS; i++) { + start = rte_rdtsc_precise(); update_fun(); + end = rte_rdtsc_precise(); + latency += (end - start) - tsc_latency; + } > } > > > > > > Patch 1: Make nanoseconds to cycles per iteration > > ------------------------------------------------------------------ > > > > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c > > index ea1d7ba90b52..b8d25400f593 100644 > > --- a/app/test/test_lcore_var_perf.c > > +++ b/app/test/test_lcore_var_perf.c > > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), > > void (*update_fun)(void)) > > > > end = rte_get_timer_cycles(); > > > > - latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS; > > + latency = ((end - start)) / ITERATIONS; > > This calculation uses integer arithmetic, which will round down the resulting > latency. > Please use floating point arithmetic: latency = (end - start) / > (double)ITERATIONS; Yup. It is in patch 2 https://mails.dpdk.org/archives/dev/2024-September/301227.html > > > > > return latency; > > } > > @@ -137,8 +137,7 @@ test_lcore_var_access(void) > > > > - printf("Latencies [ns/update]\n"); > > + printf("Latencies [cycles/update]\n"); > > printf("Thread-local storage Static array Lcore variables\n"); > > - printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9, > > - sarray_latency * 1e9, lvar_latency * 1e9); > > + printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, > > lvar_latency); > > > > return TEST_SUCCESS; > > }