On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <m...@smartsharesystems.com> 
wrote:
>
> > From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> > Sent: Friday, 13 September 2024 13.24
> >
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se>
> > wrote:
> > >
> > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hof...@lysator.liu.se>
> > wrote:
> > > >>
> > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > >>> <mattias.ronnb...@ericsson.com> wrote:
> > > >>>> +static double
> > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > (*update_fun)(void))
> > > >>>> +{
> > > >>>> +       uint64_t i;
> > > >>>> +       uint64_t start;
> > > >>>> +       uint64_t end;
> > > >>>> +       double latency;
> > > >>>> +
> > > >>>> +       init_fun();
> > > >>>> +
> > > >>>> +       start = rte_get_timer_cycles();
> > > >>>> +
> > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > >>>> +               update_fun();
> > > >>>> +
> > > >>>> +       end = rte_get_timer_cycles();
> > > >>>
> > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > >>
> > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > >
> > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > regressions.
> > >
> > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > matter if you add or remove a few cycles at the beginning and the end.
> > >
> > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > more serialization), and worse (because it's more costly, and thus more
> > > intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> > >
> > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
>
> No, Mattias is right. The time is sampled once before the loop, then the 
> function is executed 10 million (ITERATIONS) times in the loop, and then the 
> time is sampled once again.

No. I am not disagreeing. That why I said, “Yes. In this setup”.

All I am saying, there is a more accurate way of doing measurement for
this test along with “data” at
https://mails.dpdk.org/archives/dev/2024-September/301227.html


>
> So the overhead and accuracy of the timing function is amortized across the 
> 10 million calls to the function being measured, and becomes insignificant.
>
> Other perf tests also do it this way, and also use rte_get_timer_cycles(). 
> E.g. the mempool_perf test.
>
> Another detail: The for loop itself may cost a few cycles, which may not be 
> irrelevant when measuring a function using very few cycles. If the compiler 
> doesn't unroll the loop, it should be done manually:
>
>         for (i = 0; i < ITERATIONS / 100; i++) {
>                 update_fun();
>                 update_fun();
>                 ... repeated 100 times

I have done a similar scheme for trace perf for inline function test
at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Either the above scheme or the below scheme needs to be used as
mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html

+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }




>         }
>
>
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >         end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
>
> This calculation uses integer arithmetic, which will round down the resulting 
> latency.
> Please use floating point arithmetic: latency = (end - start) / 
> (double)ITERATIONS;

Yup. It is in patch 2
https://mails.dpdk.org/archives/dev/2024-September/301227.html

>
> >
> >         return latency;
> >  }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >         printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >         return TEST_SUCCESS;
> >  }

Reply via email to