> From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> Sent: Monday, 16 September 2024 10.12
> 
> On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <m...@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> > > Sent: Friday, 13 September 2024 13.24
> > >
> > > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se>
> > > wrote:
> > > >
> > > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom
> <hof...@lysator.liu.se>
> > > wrote:
> > > > >>
> > > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > > >>> <mattias.ronnb...@ericsson.com> wrote:
> > > > >>>> +static double
> > > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > > (*update_fun)(void))
> > > > >>>> +{
> > > > >>>> +       uint64_t i;
> > > > >>>> +       uint64_t start;
> > > > >>>> +       uint64_t end;
> > > > >>>> +       double latency;
> > > > >>>> +
> > > > >>>> +       init_fun();
> > > > >>>> +
> > > > >>>> +       start = rte_get_timer_cycles();
> > > > >>>> +
> > > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > > >>>> +               update_fun();
> > > > >>>> +
> > > > >>>> +       end = rte_get_timer_cycles();
> > > > >>>
> > > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > > >>
> > > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > > >
> > > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > > regressions.
> > > >
> > > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > > matter if you add or remove a few cycles at the beginning and the end.
> > > >
> > > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > > more serialization), and worse (because it's more costly, and thus more
> > > > intrusive).
> > >
> > > We can calibrate the overhead to remove the cost.
> > >
> > > >
> > > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > > doesn't matter.
> > >
> > > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > > refer to the below patch to see the difference.
> >
> > No, Mattias is right. The time is sampled once before the loop, then the
> function is executed 10 million (ITERATIONS) times in the loop, and then the
> time is sampled once again.
> 
> No. I am not disagreeing. That why I said, “Yes. In this setup”.

Sorry, I misunderstood. Then we're all on the same page here. :-)

> 
> All I am saying, there is a more accurate way of doing measurement for
> this test along with “data” at
> https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> 
> >
> > So the overhead and accuracy of the timing function is amortized across the
> 10 million calls to the function being measured, and becomes insignificant.
> >
> > Other perf tests also do it this way, and also use rte_get_timer_cycles().
> E.g. the mempool_perf test.
> >
> > Another detail: The for loop itself may cost a few cycles, which may not be
> irrelevant when measuring a function using very few cycles. If the compiler
> doesn't unroll the loop, it should be done manually:
> >
> >         for (i = 0; i < ITERATIONS / 100; i++) {
> >                 update_fun();
> >                 update_fun();
> >                 ... repeated 100 times
> 
> I have done a similar scheme for trace perf for inline function test
> at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Nice macro. :-)

> 
> Either the above scheme or the below scheme needs to be used as
> mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> +       for (i = 0; i < ITERATIONS; i++) {
> +               start = rte_rdtsc_precise();
>                 update_fun();
> +               end = rte_rdtsc_precise();
> +               latency += (end - start) - tsc_latency;
> +       }
> 

I prefer reading the timestamps outside the loop.
If there is any jitter in the execution time (or cycles used) by 
rte_rdtsc_precise(), it gets amortized when used outside the loop. If used 
inside the loop, the jitter adds up, and may affect the result.

On the other hand, I guess using rte_rdtsc_precise() inside the loop may show 
different results, due to its memory barriers. I don't know; just speculating.

Maybe we want to use both methods to measure this? Considering that we are 
measuring the time to access frequently used variables in hot parts of the 
code, as implemented by three different design patterns. Performance here is 
quite important.

And if we want to subtract the overhead from rte_rdtsc_precise() itself - which 
I think is a good idea if used inside the loop - we probably need another loop 
to measure that, rather than just calling it twice and subtracting the returned 
values.

> 
> 
> 
> >         }
> >
> >
> > >
> > > Patch 1: Make nanoseconds to cycles per iteration
> > > ------------------------------------------------------------------
> > >
> > > diff --git a/app/test/test_lcore_var_perf.c
> b/app/test/test_lcore_var_perf.c
> > > index ea1d7ba90b52..b8d25400f593 100644
> > > --- a/app/test/test_lcore_var_perf.c
> > > +++ b/app/test/test_lcore_var_perf.c
> > > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > > void (*update_fun)(void))
> > >
> > >         end = rte_get_timer_cycles();
> > >
> > > -       latency = ((end - start) / (double)rte_get_timer_hz()) /
> ITERATIONS;
> > > +       latency = ((end - start)) / ITERATIONS;
> >
> > This calculation uses integer arithmetic, which will round down the
> resulting latency.
> > Please use floating point arithmetic: latency = (end - start) /
> (double)ITERATIONS;
> 
> Yup. It is in patch 2
> https://mails.dpdk.org/archives/dev/2024-September/301227.html

Yep; my comment was mostly meant for Mattias, if he switched from nanoseconds 
to cycles, to remember using floating point calculation here.

> 
> >
> > >
> > >         return latency;
> > >  }
> > > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> > >
> > > -       printf("Latencies [ns/update]\n");
> > > +       printf("Latencies [cycles/update]\n");
> > >         printf("Thread-local storage  Static array  Lcore variables\n");
> > > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > > lvar_latency);
> > >
> > >         return TEST_SUCCESS;
> > >  }

Reply via email to