> From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> Sent: Friday, 13 September 2024 13.24
> 
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se>
> wrote:
> >
> > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hof...@lysator.liu.se>
> wrote:
> > >>
> > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > >>> <mattias.ronnb...@ericsson.com> wrote:
> > >>>> +static double
> > >>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> > >>>> +{
> > >>>> +       uint64_t i;
> > >>>> +       uint64_t start;
> > >>>> +       uint64_t end;
> > >>>> +       double latency;
> > >>>> +
> > >>>> +       init_fun();
> > >>>> +
> > >>>> +       start = rte_get_timer_cycles();
> > >>>> +
> > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > >>>> +               update_fun();
> > >>>> +
> > >>>> +       end = rte_get_timer_cycles();
> > >>>
> > >>> Use the precise variant, rte_rdtsc_precise() or so, to be accurate.
> > >>
> > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > >
> > > I was thinking in another way: with 1e7 iterations, the additional
> > > barrier on precise will be amortized, and we get more _deterministic_
> > > behavior, especially if we print cycles and if we need to catch
> > > regressions.
> >
> > If you time a section of code which spends ~40000000 cycles, it doesn't
> > matter if you add or remove a few cycles at the beginning and the end.
> >
> > The rte_rdtsc_precise() is both better (more precise in the sense of
> > more serialization), and worse (because it's more costly, and thus more
> > intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
> >
> > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > doesn't matter.
> 
> Yes. In this setup it is pretty inaccurate PER iteration. Please
> refer to the patch below to see the difference.

No, Mattias is right. The time is sampled once before the loop, then the 
function is executed 10 million (ITERATIONS) times in the loop, and then the 
time is sampled once again.

So the overhead and inaccuracy of the timing function are amortized across the 
10 million calls to the function being measured, and become insignificant.
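
For illustration, here is the pattern in isolation (a sketch only; the real 
test passes update_fun as a function pointer). Even if rte_get_timer_cycles() 
itself cost, say, 100 cycles per read, that error is split over ITERATIONS 
calls and vanishes in the per-call result:

        uint64_t i;
        uint64_t start, end;
        double cycles_per_call;

        start = rte_get_timer_cycles();

        for (i = 0; i < ITERATIONS; i++)
                update_fun();

        end = rte_get_timer_cycles();

        /* The two timer reads are paid once, not per call, so they add
         * only (read overhead / ITERATIONS) to the per-call figure;
         * with ITERATIONS = 1e7 that is negligible. */
        cycles_per_call = (end - start) / (double)ITERATIONS;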

Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. 
the mempool_perf test.

Another detail: The for loop itself also costs a few cycles, which may not be 
negligible when measuring a function that uses only very few cycles. If the 
compiler doesn't unroll the loop, it should be done manually:

        for (i = 0; i < ITERATIONS / 100; i++) {
                update_fun();
                update_fun();
                ... repeated 100 times
        }
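
If writing the call out 100 times by hand is too ugly, a small repeat macro 
can do the unrolling; a sketch, assuming an unroll factor of 100 (REP10 and 
REP100 are made-up names for illustration, not existing DPDK macros):

        #define REP10(x) x x x x x x x x x x
        #define REP100(x) REP10(REP10(x))

        for (i = 0; i < ITERATIONS / 100; i++) {
                REP100(update_fun();)
        }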


> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> 
>         end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;

This calculation uses integer arithmetic, which will round down the resulting 
latency.
Please use floating point arithmetic:
latency = (end - start) / (double)ITERATIONS;
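
For reference, the two variants side by side (a sketch; latency stays a double 
either way):

        /* cycles per update, as in the proposed patch */
        latency = (end - start) / (double)ITERATIONS;

        /* seconds per update, as in the original code
         * (multiplied by 1e9 at print time to get nanoseconds) */
        latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;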

> 
>         return latency;
>  }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>         printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);
> 
>         return TEST_SUCCESS;
>  }
