On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
>
> On 2024-09-12 17:11, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>
> >>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>> that the overhead isn't significantly greater than alternative
> >>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>> (i.e., when plenty of cache is available compared to the working set
> >>>> size of the per-lcore data).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>> ---
> >>>>    app/test/meson.build           |   1 +
> >>>>    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>    2 files changed, 161 insertions(+)
> >>>>    create mode 100644 app/test/test_lcore_var_perf.c
> >>>
> >>>
> >>>> +static double
> >>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>> +{
> >>>> +       uint64_t i;
> >>>> +       uint64_t start;
> >>>> +       uint64_t end;
> >>>> +       double latency;
> >>>> +
> >>>> +       init_fun();
> >>>> +
> >>>> +       start = rte_get_timer_cycles();
> >>>> +
> >>>> +       for (i = 0; i < ITERATIONS; i++)
> >>>> +               update_fun();
> >>>> +
> >>>> +       end = rte_get_timer_cycles();
> >>>
> >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>
> >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >
> > I was thinking of it another way: with 1e7 iterations, the additional
> > barrier in the precise variant will be amortized, and we get more
> > _deterministic_ behavior, especially if we print cycles and need to
> > catch regressions.
>
> If you time a section of code which spends ~40000000 cycles, it doesn't
> matter if you add or remove a few cycles at the beginning and the end.
>
> The rte_rdtsc_precise() is both better (more precise in the sense of
> more serialization), and worse (because it's more costly, and thus more
> intrusive).

We can calibrate the overhead to remove the cost.
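
A minimal sketch of what such a calibration could look like, as an illustration only. It takes the minimum over several back-to-back timer reads rather than a single pair, which is a swapped-in refinement and not what the patch in this mail does. read_timer_ns(), calibrate_timer_overhead() and subtract_overhead() are hypothetical names; clock_gettime() is used as a portable stand-in for rte_rdtsc_precise() so the sketch builds without DPDK.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for rte_rdtsc_precise(): monotonic time in ns. */
static uint64_t
read_timer_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/*
 * Smallest delta observed between two back-to-back timer reads.
 * Returns UINT64_MAX if samples is 0.
 */
static uint64_t
calibrate_timer_overhead(unsigned int samples)
{
	uint64_t min_delta = UINT64_MAX;
	unsigned int i;

	for (i = 0; i < samples; i++) {
		uint64_t start = read_timer_ns();
		uint64_t end = read_timer_ns();

		if (end - start < min_delta)
			min_delta = end - start;
	}

	return min_delta;
}

/* Subtract the calibrated overhead from a measurement, clamping at zero. */
static uint64_t
subtract_overhead(uint64_t measured, uint64_t overhead)
{
	return measured > overhead ? measured - overhead : 0;
}
```

Taking the minimum keeps a stray interrupt or cache miss during calibration from inflating the overhead estimate, and the clamp avoids unsigned wrap-around when a sample happens to be shorter than the calibrated overhead.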

>
> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> doesn't matter.

Yes, but in this setup the result is quite inaccurate PER iteration.
Please refer to the patches below to see the difference.

Patch 1: Report cycles per iteration instead of nanoseconds
------------------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..b8d25400f593 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))

        end = rte_get_timer_cycles();

-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+       latency = ((end - start)) / ITERATIONS;

        return latency;
 }
@@ -137,8 +137,7 @@ test_lcore_var_access(void)

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);

        return TEST_SUCCESS;
 }
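
A side note on the division in patch 1, as a hedged sketch. The definition of ITERATIONS is not shown in this mail, so whether it is an integer or a floating-point constant is an assumption here. If it is an integer, `(end - start) / ITERATIONS` is an integer division and the fractional part is dropped before the result is stored in the double. The two helper names below are hypothetical and only illustrate the difference.

```c
#include <stdint.h>

/* Integer division: the fractional part is discarded before conversion. */
static double
cycles_per_iter_truncated(uint64_t total_cycles, uint64_t iterations)
{
	return total_cycles / iterations;
}

/* Floating-point division: the fractional part is kept. */
static double
cycles_per_iter_exact(uint64_t total_cycles, uint64_t iterations)
{
	return (double)total_cycles / (double)iterations;
}
```

For example, 75 total cycles over 10 iterations gives 7.0 with the first helper and 7.5 with the second.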


Patch 2: Change to precise with calibration
-----------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..8142ecd56241 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -96,23 +96,28 @@ lvar_update(void)
 static double
 benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
 {
-       uint64_t i;
+       double tsc_latency;
+       double latency;
        uint64_t start;
        uint64_t end;
-       double latency;
+       uint64_t i;

-       init_fun();
+       /* calculate rte_rdtsc_precise overhead */
+       start = rte_rdtsc_precise();
+       end = rte_rdtsc_precise();
+       tsc_latency = (end - start);

-       start = rte_get_timer_cycles();
+       init_fun();

-       for (i = 0; i < ITERATIONS; i++)
+       latency = 0;
+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }

-       end = rte_get_timer_cycles();
-
-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
-
-       return latency;
+       return latency / (double)ITERATIONS;
 }

 static int
@@ -135,10 +140,9 @@ test_lcore_var_access(void)
        sarray_latency = benchmark_access_method(sarray_init, sarray_update);
        lvar_latency = benchmark_access_method(lvar_init, lvar_update);

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);

        return TEST_SUCCESS;
 }

ARM N2 core with patch 1 (aka current scheme)
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 7.0           7.0              7.0


ARM N2 core with patch 2
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                11.4          15.5             15.5

x86 i9 core with patch 1 (aka current scheme)
------------------------------------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 5.0           6.0              6.0

x86 i9 core with patch 2
--------------------------------
 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 5.3          10.6             11.7





>
> > Furthermore, you may consider replacing rte_random() in fast path to
> > running number or so if it is not deterministic in cycle computation.
>
> rte_rand() is not used in the fast path. I don't understand what you
> mean by "running number".

I missed that. Ignore this comment.
