On 2022-07-11 15:18, Harry van Haaren wrote:
This commit improves the performance reporting of the service
cores polling loop to show both with and without statistics
collection modes. Collecting cycle statistics is costly, due
to calls to rte_rdtsc() per service iteration.

That is true for a service deployed on only a single core. For multi-core services, non-rdtsc-related overhead dominates. For example, if the service is deployed on 11 cores, the extra statistics-related overhead is ~1000 cc/service call on x86_64. 2x rdtsc shouldn't be more than ~50 cc.


Reported-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
Suggested-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Suggested-by: Morten Brørup <m...@smartsharesystems.com>
Signed-off-by: Harry van Haaren <harry.van.haa...@intel.com>

---

This is split out as a separate patch from the fix to allow
measuring the before/after of the service stats atomic fixup.
---
  app/test/test_service_cores.c | 36 ++++++++++++++++++++++++-----------
  1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ced6ed0081..7415b6b686 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -777,6 +777,22 @@ service_run_on_app_core_func(void *arg)
        return rte_service_run_iter_on_app_lcore(*delay_service_id, 1);
  }
+static float
+service_app_lcore_perf_measure(uint32_t id)
+{
+       /* Performance test: call in a loop, and measure tsc() */
+       const uint32_t perf_iters = (1 << 12);
+       uint64_t start = rte_rdtsc();
+       uint32_t i;
+       for (i = 0; i < perf_iters; i++) {
+               int err = service_run_on_app_core_func(&id);

In a real-world scenario, the latency of this function isn't representative of the overall service core overhead.

For example, consider a scenario where an lcore has a single service mapped to it. rte_service.c will call service_run() 64 times, but only one of those calls will be a "hit" that actually runs the service. One iteration of the service loop costs ~600 cc on a machine where this performance benchmark reports 128 cc (both with statistics disabled).

For low-latency services, this is a significant overhead.
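
To illustrate where the cost of the misses comes from, here is a minimal sketch of such a polling loop. It is a toy model, not the rte_service.c code: the sketch_* names, the struct, and the 64-bit mask representation are all made up for illustration. It only shows that every slot is visited and checked on each iteration, even when a single service is mapped to the lcore.

```c
#include <stdint.h>

/* Hypothetical model for illustration; not the rte_service.c code. */
#define SKETCH_MAX_SERVICES 64

struct sketch_counters {
	uint32_t calls; /* times the per-slot run function was entered */
	uint32_t hits;  /* times a service callback would actually run */
};

/* Per-slot check: returns 1 if the slot is mapped to this lcore. */
static int
sketch_service_run(uint32_t slot, uint64_t lcore_mask,
		struct sketch_counters *c)
{
	c->calls++;
	if ((lcore_mask & (UINT64_C(1) << slot)) == 0)
		return 0; /* miss: the call and the check were still paid for */
	c->hits++;
	return 1;
}

/* One iteration of the polling loop: visit every slot. */
static void
sketch_poll_once(uint64_t lcore_mask, struct sketch_counters *c)
{
	uint32_t slot;

	for (slot = 0; slot < SKETCH_MAX_SERVICES; slot++)
		sketch_service_run(slot, lcore_mask, c);
}
```

With only one bit set in the mask, one call to sketch_poll_once() enters sketch_service_run() 64 times for a single hit. The benchmark in this patch measures only the hit path.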

+               TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure");
+       }
+       uint64_t end = rte_rdtsc();
+
+       return (end - start)/(float)perf_iters;
+}
+
  static int
  service_app_lcore_poll_impl(const int mt_safe)
  {
@@ -828,17 +844,15 @@ service_app_lcore_poll_impl(const int mt_safe)
                                "MT Unsafe: App core1 didn't return -EBUSY");
        }
-       /* Performance test: call in a loop, and measure tsc() */
-       const uint32_t perf_iters = (1 << 12);
-       uint64_t start = rte_rdtsc();
-       uint32_t i;
-       for (i = 0; i < perf_iters; i++) {
-               int err = service_run_on_app_core_func(&id);
-               TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure");
-       }
-       uint64_t end = rte_rdtsc();
-       printf("perf test for %s: %0.1f cycles per call\n", mt_safe ?
-               "MT Safe" : "MT Unsafe", (end - start)/(float)perf_iters);
+       /* Measure performance of no-stats and with-stats. */
+       float cyc_no_stats = service_app_lcore_perf_measure(id);
+
+       TEST_ASSERT_EQUAL(0, rte_service_set_stats_enable(id, 1),
+                               "failed to enable stats for service.");
+       float cyc_with_stats = service_app_lcore_perf_measure(id);
+
+       printf("perf test for %s, no stats: %0.1f, with stats %0.1f cycles/call\n",
+               mt_safe ? "MT Safe" : "MT Unsafe", cyc_no_stats, cyc_with_stats);
        unregister_all();
        return TEST_SUCCESS;
