On 26/08/2022 23:06, Mattias Rönnblom wrote:
On 2022-08-25 17:28, Kevin Laatz wrote:
From: Anatoly Burakov <anatoly.bura...@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way, without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

There's no generic way, but the DSW event device keeps track of lcore utilization (i.e., the fraction of cycles used to perform actual work, as opposed to just polling empty queues), and it does so with the same basic principles as, from what it seems after a quick look, are used in this patch.


The poll busyness is calculated by relying on the fact that most DPDK APIs
will poll for packets. Empty polls can be counted as "idle", while

Lcore worker threads poll for work. Packets, timeouts, completions, event device events, etc.

Yes, the wording was too restrictive here - the patch includes changes to drivers and libraries such as dmadev, eventdev, ring, etc., that poll for work and would want to mark it as "idle" or "busy".



non-empty polls can be counted as busy. To measure lcore poll busyness, we

I guess what is meant here is that cycles spent after non-empty polls can be counted as busy (useful) cycles? Potentially including the cycles spent for the actual poll operation. ("Poll busyness" is a very vague term, in my opinion.)

Similarly, cycles spent after an empty poll would not be counted.

Correct, the generic functionality works this way. Any cycles between a "busy poll" and the next "idle poll" will be counted as busy/useful work (and vice versa).



simply call the telemetry timestamping function with the number of polls a particular code section has processed, and count the number of cycles we've spent processing empty bursts. The more empty bursts we encounter, the fewer
cycles we spend in the "busy" state, and the lower the reported core poll busyness
will be.


Is this the same scheme as DSW? Where a non-zero burst in the idle state means a transition from idle to busy, and a zero-burst poll in the busy state means a transition from busy to idle?

The issue with this scheme is that you might potentially end up with a state transition for every iteration of the application's main loop, if packets (or other items of work) only come in on one of the lcore's potentially many RX queues (or other input queues, such as eventdev ports). That means an rdtsc for every loop, which isn't too bad, but still might be noticeable.

An application that gathers items of work from multiple sources before actually doing anything breaks this model. For example, consider an lcore worker owning two RX queues, performing rte_eth_rx_burst() on both before attempting to process any of the received packets. If the last poll is empty, the cycles spent will be considered idle, even though they were busy.

An lcore worker might also decide to poll the same RX queue multiple times (until it hits an empty poll, or reaches some upper bound) before initiating processing of the packets.

Yes, more complex applications will need to be modified to gain a more fine-grained busyness metric. In order to achieve this level of accuracy, application context is required. The 'RTE_LCORE_POLL_BUSYNESS_TIMESTAMP()' macro can be used within the application to mark sections as "busy" or "not busy". Using your example above, the application could keep track of multiple bursts (whether they have work or not) and call the macro before initiating the processing, to signal that there is, in fact, work to be done. Something along the lines of the sketch below, for instance.
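(BURST_SIZE, port and process_packets() are placeholders here; the assumption is that the macro takes the number of items just gathered.)

struct rte_mbuf *bufs0[BURST_SIZE], *bufs1[BURST_SIZE];
uint16_t nb_rx0, nb_rx1;

nb_rx0 = rte_eth_rx_burst(port, 0, bufs0, BURST_SIZE);
nb_rx1 = rte_eth_rx_burst(port, 1, bufs1, BURST_SIZE);

/* mark the section busy if either queue returned work, rather than
 * relying on the per-poll accounting inside the instrumented drivers
 */
RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx0 + nb_rx1);

process_packets(bufs0, nb_rx0);
process_packets(bufs1, nb_rx1);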

There's a section in the documentation update in this patchset that describes it. It might need more work if it's not clear :-)



I didn't read your code in detail, so I might be jumping to conclusions.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore telemetry calls:

- All major driver APIs:
   - ethdev
   - cryptodev
   - compressdev
   - regexdev
   - bbdev
   - rawdev
   - eventdev
   - dmadev
- Some additional libraries:
   - ring
   - distributor

In the past, I've suggested this kind of functionality should go into the service framework instead, with the service function explicitly signaling whether or not the cycles were spent on something useful.

That seems to me like a more straightforward and more accurate solution, but it does require the application to deploy everything as services, and also requires a change of the service function signature.
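Something along these lines, hypothetically (this is not existing DPDK API; today's service callback takes only a void * argument and returns an int32_t):

/* hypothetical extension of the service callback, letting the service
 * itself report whether its cycles were spent on useful work
 */
enum rte_service_work {
    RTE_SERVICE_WORK_IDLE, /* polled, found nothing to do */
    RTE_SERVICE_WORK_BUSY, /* performed useful work */
};

typedef int32_t (*rte_service_func_v2)(void *args,
                                       enum rte_service_work *work);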


To avoid performance impact from having lcore telemetry support, a global variable is exported by EAL, and a call to the timestamping function is wrapped
in a macro, so that whenever telemetry is disabled, it only takes one

Use a static inline function if you don't need the additional expressive power of a macro.

I suggest you also mention the performance implications, when this function is enabled.

Sure, I can add a note in the next revision.



additional branch and no function calls are performed. It is also possible
to disable it at compile time by commenting out RTE_LCORE_POLL_BUSYNESS in the
build config.
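For illustration, the disabled path described above amounts to a wrapper of roughly this shape (the internal function name below is an assumption, not the patch's actual symbol):

extern int __rte_lcore_telemetry_enabled;

#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_items) do {        \
    /* single branch when disabled, no function call is made */ \
    if (__rte_lcore_telemetry_enabled)                           \
        __rte_lcore_poll_busyness_timestamp(nb_items);           \
} while (0)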

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.


Should there really be a dependency from the EAL to the telemetry library? That creates a cycle. Maybe some dependency inversion would be in order? The telemetry library could instead register an interest in getting busy/idle cycle reports from lcores.

Signed-off-by: Kevin Laatz <kevin.la...@intel.com>
Signed-off-by: Conor Walsh <conor.wa...@intel.com>
Signed-off-by: David Hunt <david.h...@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>

---
v3:
   * Fix missed renaming to poll busyness
   * Fix clang compilation
   * Fix arm compilation

v2:
   * Use rte_get_tsc_hz() to adjust the telemetry period
   * Rename to reflect polling busyness vs general busyness
   * Fix segfault when calling telemetry timestamp from an unregistered
     non-EAL thread.
   * Minor cleanup
---
  config/meson.build                          |   1 +
  config/rte_config.h                         |   1 +
  lib/bbdev/rte_bbdev.h                       |  17 +-
  lib/compressdev/rte_compressdev.c           |   2 +
  lib/cryptodev/rte_cryptodev.h               |   2 +
  lib/distributor/rte_distributor.c           |  21 +-
  lib/distributor/rte_distributor_single.c    |  14 +-
  lib/dmadev/rte_dmadev.h                     |  15 +-
  lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
  lib/eal/common/meson.build                  |   1 +
  lib/eal/include/rte_lcore.h                 |  80 ++++++
  lib/eal/meson.build                         |   3 +
  lib/eal/version.map                         |   7 +
  lib/ethdev/rte_ethdev.h                     |   2 +
  lib/eventdev/rte_eventdev.h                 |  10 +-
  lib/rawdev/rte_rawdev.c                     |   6 +-
  lib/regexdev/rte_regexdev.h                 |   5 +-
  lib/ring/rte_ring_elem_pvt.h                |   1 +
  meson_options.txt                           |   2 +
  19 files changed, 459 insertions(+), 24 deletions(-)
  create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

<snip>
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..bba0afc26d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;

Is "telemetry" really the term to use here? Isn't this just another piece of statistics? It can be used for telemetry, or in some other fashion.

(Use bool not int.)

Can rename to '__rte_lcore_stats_enabled' in next revision.



+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+    int poll_busyness;
+    /**< Calculated poll busyness (gets set/returned by the API) */
+    int raw_poll_busyness;
+    /**< Calculated poll busyness times 100. */
+    uint64_t interval_ts;
+    /**< when previous telemetry interval started */
+    uint64_t empty_cycles;
+    /**< empty cycle count since last interval */
+    uint64_t last_poll_ts;
+    /**< last poll timestamp */
+    bool last_empty;
+    /**< if last poll was empty */
+    unsigned int contig_poll_cnt;
+    /**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+    int lcore_id;
+
+    telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+    if (telemetry_data == NULL)
+        rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+    RTE_LCORE_FOREACH(lcore_id) {
+        struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+        td->interval_ts = 0;
+        td->last_poll_ts = 0;
+        td->empty_cycles = 0;
+        td->last_empty = true;
+        td->contig_poll_cnt = 0;
+        td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+        td->raw_poll_busyness = 0;
+    }
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+    const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
+    struct lcore_telemetry *tdata;
+
+    if (lcore_id >= RTE_MAX_LCORE)
+        return -EINVAL;
+    tdata = &telemetry_data[lcore_id];
+
+    /* if the lcore is not active */
+    if (tdata->interval_ts == 0)
+        return LCORE_POLL_BUSYNESS_NOT_SET;
+    /* if the core hasn't been active in a while */
+    else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+        return LCORE_POLL_BUSYNESS_NOT_SET;
+
+    /* this core is active, report its poll busyness */
+    return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+    return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable)

Use bool.

+{
+    __rte_lcore_telemetry_enabled = !!enable;

!!Another reason to use bool!! :)

If you're allowed to call this function during operation, you'll need an atomic store here (and an atomic load on the read side).

Ack



+
+    if (!enable)
+        lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+                    const uint64_t empty, const uint64_t total)
+{
+    /*
+     * we don't want to use floating point math here, but we want for our poll
+     * busyness to react smoothly to sudden changes, while still keeping the
+     * accuracy and making sure that over time the average follows poll busyness
+     * as measured just-in-time. therefore, we will calculate the average poll
+     * busyness using integer math, but shift the decimal point two places
+     * to the right, so that 100.0 becomes 10000. this allows us to report
+     * integer values (0..100) while still allowing ourselves to follow the
+     * just-in-time measurements when we calculate our averages.
+     */
+    const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+

Why not just store/manage the number of busy (or idle, or both) cycles? Then the user can decide what time period to average over, to what extent the lcore utilization from previous periods should be factored in, etc.

There's an option 'RTE_LCORE_POLL_BUSYNESS_PERIOD_MS' added to rte_config.h which would allow the user to define the time period over which the utilization should be reported. We only do this calculation if that time interval has elapsed.



In DSW, I initially presented only a load statistic (which averaged over 250 us, with some contribution from previous periods). I later came to realize that just exposing the number of busy cycles left the calling application with many more options. For example, to present the average load during 1 s, you needed some control thread sampling the load statistic during that time period; whereas once the busy cycles statistic was introduced, it just had to read that value twice (at the beginning of the period and at the end) and compare it with the amount of wallclock time passed.
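With a raw busy-cycle counter, a monitoring thread can derive utilization over any window it likes, e.g. (rte_lcore_busy_cycles() is a hypothetical accessor, not existing API):

/* sample the counter twice and compare against the wallclock TSC delta;
 * rte_lcore_busy_cycles() is hypothetical
 */
uint64_t busy_start = rte_lcore_busy_cycles(lcore_id);
uint64_t tsc_start = rte_rdtsc();

sleep(1); /* whatever window the caller chooses */

uint64_t busy_end = rte_lcore_busy_cycles(lcore_id);
uint64_t tsc_end = rte_rdtsc();

double utilization = (double)(busy_end - busy_start) /
                     (double)(tsc_end - tsc_start);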

<snip>
