On 26/08/2022 23:06, Mattias Rönnblom wrote:
On 2022-08-25 17:28, Kevin Laatz wrote:
From: Anatoly Burakov <anatoly.bura...@intel.com>
Currently, there is no way to measure lcore poll busyness in a passive way, without any modifications to the application. This patch adds a new EAL API that will be able to passively track core polling busyness.
There's no generic way, but the DSW event device keeps track of lcore utilization (i.e., the fraction of cycles used to perform actual work, as opposed to just polling empty queues), and it does so with the same basic principles as, from what it seems after a quick look, are used in this patch.
The poll busyness is calculated by relying on the fact that most DPDK APIs will poll for packets. Empty polls can be counted as "idle", while
Lcore worker threads poll for work: packets, timeouts, completions, event device events, etc.
Yes, the wording was too restrictive here - the patch includes changes to drivers and libraries such as dmadev, eventdev, ring, etc. that poll for work and would want to mark it as "idle" or "busy".
non-empty polls can be counted as busy. To measure lcore poll
busyness, we
I guess what is meant here is that cycles spent after non-empty polls can be counted as busy (useful) cycles? Potentially including the cycles spent for the actual poll operation. ("Poll busyness" is a very vague term, in my opinion.)
Similarly, cycles spent after an empty poll would not be counted.
Correct, the generic functionality works this way. Any cycles between a "busy poll" and the next "idle poll" will be counted as busy/useful work (and vice versa).
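As a rough sketch of that accounting (the function below is an illustrative approximation, not the patch's exact code; it reuses the lcore_telemetry fields shown further down in the diff):

static inline void
poll_busyness_timestamp(struct lcore_telemetry *td, unsigned int nb_work)
{
	const uint64_t now = rte_rdtsc();
	const uint64_t diff = now - td->last_poll_ts;

	/* cycles elapsed since the previous poll inherit that poll's
	 * state: everything after an empty poll counts as idle,
	 * everything after a non-empty poll counts as busy
	 */
	if (td->last_empty)
		td->empty_cycles += diff;

	td->last_empty = (nb_work == 0);
	td->last_poll_ts = now;
}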
simply call the telemetry timestamping function with the number of polls a particular code section has processed, and count the number of cycles we've spent processing empty bursts. The more empty bursts we encounter, the fewer cycles we spend in the "busy" state, and the lower the reported core poll busyness will be.
Is this the same scheme as DSW? Where a non-zero burst poll in idle state means a transition from idle to busy, and a zero burst poll in busy state means a transition from busy to idle?
The issue with this scheme is that you might potentially end up with a state transition for every iteration of the application's main loop, if packets (or other items of work) only come in on one of the lcore's potentially many RX queues (or other input queues, such as eventdev ports). That means an rdtsc for every loop, which isn't too bad, but still might be noticeable.
An application that gathers items of work from multiple sources before actually doing anything breaks this model. For example, consider an lcore worker owning two RX queues, performing rte_eth_rx_burst() on both before attempting to process any of the received packets. If the last poll is empty, the cycles spent will be considered idle, even though they were busy.
An lcore worker might also decide to poll the same RX queue multiple times (until it hits an empty poll, or reaches some high upper bound) before initiating processing of the packets.
Yes, more complex applications will need to be modified to gain a more fine-grained busyness metric. In order to achieve this level of accuracy, application context is required. The 'RTE_LCORE_POLL_BUSYNESS_TIMESTAMP()' macro can be used within the application to mark sections as "busy" or "not busy". Using your example above, the application could keep track of multiple bursts (whether they have work or not) and call the macro once before initiating the processing, to signal that there is, in fact, work to be done.
There's a section in the documentation update in this patchset that describes it. It might need more work if it's not clear :-)
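For illustration, a minimal sketch of that pattern (port_id, the burst size and process_packets() are placeholders):

struct rte_mbuf *bufs[32];
uint16_t nb_rx = 0;

/* gather from both RX queues before deciding busy vs. idle */
nb_rx += rte_eth_rx_burst(port_id, 0, &bufs[0], 16);
nb_rx += rte_eth_rx_burst(port_id, 1, &bufs[nb_rx], 16);

/* a single timestamp call with the combined count, so a trailing
 * empty poll cannot misclassify the processing cycles as idle
 */
RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);

if (nb_rx > 0)
	process_packets(bufs, nb_rx);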
I didn't read your code in detail, so I might be jumping to conclusions.
In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts
of DPDK
are instrumented with lcore telemetry calls:
- All major driver APIs:
- ethdev
- cryptodev
- compressdev
- regexdev
- bbdev
- rawdev
- eventdev
- dmadev
- Some additional libraries:
- ring
- distributor
In the past, I've suggested this kind of functionality should go into the service framework instead, with the service function explicitly signaling whether or not the cycles were spent on something useful. That seems to me like a more straightforward and more accurate solution, but it does require the application to deploy everything as services, and also requires a change of the service function signature.
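Something along these lines (a hypothetical signature change, not the current rte_service API):

/* today: typedef int32_t (*rte_service_func)(void *args);
 * a hypothetical variant letting the function report how many items
 * of work it actually handled, so the framework can classify the
 * cycles spent as useful or not:
 */
typedef int32_t (*rte_service_func)(void *args, unsigned int *nb_work);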
To avoid performance impact from having lcore telemetry support, a global variable is exported by EAL, and the call to the timestamping function is wrapped into a macro, so that whenever telemetry is disabled, it only takes one
Use a static inline function if you don't need the additional expressive power of a macro.
I suggest you also mention the performance implications, when this
function is enabled.
Sure, I can add a note in the next revision.
additional branch and no function calls are performed. It is also possible to disable it at compile time by commenting out RTE_LCORE_POLL_BUSYNESS from the build config.
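The described wrapper would look roughly like this (a sketch; the exact macro body and internal function name in the patch may differ):

#ifdef RTE_LCORE_POLL_BUSYNESS
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_work) do {	\
	if (__rte_lcore_telemetry_enabled)			\
		rte_lcore_poll_busyness_timestamp(nb_work);	\
} while (0)
#else
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_work) do {} while (0)
#endif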
This patch also adds a telemetry endpoint to report lcore poll
busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the
usage
of the new telemetry endpoints and API.
Should there really be a dependency from the EAL to the telemetry library? That's a dependency cycle. Maybe some dependency inversion would be in order? The telemetry library could instead register an interest in getting busy/idle cycle reports from lcores.
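I.e., something like this hypothetical registration hook (a sketch only, not existing API):

/* EAL side: a report sink anyone interested can register */
typedef void (*rte_lcore_cycles_cb)(unsigned int lcore_id,
		uint64_t busy_cycles, uint64_t idle_cycles);

int rte_lcore_cycles_cb_register(rte_lcore_cycles_cb cb);

The telemetry library would then register its callback at init time, and EAL would no longer need to include rte_telemetry.h.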
Signed-off-by: Kevin Laatz <kevin.la...@intel.com>
Signed-off-by: Conor Walsh <conor.wa...@intel.com>
Signed-off-by: David Hunt <david.h...@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
---
v3:
* Fix missed renaming to poll busyness
* Fix clang compilation
* Fix arm compilation
v2:
* Use rte_get_tsc_hz() to adjust the telemetry period
* Rename to reflect polling busyness vs general busyness
* Fix segfault when calling telemetry timestamp from an unregistered
non-EAL thread.
* Minor cleanup
---
config/meson.build | 1 +
config/rte_config.h | 1 +
lib/bbdev/rte_bbdev.h | 17 +-
lib/compressdev/rte_compressdev.c | 2 +
lib/cryptodev/rte_cryptodev.h | 2 +
lib/distributor/rte_distributor.c | 21 +-
lib/distributor/rte_distributor_single.c | 14 +-
lib/dmadev/rte_dmadev.h | 15 +-
lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
lib/eal/common/meson.build | 1 +
lib/eal/include/rte_lcore.h | 80 ++++++
lib/eal/meson.build | 3 +
lib/eal/version.map | 7 +
lib/ethdev/rte_ethdev.h | 2 +
lib/eventdev/rte_eventdev.h | 10 +-
lib/rawdev/rte_rawdev.c | 6 +-
lib/regexdev/rte_regexdev.h | 5 +-
lib/ring/rte_ring_elem_pvt.h | 1 +
meson_options.txt | 2 +
19 files changed, 459 insertions(+), 24 deletions(-)
create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
<snip>
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..bba0afc26d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
Is "telemetry" really the term to use here? Isn't this just another
piece of statistics? It can be used for telemetry, or in some other
fashion.
(Use bool, not int.)
Can rename to '__rte_lcore_stats_enabled' in next revision.
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+ int poll_busyness;
+ /**< Calculated poll busyness (gets set/returned by the API) */
+ int raw_poll_busyness;
+ /**< Calculated poll busyness times 100. */
+ uint64_t interval_ts;
+ /**< when previous telemetry interval started */
+ uint64_t empty_cycles;
+ /**< empty cycle count since last interval */
+ uint64_t last_poll_ts;
+ /**< last poll timestamp */
+ bool last_empty;
+ /**< if last poll was empty */
+ unsigned int contig_poll_cnt;
+ /**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+ int lcore_id;
+
+ telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+ if (telemetry_data == NULL)
+ rte_panic("Could not init lcore telemetry data: Out of
memory\n");
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+ td->interval_ts = 0;
+ td->last_poll_ts = 0;
+ td->empty_cycles = 0;
+ td->last_empty = true;
+ td->contig_poll_cnt = 0;
+ td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+ td->raw_poll_busyness = 0;
+ }
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+ const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
+ struct lcore_telemetry *tdata;
+
+ if (lcore_id >= RTE_MAX_LCORE)
+ return -EINVAL;
+ tdata = &telemetry_data[lcore_id];
+
+ /* if the lcore is not active */
+ if (tdata->interval_ts == 0)
+ return LCORE_POLL_BUSYNESS_NOT_SET;
+ /* if the core hasn't been active in a while */
+ else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+ return LCORE_POLL_BUSYNESS_NOT_SET;
+
+ /* this core is active, report its poll busyness */
+ return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+ return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable)
Use bool.
+{
+ __rte_lcore_telemetry_enabled = !!enable;
!!Another reason to use bool!! :)
If you're allowed to call this function during operation, you'll need an atomic store here (and an atomic load on the read side).
Ack
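I.e., roughly this (illustrative, using the GCC __atomic builtins DPDK already uses elsewhere):

/* writer, in rte_lcore_poll_busyness_enabled_set(): */
__atomic_store_n(&__rte_lcore_telemetry_enabled, !!enable,
		__ATOMIC_RELAXED);

/* reader, in the timestamping fast path: */
if (__atomic_load_n(&__rte_lcore_telemetry_enabled, __ATOMIC_RELAXED))
	rte_lcore_poll_busyness_timestamp(nb_work);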
+
+ if (!enable)
+ lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+ const uint64_t empty, const uint64_t total)
+{
+ /*
+ * we don't want to use floating point math here, but we want for our poll
+ * busyness to react smoothly to sudden changes, while still keeping the
+ * accuracy and making sure that over time the average follows poll
+ * busyness as measured just-in-time. therefore, we will calculate the
+ * average poll busyness using integer math, but shift the decimal point
+ * two places to the right, so that 100.0 becomes 10000. this allows us to
+ * report integer values (0..100) while still allowing ourselves to follow
+ * the just-in-time measurements when we calculate our averages.
+ */
+ const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
Why not just store/manage the number of busy (or idle, or both)
cycles? Then the user can decide what time period to average over, to
what extent the lcore utilization from previous periods should be
factored in, etc.
There's an option 'RTE_LCORE_POLL_BUSYNESS_PERIOD_MS' added to
rte_config.h which would allow the user to define the time period over
which the utilization should be reported. We only do this calculation if
that time interval has elapsed.
In DSW, I initially presented only a load statistic (which averaged over 250 us, with some contribution from the previous period). I later came to realize that just exposing the number of busy cycles left the calling application many more options. For example, to present the average load during 1 s, you needed to have some control thread sampling the load statistic during that time period; with the busy cycles statistic, the application just has to read that value twice (at the beginning of the period, and at the end) and compare it with the amount of wallclock time passed.
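In code, the busy-cycles variant is as simple as (accessor name hypothetical):

uint64_t busy0 = lcore_busy_cycles(lcore_id);	/* hypothetical */
uint64_t t0 = rte_rdtsc();

/* ... any period of wallclock time later ... */

uint64_t busy1 = lcore_busy_cycles(lcore_id);
uint64_t t1 = rte_rdtsc();

double load = (double)(busy1 - busy0) / (double)(t1 - t0);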
<snip>