On Fri, Jul 15, 2022 at 07:16:17PM +0530, Jerin Jacob wrote: > On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov > <anatoly.bura...@intel.com> wrote: > > > > Currently, there is no way to measure lcore busyness in a passive way, > > without any modifications to the application. This patch adds a new EAL > > API that will be able to passively track core busyness. > > > > The busyness is calculated by relying on the fact that most DPDK API's > > will poll for packets. Empty polls can be counted as "idle", while > > non-empty polls can be counted as busy. To measure lcore busyness, we > > simply call the telemetry timestamping function with the number of polls > > a particular code section has processed, and count the number of cycles > > we've spent processing empty bursts. The more empty bursts we encounter, > > the less cycles we spend in "busy" state, and the less core busyness > > will be reported. > > > > In order for all of the above to work without modifications to the > > application, the library code needs to be instrumented with calls to > > the lcore telemetry busyness timestamping function. The following parts > > of DPDK are instrumented with lcore telemetry calls: > > > > - All major driver API's: > > - ethdev > > - cryptodev > > - compressdev > > - regexdev > > - bbdev > > - rawdev > > - eventdev > > - dmadev > > - Some additional libraries: > > - ring > > - distributor > > > > To avoid performance impact from having lcore telemetry support, a > > global variable is exported by EAL, and a call to timestamping function > > is wrapped into a macro, so that whenever telemetry is disabled, it only > > takes one additional branch and no function calls are performed. It is > > also possible to disable it at compile time by commenting out > > RTE_LCORE_BUSYNESS from build config. > > > > This patch also adds a telemetry endpoint to report lcore busyness, as > > well as telemetry endpoints to enable/disable lcore telemetry. > > > > Signed-off-by: Kevin Laatz <kevin.la...@intel.com> > > Signed-off-by: Conor Walsh <conor.wa...@intel.com> > > Signed-off-by: David Hunt <david.h...@intel.com> > > Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com> > > > It is a good feature. Thanks for this work. >
+1 Some follow-up comments inline below. /Bruce > > > --- > > > > Notes: > > We did a couple of quick smoke tests to see if this patch causes any > > performance > > degradation, and it seemed to have none that we could measure. > > Telemetry can be > > disabled at compile time via a config option, while at runtime it can be > > disabled, seemingly at a cost of one additional branch. > > > > That said, our benchmarking efforts were admittedly not very rigorous, > > so > > comments welcome! > > > > > diff --git a/config/rte_config.h b/config/rte_config.h > > index 46549cb062..583cb6f7a5 100644 > > --- a/config/rte_config.h > > +++ b/config/rte_config.h > > @@ -39,6 +39,8 @@ > > #define RTE_LOG_DP_LEVEL RTE_LOG_INFO > > #define RTE_BACKTRACE 1 > > #define RTE_MAX_VFIO_CONTAINERS 64 > > +#define RTE_LCORE_BUSYNESS 1 > > Please don't enable debug features in fastpath as default. > I would disagree that this is a debug feature. The number of times I have heard from DPDK users that they wish to get more visibility into what the app is doing rather than seeing 100% cpu busy. Therefore, I'd see this as enabled by default rather than disabled - unless it's shown to have a performance regression. That said, since this impacts multiple components and it's something that end users might want to disable at build-time, I'd suggest moving it to meson_options.txt file and have it as an official DPDK build-time option. > > +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL > > > > + > > +#include <unistd.h> > > +#include <limits.h> > > +#include <string.h> > > + > > +#include <rte_common.h> > > +#include <rte_cycles.h> > > +#include <rte_errno.h> > > +#include <rte_lcore.h> > > + > > +#ifdef RTE_LCORE_BUSYNESS > > This clutter may not be required. Let it compile all cases. > > > +#include <rte_telemetry.h> > > +#endif > > + > > +int __rte_lcore_telemetry_enabled; > > + > > +#ifdef RTE_LCORE_BUSYNESS > > + > > +struct lcore_telemetry { > > + int busyness; > > + /**< Calculated busyness (gets set/returned by the API) */ > > + int raw_busyness; > > + /**< Calculated busyness times 100. */ > > + uint64_t interval_ts; > > + /**< when previous telemetry interval started */ > > + uint64_t empty_cycles; > > + /**< empty cycle count since last interval */ > > + uint64_t last_poll_ts; > > + /**< last poll timestamp */ > > + bool last_empty; > > + /**< if last poll was empty */ > > + unsigned int contig_poll_cnt; > > + /**< contiguous (always empty/non empty) poll counter */ > > +} __rte_cache_aligned; > > +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE]; > > Allocate this from hugepage. > Yes, whether or not it's allocated from hugepages, dynamic allocation would be better than having static vars. > > + > > +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx) > > +{ > > + const unsigned int lcore_id = rte_lcore_id(); > > + uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts; > > + struct lcore_telemetry *tdata = &telemetry_data[lcore_id]; > > + const bool empty = nb_rx == 0; > > + uint64_t diff_int, diff_last; > > + bool last_empty; > > + > > + last_empty = tdata->last_empty; > > + > > + /* optimization: don't do anything if status hasn't changed */ > > + if (last_empty == empty && tdata->contig_poll_cnt++ < 32) > > + return; > > + /* status changed or we're waiting for too long, reset counter */ > > + tdata->contig_poll_cnt = 0; > > + > > + cur_tsc = rte_rdtsc(); > > + > > + interval_ts = tdata->interval_ts; > > + empty_cycles = tdata->empty_cycles; > > + last_poll_ts = tdata->last_poll_ts; > > + > > + diff_int = cur_tsc - interval_ts; > > + diff_last = cur_tsc - last_poll_ts; > > + > > + /* is this the first time we're here? */ > > + if (interval_ts == 0) { > > + tdata->busyness = LCORE_BUSYNESS_MIN; > > + tdata->raw_busyness = 0; > > + tdata->interval_ts = cur_tsc; > > + tdata->empty_cycles = 0; > > + tdata->contig_poll_cnt = 0; > > + goto end; > > + } > > + > > + /* update the empty counter if we got an empty poll earlier */ > > + if (last_empty) > > + empty_cycles += diff_last; > > + > > + /* have we passed the interval? */ > > + if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) { > > > I think, this function logic can be limited to just updating the > timestamp in the ring buffer, > and another control function of telemetry which runs in control core to do > heavy lifting to reduce the performance impact on fast path, > > > + int raw_busyness; > > + > > + /* get updated busyness value */ > > + raw_busyness = calc_raw_busyness(tdata, empty_cycles, > > diff_int); > > + > > + /* set a new interval, reset empty counter */ > > + tdata->interval_ts = cur_tsc; > > + tdata->empty_cycles = 0; > > + tdata->raw_busyness = raw_busyness; > > + /* bring busyness back to 0..100 range, biased to round up > > */ > > + tdata->busyness = (raw_busyness + 50) / 100; > > + } else > > + /* we may have updated empty counter */ > > + tdata->empty_cycles = empty_cycles; > > + > > +end: > > + /* update status for next poll */ > > + tdata->last_poll_ts = cur_tsc; > > + tdata->last_empty = empty; > > +} > > + > > +#ifdef RTE_LCORE_BUSYNESS > > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \ > > + do { \ > > + if (__rte_lcore_telemetry_enabled) \ > > I think, rather than reading memory, Like Linux perf infrastructure, > we can patch up the instruction steam as NOP vs Timestamp capture > function. > Surely that requires much more complicated tooling? How would that work in this situation? > > Also instead of changing all libraries, Maybe we can use > "-finstrument-functions". > and just mark the function with attribute. > > Just 2c. > > > + __rte_lcore_telemetry_timestamp(nb_rx); \ > > + } while (0) > > +#else > > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \ > > + while (0) > > +#endif > > +