Hi Dave, 
    Please check comment below. 

On 28 Sep 11:47, Hunt, David wrote:
> Hi Liang,
> 
> 
> On 17/9/2018 2:30 PM, Liang Ma wrote:
> >1. Abstract
> >
> >For packet processing workloads such as DPDK polling is continuous.
> >This means CPU cores always show 100% busy independent of how much work
> >those cores are doing. It is critical to accurately determine how busy
> >a core is hugely important for the following reasons:
> >
> >    * No indication of overload conditions
> >
> >    * User do not know how much real load is on a system meaning resulted 
> >    in
> >      wasted energy as no power management is utilized
> >
> >Compared to the original l3fwd-power design, instead of going to sleep
> >after detecting an empty poll, the new mechanism just lowers the core
> >frequency. As a result, the application does not stop polling the device,
> >which leads to improved handling of bursts of traffic.
> >
> >When the system become busy, the empty poll mechanism can also increase the
> >core frequency (including turbo) to do best effort for intensive traffic.
> >This gives us more flexible and balanced traffic awareness over the
> >standard l3fwd-power application.
> >
> >2. Proposed solution
> >
> >The proposed solution focuses on how many times empty polls are executed.
> >The less the number of empty polls, means current core is busy with
> >processing workload, therefore, the higher frequency is needed. The high
> >empty poll number indicates the current core not doing any real work
> >therefore, we can lower the frequency to safe power.
> >
> >In the current implementation, each core has 1 empty-poll counter which
> >assume 1 core is dedicated to 1 queue. This will need to be expanded in the
> >future to support multiple queues per core.
> >
> >2.1 Power state definition:
> >
> >     LOW:  Not currently used, reserved for future use.
> >
> >     MED:  the frequency is used to process modest traffic workload.
> >
> >     HIGH: the frequency is used to process busy traffic workload.
> >
> >2.2 There are two phases to establish the power management system:
> >
> >     a.Initialization/Training phase. The training phase is necessary
> >       in order to figure out the system polling baseline numbers from
> >       idle to busy. The highest poll count will be during idle, where
> >       all polls are empty. These poll counts will be different between
> >       systems due to the many possible processor micro-arch, cache
> >       and device configurations, hence the training phase.
> >       In the training phase, traffic is blocked so the training
> >       algorithm can average the empty-poll numbers for the LOW, MED and
> >       HIGH  power states in order to create a baseline.
> >       The core's counter are collected every 10ms, and the Training
> >       phase will take 2 seconds.
> >       Training is disabled as default configuration. the default
> >       parameter is applied. Simple App still can trigger training
> 
> Typo: "Simple" should be "Sample"
> 
> Suggest adding: Once the training phase has been executed once on a 
> system, the application
> can then be started with the relevant thresholds provided on the command 
> line, allowing the
> application to start passing start traffic immediately.
agree
> 
> >       if that's needed.
> >
> >     b.Normal phase. When the training phase is complete, traffic is
> >       started. The run-time poll counts are compared with the
> >       baseline and the decision will be taken to move to MED power
> >       state or HIGH power state. The counters are calculated every 10ms.
> 
> Propose changing the first sentence:  Traffic starts immediately based 
> on the default
> thresholds, or based on the user supplied thresholds via the command 
> line parameters.
>
agree
> 
> 
> 
> >3. Proposed  API
> >
> >1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
> >             uint8_t *freq_tlb, struct ep_policy *policy);
> >which is used to initialize the power management system.
> >  
> >2.  rte_power_empty_poll_stat_free(void);
> >which is used to free the resource hold by power management system.
> >  
> >3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >which is used to update specific core empty poll counter, not thread safe
> >  
> >4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >which is used to update specific core valid poll counter, not thread safe
> >  
> >5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core empty poll counter.
> >  
> >6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core valid poll counter.
> >
> >7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >which is used to detect empty poll state changes then take action.
> >
> >ChangeLog:
> >v2: fix some coding style issues.
> >v3: rename the filename, API name.
> >v4: no change.
> >v5: no change.
> >v6: re-work the code layout, update API.
> >v7: fix minor typo and lift node num limit.
> >v8: disable training as default option.
> >
> >Signed-off-by: Liang Ma <liang.j...@intel.com>
> >
> >Reviewed-by: Lei Yao <lei.a....@intel.com>
> >---
> >  lib/librte_power/Makefile               |   6 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 539 
> >  ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  13 +
> >  5 files changed, 778 insertions(+), 4 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> >
> >diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> >index 6f85e88..a8f1301 100644
> >--- a/lib/librte_power/Makefile
> >+++ b/lib/librte_power/Makefile
> >@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
> >  # library name
> >  LIB = librte_power.a
> >  
> >+CFLAGS += -DALLOW_EXPERIMENTAL_API
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> >-LDLIBS += -lrte_eal
> >+LDLIBS += -lrte_eal -lrte_timer
> >  
> >  EXPORT_MAP := rte_power_version.map
> >  
> >@@ -16,8 +17,9 @@ LIBABIVER := 1
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> >+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> >  
> >  # install this header file
> >-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> >+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  
> >rte_power_empty_poll.h
> >  
> >  include $(RTE_SDK)/mk/rte.lib.mk
> >diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> >index 253173f..63957eb 100644
> >--- a/lib/librte_power/meson.build
> >+++ b/lib/librte_power/meson.build
> >@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >     build = false
> >  endif
> >  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> >-            'power_kvm_vm.c', 'guest_channel.c')
> >-headers = files('rte_power.h')
> >+            'power_kvm_vm.c', 'guest_channel.c',
> >+            'rte_power_empty_poll.c')
> >+headers = files('rte_power.h','rte_power_empty_poll.h')
> >diff --git a/lib/librte_power/rte_power_empty_poll.c 
> >b/lib/librte_power/rte_power_empty_poll.c
> >new file mode 100644
> >index 0000000..587ce78
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.c
> >@@ -0,0 +1,539 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#include <string.h>
> >+
> >+#include <rte_lcore.h>
> >+#include <rte_cycles.h>
> >+#include <rte_atomic.h>
> >+#include <rte_malloc.h>
> >+
> >+#include "rte_power.h"
> >+#include "rte_power_empty_poll.h"
> >+
> >+#define INTERVALS_PER_SECOND 100     /* (10ms) */
> >+#define SECONDS_TO_TRAIN_FOR 2
> >+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> >+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> >+#define DEFAULT_CYCLES_PER_PACKET 800
> >+
> >+static struct ep_params *ep_params;
> >+static uint32_t med_to_high_threshold = 
> >DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> >+static uint32_t high_to_med_threshold = 
> >DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> >+
> >+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> >+
> >+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> >+
> >+static uint32_t freq_index[NUM_FREQ];
> >+
> >+static uint32_t
> >+get_freq_index(enum freq_val index)
> >+{
> >+    return freq_index[index];
> >+}
> >+
> >+
> >+static int
> >+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> >+{
> >+    int err = 0;
> >+    uint32_t power_freq_index;
> >+    if (!specific_freq)
> >+            power_freq_index = get_freq_index(freq);
> >+    else
> >+            power_freq_index = freq;
> >+
> >+    err = rte_power_set_freq(lcore_id, power_freq_index);
> >+
> >+    return err;
> >+}
> >+
> >+
> >+static inline void __attribute__((always_inline))
> >+exit_training_state(struct priority_worker *poll_stats)
> >+{
> >+    RTE_SET_USED(poll_stats);
> >+}
> >+
> 
> Is this really needed? It does nothing, and is just local to this file.
> 
this is needed for debug purpose, I prefer keep it. 
> 
> >+static inline void __attribute__((always_inline))
> >+enter_training_state(struct priority_worker *poll_stats)
> >+{
> >+    poll_stats->iter_counter = 0;
> >+    poll_stats->cur_freq = LOW;
> >+    poll_stats->queue_state = TRAINING;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_normal_state(struct priority_worker *poll_stats)
> >+{
> >+    /* Clear the averages arrays and strs */
> >+    memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+    poll_stats->ec = 0;
> >+    memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+    poll_stats->pc = 0;
> >+
> >+    poll_stats->cur_freq = MED;
> >+    poll_stats->iter_counter = 0;
> >+    poll_stats->threshold_ctr = 0;
> >+    poll_stats->queue_state = MED_NORMAL;
> >+    RTE_LOG(INFO, POWER, "set the poewr freq to MED\n");
> 
> Typo, "poewr" should be "power", also suggest "Set" rather than "set"
>
agree
> 
> >+    set_power_freq(poll_stats->lcore_id, MED, false);
> >+
> >+    /* Try here */
> 
> Not sure about this comment?
will be removed
> 
> >+    poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> >+    poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_busy_state(struct priority_worker *poll_stats)
> >+{
> >+    memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+    poll_stats->ec = 0;
> >+    memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+    poll_stats->pc = 0;
> >+
> >+    poll_stats->cur_freq = HGH;
> >+    poll_stats->iter_counter = 0;
> >+    poll_stats->threshold_ctr = 0;
> >+    poll_stats->queue_state = HGH_BUSY;
> >+    set_power_freq(poll_stats->lcore_id, HGH, false);
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_purge_state(struct priority_worker *poll_stats)
> >+{
> >+    poll_stats->iter_counter = 0;
> >+    poll_stats->queue_state = LOW_PURGE;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_state(struct priority_worker *poll_stats,
> >+            enum queue_state new_state)
> >+{
> >+    enum queue_state old_state = poll_stats->queue_state;
> >+    if (old_state != new_state) {
> >+
> >+            /* Call any old state exit functions */
> >+            if (old_state == TRAINING)
> >+                    exit_training_state(poll_stats);
> 
> Is this needed? exit_training_state() does nothing.
> 
original code is used for debug purpose. we can leave it.
> >+            /* Call any new state entry functions */
> >+            if (new_state == TRAINING)
> >+                    enter_training_state(poll_stats);
> >+            if (new_state == MED_NORMAL)
> >+                    enter_normal_state(poll_stats);
> >+            if (new_state == HGH_BUSY)
> >+                    enter_busy_state(poll_stats);
> >+            if (new_state == LOW_PURGE)
> >+                    enter_purge_state(poll_stats);
> >+    }
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_policy(struct priority_worker *poll_stats,
> >+            struct ep_policy *policy)
> >+{
> >+    set_state(poll_stats, policy->state);
> >+
> >+    if (policy->state == TRAINING)
> >+            return;
> >+
> >+    poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
> >+    poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
> >+
> >+    poll_stats->thresh[MED_NORMAL].trained = true;
> >+    poll_stats->thresh[HGH_BUSY].trained = true;
> >+
> >+}
> >+
> >+static void
> >+update_training_stats(struct priority_worker *poll_stats,
> >+            uint32_t freq,
> >+            bool specific_freq,
> >+            uint32_t max_train_iter)
> >+{
> >+    RTE_SET_USED(specific_freq);
> >+
> >+    char pfi_str[32];
> >+    uint64_t p0_empty_deq;
> >+
> >+    sprintf(pfi_str, "%02d", freq);
> >+
> >+    if (poll_stats->cur_freq == freq &&
> >+                    poll_stats->thresh[freq].trained == false) {
> >+            if (poll_stats->thresh[freq].cur_train_iter == 0) {
> >+
> >+                    set_power_freq(poll_stats->lcore_id,
> >+                                    freq, specific_freq);
> >+
> >+                    poll_stats->empty_dequeues_prev =
> >+                            poll_stats->empty_dequeues;
> >+
> >+                    poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+                    return;
> >+            } else if (poll_stats->thresh[freq].cur_train_iter
> >+                            <= max_train_iter) {
> >+
> >+                    p0_empty_deq = poll_stats->empty_dequeues -
> >+                            poll_stats->empty_dequeues_prev;
> >+
> >+                    poll_stats->empty_dequeues_prev =
> >+                            poll_stats->empty_dequeues;
> >+
> >+                    poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> >+                    poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+            } else {
> >+                    if (poll_stats->thresh[freq].trained == false) {
> >+                            poll_stats->thresh[freq].base_edpi =
> >+                                    poll_stats->thresh[freq].base_edpi /
> >+                                    max_train_iter;
> >+
> >+                            /* Add on a factor of 0.05%*/
> >+                            /* this should remove any */
> >+                            /* false negatives when the system is 0% 
> >busy */
> 
> Multi line comment should follow the usual standard /* \n * text \n text 
> \n */
> 
agree
> >+                            poll_stats->thresh[freq].base_edpi +=
> >+                                    poll_stats->thresh[freq].base_edpi / 
> >2000;
> >+
> >+                            poll_stats->thresh[freq].trained = true;
> >+                            poll_stats->cur_freq++;
> >+
> >+                    }
> >+            }
> >+    }
> >+}
> >+
> >+static inline uint32_t __attribute__((always_inline))
> >+update_stats(struct priority_worker *poll_stats)
> >+{
> >+    uint64_t tot_edpi = 0, tot_ppi = 0;
> >+    uint32_t j, percent;
> >+
> >+    struct priority_worker *s = poll_stats;
> >+
> >+    uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> >+
> >+    s->empty_dequeues_prev = s->empty_dequeues;
> >+
> >+    uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> >+
> >+    s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> >+
> >+    if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> >+            RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> >+                            "cur edpi %ld "
> >+                            "base epdi %ld\n",
> >+                            cur_edpi,
> >+                            s->thresh[s->cur_freq].base_edpi);
> 
> Suggest making this log message more meaningful to the user. I suspect 
> that "cur_edpi" will not mean much to the user.
> What does edpi mean?
>
agree
> >+            /* Value to make us fail need debug log*/
> >+            return 1000UL;
> >+    }
> >+
> >+    s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> >+    s->ppi_av[s->pc++ % BINS_AV] = ppi;
> >+
> >+    for (j = 0; j < BINS_AV; j++) {
> >+            tot_edpi += s->edpi_av[j];
> >+            tot_ppi += s->ppi_av[j];
> >+    }
> >+
> >+    tot_edpi = tot_edpi / BINS_AV;
> >+
> >+    percent = 100 - (uint32_t)(((float)tot_edpi /
> >+                    (float)s->thresh[s->cur_freq].base_edpi) * 100);
> >+
> >+    return (uint32_t)percent;
> >+}
> >+
> >+
> >+static inline void  __attribute__((always_inline))
> >+update_stats_normal(struct priority_worker *poll_stats)
> >+{
> >+    uint32_t percent;
> >+
> >+    if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
> >+
> >+            RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n",
> >+                            poll_stats->cur_freq,
> >+                     poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> Again, a more meaningful explanation of edpi is needed here.
> 
agree
> >+            return;
> >+    }
> >+
> >+    percent = update_stats(poll_stats);
> >+
> >+    if (percent > 100) {
> >+            RTE_LOG(DEBUG, POWER, "Big than 100 abnormal\n");
> 
> Please change to something meaningful to the user. What is the 
> percentage returned from update_stats()?
> 
agree
> >+            return;
> >+    }
> >+
> >+    if (poll_stats->cur_freq == LOW)
> >+            RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> 
> Suggest adding "currently" - "Purge Mode is not currently supported\n"
> 
agree
> >+    else if (poll_stats->cur_freq == MED) {
> >+
> >+            if (percent >
> >+                    poll_stats->thresh[MED].threshold_percent) {
> >+
> >+                    if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+                            poll_stats->threshold_ctr++;
> >+                    else {
> >+                            set_state(poll_stats, HGH_BUSY);
> >+                            RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> >+                    }
> >+
> >+            } else {
> >+                    /* reset */
> >+                    poll_stats->threshold_ctr = 0;
> >+            }
> >+
> >+    } else if (poll_stats->cur_freq == HGH) {
> >+
> >+            if (percent <
> >+                            poll_stats->thresh[HGH].threshold_percent) {
> >+
> >+                    if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+                            poll_stats->threshold_ctr++;
> >+                    else {
> >+                            set_state(poll_stats, MED_NORMAL);
> >+                            RTE_LOG(INFO, POWER, "MOVE to MED\n");
> >+                    }
> >+            } else {
> >+                    /* reset */
> >+                    poll_stats->threshold_ctr = 0;
> >+            }
> >+
> >+    }
> >+}
> >+
> >+static int
> >+empty_poll_training(struct priority_worker *poll_stats,
> >+            uint32_t max_train_iter)
> >+{
> >+
> >+    if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> >+            poll_stats->iter_counter++;
> >+            return 0;
> >+    }
> >+
> >+
> >+    update_training_stats(poll_stats,
> >+                    LOW,
> >+                    false,
> >+                    max_train_iter);
> >+
> >+    update_training_stats(poll_stats,
> >+                    MED,
> >+                    false,
> >+                    max_train_iter);
> >+
> >+    update_training_stats(poll_stats,
> >+                    HGH,
> >+                    false,
> >+                    max_train_iter);
> >+
> >+
> >+    if (poll_stats->thresh[LOW].trained == true
> >+                    && poll_stats->thresh[MED].trained == true
> >+                    && poll_stats->thresh[HGH].trained == true) {
> >+
> >+            set_state(poll_stats, MED_NORMAL);
> >+
> >+            RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
> >+                            poll_stats->thresh[LOW].base_edpi);
> 
> Suggest "Low" change to "LOW" for consistency with other log messages below.
> 
agree
> >+
> >+            RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
> >+                            poll_stats->thresh[MED].base_edpi);
> >+
> >+
> >+            RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
> >+                            poll_stats->thresh[HGH].base_edpi);
> >+
> >+            RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> >+                            poll_stats->lcore_id);
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+void
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
> >+{
> >+
> >+    uint32_t i;
> >+
> >+    struct priority_worker *poll_stats;
> >+
> >+    RTE_SET_USED(tim);
> >+
> >+    RTE_SET_USED(arg);
> >+
> >+    for (i = 0; i < NUM_NODES; i++) {
> >+
> >+            poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> >+
> >+            if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> >+                    continue;
> >+
> >+            switch (poll_stats->queue_state) {
> >+            case(TRAINING):
> >+                    empty_poll_training(poll_stats,
> >+                                    ep_params->max_train_iter);
> >+                    break;
> >+
> >+            case(HGH_BUSY):
> >+            case(MED_NORMAL):
> >+                    update_stats_normal(poll_stats);
> >+                    break;
> >+
> >+            case(LOW_PURGE):
> >+                    break;
> >+            default:
> >+                    break;
> >+
> >+            }
> >+
> >+    }
> >+
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+            struct ep_policy *policy)
> >+{
> >+    uint32_t i;
> >+    /* Allocate the ep_params structure */
> >+    ep_params = rte_zmalloc_socket(NULL,
> >+                    sizeof(struct ep_params),
> >+                    0,
> >+                    rte_socket_id());
> >+
> >+    if (!ep_params)
> >+            rte_panic("Cannot allocate heap memory for ep_params "
> >+                            "for socket %d\n", rte_socket_id());
> >+
> >+    if (freq_tlb == NULL) {
> >+            freq_index[LOW] = 14;
> >+            freq_index[MED] = 9;
> >+            freq_index[HGH] = 1;
> >+    } else {
> >+            freq_index[LOW] = freq_tlb[LOW];
> >+            freq_index[MED] = freq_tlb[MED];
> >+            freq_index[HGH] = freq_tlb[HGH];
> >+    }
> >+
> >+    RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> >+
> >+    /* 5 seconds worth of training */
> 
> This now looks to be 2 seconds from the #define above. Maybe: /* Train 
> for pre-defined period */
> 
agree
> >+    ep_params->max_train_iter = INTERVALS_PER_SECOND * 
> >SECONDS_TO_TRAIN_FOR;
> >+
> >+    struct stats_data *w = &ep_params->wrk_data;
> >+
> >+    *eptr = ep_params;
> >+
> >+    /* initialize all wrk_stats state */
> >+    for (i = 0; i < NUM_NODES; i++) {
> >+
> >+            if (rte_lcore_is_enabled(i) == 0)
> >+                    continue;
> >+            /*init the freqs table */
> >+            total_avail_freqs[i] = rte_power_freqs(i,
> >+                            avail_freqs[i],
> >+                            NUM_FREQS);
> >+
> >+            RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
> >+                            total_avail_freqs[i],
> >+                            i);
> >+
> >+            if (get_freq_index(LOW) > total_avail_freqs[i])
> >+                    return -1;
> >+
> >+            if (rte_get_master_lcore() != i) {
> >+                    w->wrk_stats[i].lcore_id = i;
> >+                    set_policy(&w->wrk_stats[i], policy);
> >+            }
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void)
> >+{
> >+
> >+    RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> >+
> >+    if (ep_params != NULL)
> >+            rte_free(ep_params);
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id)
> >+{
> >+    struct priority_worker *poll_stats;
> >+
> >+    if (lcore_id >= NUM_NODES)
> >+            return -1;
> >+
> >+    poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+    if (poll_stats->lcore_id == 0)
> >+            poll_stats->lcore_id = lcore_id;
> >+
> >+    poll_stats->empty_dequeues++;
> >+
> >+    return 0;
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> >+{
> >+
> >+    struct priority_worker *poll_stats;
> >+
> >+    if (lcore_id >= NUM_NODES)
> >+            return -1;
> >+
> >+    poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+    if (poll_stats->lcore_id == 0)
> >+            poll_stats->lcore_id = lcore_id;
> >+
> >+    poll_stats->num_dequeue_pkts += nb_pkt;
> >+
> >+    return 0;
> >+}
> >+
> >+
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+    struct priority_worker *poll_stats;
> >+
> >+    if (lcore_id >= NUM_NODES)
> >+            return -1;
> >+
> >+    poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+    if (poll_stats->lcore_id == 0)
> >+            poll_stats->lcore_id = lcore_id;
> >+
> >+    return poll_stats->empty_dequeues;
> >+}
> >+
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+    struct priority_worker *poll_stats;
> >+
> >+    if (lcore_id >= NUM_NODES)
> >+            return -1;
> >+
> >+    poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+    if (poll_stats->lcore_id == 0)
> >+            poll_stats->lcore_id = lcore_id;
> >+
> >+    return poll_stats->num_dequeue_pkts;
> >+}
> >diff --git a/lib/librte_power/rte_power_empty_poll.h 
> >b/lib/librte_power/rte_power_empty_poll.h
> >new file mode 100644
> >index 0000000..ae27f7d
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.h
> >@@ -0,0 +1,219 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#ifndef _RTE_EMPTY_POLL_H
> >+#define _RTE_EMPTY_POLL_H
> >+
> >+/**
> >+ * @file
> >+ * RTE Power Management
> >+ */
> >+#include <stdint.h>
> >+#include <stdbool.h>
> >+
> >+#include <rte_common.h>
> >+#include <rte_byteorder.h>
> >+#include <rte_log.h>
> >+#include <rte_string_fns.h>
> >+#include <rte_power.h>
> >+#include <rte_timer.h>
> >+
> >+#ifdef __cplusplus
> >+extern "C" {
> >+#endif
> >+
> >+#define NUM_FREQS 20
> 
> I don't think this is enough. Suggest using RTE_MAX_LCORE_FREQS
> 
agree
> >+
> >+#define BINS_AV 4 /* Has to be ^2 */
> >+
> >+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> >+
> >+#define NUM_PRIORITIES          2
> >+
> >+#define NUM_NODES         256  /* Max core number*/
> >+
> >+/* Processor Power State */
> >+enum freq_val {
> >+    LOW,
> >+    MED,
> >+    HGH,
> >+    NUM_FREQ = NUM_FREQS
> >+};
> >+
> 
> Why is NUM_FREQ in this enum? 0,1,2,20 (or RTE_MAX_LCORE_FREQS) does not 
> seem right.
> If you are using NUM_FREQ in the code then why not just use NUM_FREQS.
> 
the reason is to have some spare space in case we need extend to more stage.

> >+
> >+/* Queue Polling State */
> >+enum queue_state {
> >+    TRAINING, /* NO TRAFFIC */
> >+    MED_NORMAL,   /* MED */
> >+    HGH_BUSY,     /* HIGH */
> >+    LOW_PURGE,    /* LOW */
> >+};
> >+
> >+/* Queue Stats */
> >+struct freq_threshold {
> >+
> >+    uint64_t base_edpi;
> >+    bool trained;
> >+    uint32_t threshold_percent;
> >+    uint32_t cur_train_iter;
> >+};
> >+
> >+/* Each Worder Thread Empty Poll Stats */
> >+struct priority_worker {
> >+
> >+    /* Current dequeue and throughput counts */
> >+    /* These 2 are written to by the worker threads */
> >+    /* So keep them on their own cache line */
> >+    uint64_t empty_dequeues;
> >+    uint64_t num_dequeue_pkts;
> >+
> >+    enum queue_state queue_state;
> >+
> >+    uint64_t empty_dequeues_prev;
> >+    uint64_t num_dequeue_pkts_prev;
> >+
> >+    /* Used for training only */
> >+    struct freq_threshold thresh[NUM_FREQ];
> >+    enum freq_val cur_freq;
> >+
> >+    /* bucket arrays to calculate the averages */
> >+    uint64_t edpi_av[BINS_AV];
> >+    uint32_t  ec;
> >+    uint64_t ppi_av[BINS_AV];
> >+    uint32_t  pc;
> >+
> >+    uint32_t lcore_id;
> >+    uint32_t iter_counter;
> >+    uint32_t threshold_ctr;
> >+    uint32_t display_ctr;
> >+    uint8_t  dev_id;
> >+
> >+} __rte_cache_aligned;
> >+
> 
> Suggest adding a comment on each of the variables above explaining what 
> the acronym means.
> E.g. edpi, ec, pc, ppi.
> 
agree
> 
> >+
> >+struct stats_data {
> >+
> >+    struct priority_worker wrk_stats[NUM_NODES];
> >+
> >+    /* flag to stop rx threads processing packets until training over */
> >+    bool start_rx;
> >+
> >+};
> >+
> >+/* Empty Poll Parameters */
> >+struct ep_params {
> >+
> >+    /* Timer related stuff */
> >+    uint64_t interval_ticks;
> >+    uint32_t max_train_iter;
> >+
> >+    struct rte_timer timer0;
> >+    struct stats_data wrk_data;
> >+};
> >+
> >+
> >+/* Sample App Init information */
> >+struct ep_policy {
> >+
> >+    uint64_t med_base_edpi;
> >+    uint64_t hgh_base_edpi;
> >+
> >+    enum queue_state state;
> >+};
> >+
> >+
> >+
> >+/**
> >+ * Initialize the power management system.
> >+ *
> >+ * @param eptr
> >+ *   the structure of empty poll configuration
> >+ * @freq_tlb
> >+ *   the power state/frequency  mapping table
> >+ * @policy
> >+ *   the initialization policy from sample app
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+            struct ep_policy *policy);
> >+
> >+/**
> >+ * Free the resource hold by power management system.
> >+ */
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void);
> >+
> >+/**
> >+ * Update specific core empty poll counter
> >+ * It's not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >+
> >+/**
> >+ * Update specific core valid poll counter, not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id.
> >+ * @param nb_pkt
> >+ *  The packet number of one valid poll.
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >+
> >+/**
> >+ * Fetch specific core empty poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore empty poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Fetch specific core valid poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore valid poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Empty poll  state change detection function
> >+ *
> >+ * @param  tim
> >+ *  The timer structure
> >+ * @param  arg
> >+ *  The customized parameter
> >+ */
> >+void  __rte_experimental
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >+
> >+#ifdef __cplusplus
> >+}
> >+#endif
> >+
> >+#endif
> >diff --git a/lib/librte_power/rte_power_version.map 
> >b/lib/librte_power/rte_power_version.map
> >index dd587df..11ffdfb 100644
> >--- a/lib/librte_power/rte_power_version.map
> >+++ b/lib/librte_power/rte_power_version.map
> >@@ -33,3 +33,16 @@ DPDK_18.08 {
> >     rte_power_get_capabilities;
> >  
> >  } DPDK_17.11;
> >+
> >+EXPERIMENTAL {
> >+        global:
> >+
> >+        rte_power_empty_poll_stat_init;
> >+        rte_power_empty_poll_stat_free;
> >+        rte_power_empty_poll_stat_update;
> >+        rte_power_empty_poll_stat_fetch;
> >+        rte_power_poll_stat_fetch;
> >+        rte_power_poll_stat_update;
> >+        rte_empty_poll_detection;
> >+
> >+};
> 
> checkpatch has several warnings:
> 
> 
> 
> ### lib/librte_power: traffic pattern aware power control
> 
> WARNING:LONG_LINE: line over 80 characters
> #355: FILE: lib/librte_power/rte_power_empty_poll.c:199:
> + poll_stats->thresh[freq].base_edpi / 2000;
> 
> WARNING:LONG_LINE: line over 80 characters
> #417: FILE: lib/librte_power/rte_power_empty_poll.c:261:
> + poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> total: 0 errors, 2 warnings, 802 lines checked
> ERROR: symbol rte_empty_poll_detection is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_fetch is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_free is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_init is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_update is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_fetch is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_update is added in a section other 
> than the EXPERIMENTAL section of the version map
> Warning in /lib/librte_power/rte_power_empty_poll.c:
> are you sure you want to add the following:
> rte_panic\(
> 
version map checking script(awk script to parse section name) has some issue 
here. 
> 
> Rgds,
> Dave.
> 
> 
> 
> 
> 
> 

Reply via email to