> On 22-Jun-21 10:41 AM, Ananyev, Konstantin wrote:
> >
> >> Currently, there is a hard limitation on the PMD power management
> >> support that only allows it to support a single queue per lcore. This is
> >> not ideal as most DPDK use cases will poll multiple queues per core.
> >>
> >> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> >> is very difficult to implement such support because callbacks are
> >> effectively stateless and have no visibility into what the other ethdev
> >> devices are doing. This places limitations on what we can do within the
> >> framework of Rx callbacks, but the basics of this implementation are as
> >> follows:
> >>
> >> - Replace per-queue structures with per-lcore ones, so that any device
> >>   polled from the same lcore can share data
> >> - Any queue that is going to be polled from a specific lcore has to be
> >>   added to the lcore's list of queues to poll, so that the callback is
> >>   aware of other queues being polled by the same lcore
> >> - Both the empty poll counter and the actual power saving mechanism are
> >>   shared between all queues polled on a particular lcore, and power
> >>   saving is only activated when a specially designated "power saving"
> >>   queue is polled. To put it another way, we have no idea which queue
> >>   the user will poll in what order, so we rely on them telling us that
> >>   queue X is the last one in the polling loop, so any power management
> >>   should happen there.
> >> - A new API is added to mark a specific Rx queue as "power saving".
> >
> > Honestly, I don't understand the logic behind that new function.
> > I understand that depending on HW we can monitor either one or multiple
> > queues.
> > That's OK, but why do we now need to mark one queue as a 'very special' one?
>
> Because we don't know which of the queues we are supposed to sleep on.
>
> Imagine a situation where you have 3 queues. What usually happens is you
> poll them in a loop, so q0, q1, q2, q0, q1, q2... etc. We only want to
> enter power-optimized state on polling q2, because otherwise we're
> risking going into power-optimized state while q1 or q2 have traffic.
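To make the scenario concrete, here is a minimal sketch of the setup and
polling loop being described. rte_power_ethdev_pmgmt_queue_enable() and
RTE_POWER_MGMT_TYPE_PAUSE are the existing library API; mark_power_save_queue()
is only a placeholder for the new "power saving queue" API discussed in this
thread, and the port number, queue count and chosen mode are purely
illustrative.

#include <stdbool.h>

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_power_pmd_mgmt.h>

/* Placeholder for the new "mark Rx queue as power saving" API discussed in
 * this thread; the real name and signature may differ. */
extern int mark_power_save_queue(unsigned int lcore_id, uint16_t port_id,
		uint16_t queue_id);

static void
lcore_poll_loop(volatile bool *quit)
{
	const uint16_t port = 0, nb_queues = 3;
	const unsigned int lcore = rte_lcore_id();
	uint16_t q;

	/* enable power management on every queue this lcore will poll
	 * (error handling omitted for brevity) */
	for (q = 0; q < nb_queues; q++)
		rte_power_ethdev_pmgmt_queue_enable(lcore, port, q,
				RTE_POWER_MGMT_TYPE_PAUSE);
	/* q2 is the last queue in the polling loop below, so it is the one
	 * where entering a power-optimized state is safe */
	mark_power_save_queue(lcore, port, nb_queues - 1);

	while (!*quit) {
		struct rte_mbuf *pkts[32];

		for (q = 0; q < nb_queues; q++) {
			uint16_t nb = rte_eth_rx_burst(port, q, pkts,
					RTE_DIM(pkts));
			/* ... process nb packets ... */
			RTE_SET_USED(nb);
		}
	}
}

In this pattern the sleep decision is only taken at q2; entering a
power-optimized state anywhere else in the loop risks sleeping while a later
queue still has traffic.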
That's why, before going to sleep, we need to make sure that for *all* queues
we have had at least EMPTYPOLL_MAX empty polls. Then the order of queue
checking wouldn't matter. With your example it should be:

if (q0.empty_polls > EMPTYPOLL_MAX && q1.empty_polls > EMPTYPOLL_MAX &&
    q2.empty_polls > EMPTYPOLL_MAX)
	goto_sleep;

Don't take me wrong, I am not suggesting we make *precisely* those checks in
the actual code (it could be time-consuming if the number of checks is big),
but the logic needs to remain.

>
> Worst case scenario, we enter sleep after polling q0, then traffic
> arrives at q2, we wake up, and then attempt to go to sleep on q1 instead
> of skipping it. Essentially, we will be attempting to sleep at every
> queue, instead of once in a loop. This *might* be OK for multi-monitor
> because we'll be aborting sleep due to sleep condition check failure,
> but for modes like rte_pause()/rte_power_pause()-based sleep, we will be
> entering sleep unconditionally, and will risk sleeping at q1 while
> there's traffic at q2.
>
> So, we need this mechanism to be activated once every *loop*, not per queue.
>
> > Why can't rte_power_ethdev_pmgmt_queue_enable() just:
> > check if the number of monitored queues exceeds HW/SW capabilities,
> > and if so, just return a failure;
> > otherwise add the queue to the list and treat them all equally, i.e.
> > go to power save mode when the number of sequential empty polls on
> > all monitored queues exceeds the EMPTYPOLL_MAX threshold?
> >
> >>   Failing to call this API will result in no power management, however
> >>   when having only one queue per core it is obvious which queue is the
> >>   "power saving" one, so things will still work without this new API for
> >>   use cases that were previously working without it.
> >> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> >>   is incapable of monitoring more than one address.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> >> ---
> >>  lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
> >>  lib/power/rte_power_pmd_mgmt.h |  34 ++++
> >>  lib/power/version.map          |   3 +
> >>  3 files changed, 306 insertions(+), 66 deletions(-)
> >>
> >> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> >> index 0707c60a4f..60dd21a19c 100644
> >> --- a/lib/power/rte_power_pmd_mgmt.c
> >> +++ b/lib/power/rte_power_pmd_mgmt.c
> >> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
> >>  	PMD_MGMT_ENABLED
> >>  };
> >>
> >> -struct pmd_queue_cfg {
> >> +struct queue {
> >> +	uint16_t portid;
> >> +	uint16_t qid;
> >> +};
> >
> > Just a thought: if that would help somehow, it can be changed to:
> >
> > union queue {
> > 	uint32_t raw;
> > 	struct {
> > 		uint16_t portid, qid;
> > 	};
> > };
> >
> > That way, in the queue find/cmp functions below you can operate on single
> > raw 32-bit values.
> > Probably not that important, as all these functions are on the slow path,
> > but it might look nicer.
>
> Sure, that can work. We actually do comparisons with the power save queue
> on the fast path, so maybe that'll help.
>
> >
> >> +struct pmd_core_cfg {
> >> +	struct queue queues[RTE_MAX_ETHPORTS];
> >
> > If we'll have the ability to monitor multiple queues per lcore, would it
> > always be enough?
> > On the other hand, it is updated on the control path only.
> > Wouldn't a normal list with malloc(/rte_malloc) be more suitable here?
>
> You're right, it should be dynamically allocated.
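As an illustration of how those two points could fit together (a
raw-comparable queue key plus a dynamically allocated per-lcore queue list),
here is a rough sketch; it is not the actual patch code, and the helper name
cfg_queue_add() is made up for the example.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#include <rte_malloc.h>

/* 32-bit key so that fast-path comparisons are a single integer compare */
union queue {
	uint32_t raw;
	struct {
		uint16_t portid;
		uint16_t qid;
	};
};

struct pmd_core_cfg {
	union queue *queues;    /* rte_malloc'ed, grown on the control path */
	size_t n_queues;
	union queue power_save_queue;
	bool power_save_queue_set;
};

/* fast path: is this the queue we are allowed to sleep on? */
static inline bool
queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q)
{
	/* with a single queue there is nothing else to check */
	if (cfg->n_queues == 1)
		return true;
	return cfg->power_save_queue_set &&
			q->raw == cfg->power_save_queue.raw;
}

/* control path: append a queue to the per-lcore list */
static int
cfg_queue_add(struct pmd_core_cfg *cfg, uint16_t portid, uint16_t qid)
{
	union queue *tmp;

	tmp = rte_realloc(cfg->queues, (cfg->n_queues + 1) * sizeof(*tmp), 0);
	if (tmp == NULL)
		return -ENOMEM;
	cfg->queues = tmp;
	cfg->queues[cfg->n_queues].portid = portid;
	cfg->queues[cfg->n_queues].qid = qid;
	cfg->n_queues++;
	return 0;
}

With that layout, queue_is_power_save() reduces to a single 32-bit compare on
the fast path, while the list itself is only touched on the control path.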
> >
> >> +	/**< Which port-queue pairs are associated with this lcore? */
> >> +	struct queue power_save_queue;
> >> +	/**< When polling multiple queues, all but this one will be ignored */
> >> +	bool power_save_queue_set;
> >> +	/**< When polling multiple queues, power save queue must be set */
> >> +	size_t n_queues;
> >> +	/**< How many queues are in the list? */
> >>  	volatile enum pmd_mgmt_state pwr_mgmt_state;
> >>  	/**< State of power management for this queue */
> >>  	enum rte_power_pmd_mgmt_type cb_mode;
> >> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
> >>  	uint64_t empty_poll_stats;
> >>  	/**< Number of empty polls */
> >>  } __rte_cache_aligned;
> >> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
> >>
> >> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> >> +static inline bool
> >> +queue_equal(const struct queue *l, const struct queue *r)
> >> +{
> >> +	return l->portid == r->portid && l->qid == r->qid;
> >> +}
> >> +
> >> +static inline void
> >> +queue_copy(struct queue *dst, const struct queue *src)
> >> +{
> >> +	dst->portid = src->portid;
> >> +	dst->qid = src->qid;
> >> +}
> >> +
> >> +static inline bool
> >> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
> >
> > Here and in other places - any reason why standard DPDK coding style is not
> > used?
>
> Just accidental :)
>
> >
> >> +	const struct queue *pwrsave = &cfg->power_save_queue;
> >> +
> >> +	/* if there's only single queue, no need to check anything */
> >> +	if (cfg->n_queues == 1)
> >> +		return true;
> >> +	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
> >> +}
> >> +
> >
>
> --
> Thanks,
> Anatoly