> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal, as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
>   polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>   added to the list of queues to poll, so that the callback is aware of
>   other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism are
>   shared between all queues polled on a particular lcore, and are only
>   activated when all queues in the list were polled and were determined
>   to have no traffic.
> - The limitation on UMWAIT-based polling is not removed, because UMWAIT
>   is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> ---
>
> Notes:
>     v6:
>     - Track each individual queue sleep status (Konstantin)
>     - Fix segfault (Dave)
>
>     v5:
>     - Remove the "power save queue" API and replace it with mechanism
>       suggested by Konstantin
>
>     v3:
>     - Move the list of supported NICs to NIC feature table
>
>     v2:
>     - Use a TAILQ for queues instead of a static array
>     - Address feedback from Konstantin
>     - Add additional checks for stopped queues
>
>  doc/guides/nics/features.rst           |  10 +
>  doc/guides/prog_guide/power_man.rst    |  65 ++--
>  doc/guides/rel_notes/release_21_08.rst |   3 +
>  lib/power/rte_power_pmd_mgmt.c         | 452 +++++++++++++++++++------
>  4 files changed, 394 insertions(+), 136 deletions(-)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>  * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>  * **[related] API**: ``rte_eth_rx_burst_mode_get()``,
>    ``rte_eth_tx_burst_mode_get()``.
>
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``.
> +
>  .. _nic_features_other:
>
>  Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>  Abstract
>  ~~~~~~~~
>
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> -   This power saving scheme will put the CPU into optimized power state
> -   and use the ``rte_power_monitor()`` function
> -   to monitor the Ethernet PMD RX descriptor address,
> -   and wake the CPU up whenever there's new traffic.
> -
> -Pause
> -   This power saving scheme will avoid busy polling
> -   by either entering power-optimized sleep state
> -   with ``rte_power_pause()`` function,
> -   or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> -   This power saving scheme will use ``librte_power`` library
> -   functionality to scale the core frequency up/down
> -   depending on traffic volume.
> -
> -.. note::
> -
> -   Currently, this power management API is limited to mandatory mapping
> -   of 1 queue to 1 core (multiple queues are supported,
> -   but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of them. The PMD power management API
> +provides a convenient alternative by utilizing Ethernet PMD RX callbacks, and
> +triggering power saving whenever the empty poll count reaches a certain number.
> +
> +* Monitor
> +   This power saving scheme will put the CPU into optimized power state and
> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> +   there's new traffic. Support for this scheme may not be available on all
> +   platforms, and further limitations may apply (see below).
> +
> +* Pause
> +   This power saving scheme will avoid busy polling by either entering
> +   power-optimized sleep state with the ``rte_power_pause()`` function, or,
> +   if that's not supported by the underlying platform, by using
> +   ``rte_pause()``.
> +
> +* Frequency scaling
> +   This power saving scheme will use the ``librte_power`` library
> +   functionality to scale the core frequency up/down depending on traffic
> +   volume.
> +
> +The "monitor" mode is only supported in the following configurations and scenarios:
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will
> +  be limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have
> +  to be monitored from a different lcore).
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is not supported, then monitor mode will not be
> +  supported.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> +  platform supports the necessary CPU instructions. Please refer to
> +  :doc:`../nics/overview` for more information.
> +
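For anyone trying this out: wiring queues into the mechanism from application
code would look roughly like the sketch below. It uses the existing
rte_power_ethdev_pmgmt_queue_enable() API that this patch extends; the helper
name, port/queue numbers and the choice of PAUSE mode are illustrative only.

#include <rte_lcore.h>
#include <rte_power_pmd_mgmt.h>

/* Sketch: enable PMD power management on every Rx queue polled by
 * this lcore. With this patch, PAUSE/SCALE modes accept multiple
 * queues per lcore; MONITOR mode remains one queue per lcore. */
static int
enable_pmgmt(uint16_t port_id, uint16_t nb_rx_queues)
{
	uint16_t q;
	int ret;

	for (q = 0; q < nb_rx_queues; q++) {
		ret = rte_power_ethdev_pmgmt_queue_enable(rte_lcore_id(),
				port_id, q, RTE_POWER_MGMT_TYPE_PAUSE);
		if (ret < 0)
			return ret;
	}
	return 0;
}

With this change, the same lcore_id can appear in multiple enable calls for
PAUSE and SCALE modes, which was previously rejected.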
....

> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
> +
> +	/* reset empty poll counter for this queue */
> +	qcfg->n_empty_polls = 0;
> +	/* reset the queue sleep counter as well */
> +	qcfg->n_sleeps = 0;
> +	/* remove the queue from the list of queues ready to sleep */
> +	if (is_ready_to_sleep)
> +		cfg->n_queues_ready_to_sleep--;
> +	/*
> +	 * no need to change the lcore sleep target counter because this lcore
> +	 * will reach the n_sleeps anyway, and the other cores are already
> +	 * counted so there's no need to do anything else.
> +	 */
> +}
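For readers without the rest of the patch handy, the two structures that
queue_reset() above operates on look roughly like this sketch; field names
are inferred from the quoted code, and the real definitions live earlier in
rte_power_pmd_mgmt.c.

#include <stdint.h>
#include <sys/queue.h>

/* sketch: per-queue state, one entry per Rx queue polled by an lcore */
struct queue_list_entry {
	TAILQ_ENTRY(queue_list_entry) next;
	uint64_t n_empty_polls;	/* consecutive empty polls on this queue */
	uint64_t n_sleeps;	/* last sleep iteration this queue joined */
};

/* sketch: per-lcore state shared by all queues polled from that lcore */
struct pmd_core_cfg {
	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
	uint64_t n_queues;	/* how many queues are in the list */
	uint64_t n_queues_ready_to_sleep;	/* queues past the threshold */
	uint64_t sleep_target;	/* sleep iteration the lcore aims for */
};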
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	/* this function is called - that means we have an empty poll */
> +	qcfg->n_empty_polls++;
> +
> +	/* if we haven't reached threshold for empty polls, we can't sleep */
> +	if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> +		return false;
> +
> +	/*
> +	 * we've reached a point where we are able to sleep, but we still need
> +	 * to check if this queue has already been marked for sleeping.
> +	 */
> +	if (qcfg->n_sleeps == cfg->sleep_target)
> +		return true;
> +
> +	/* mark this queue as ready for sleep */
> +	qcfg->n_sleeps = cfg->sleep_target;
> +	cfg->n_queues_ready_to_sleep++;
So, assuming there is no incoming traffic, should it be:

1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1);
   sleep; poll_all_queues(times=1); sleep; ...

OR

2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=EMPTYPOLL_MAX);
   sleep; poll_all_queues(times=EMPTYPOLL_MAX); sleep; ...

?

My initial thought was 2), but maybe the intention is 1)?

> +
> +	return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> +	/* are all queues ready to sleep? */
> +	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> +		return false;
> +
> +	/* we've reached an iteration where we can sleep, reset sleep counter */
> +	cfg->n_queues_ready_to_sleep = 0;
> +	cfg->sleep_target++;
> +
> +	return true;
> +}
> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -		void *addr __rte_unused)
> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>  {
> +	struct queue_list_entry *queue_conf = arg;
>
> -	struct pmd_queue_cfg *q_conf;
> -
> -	q_conf = &port_cfg[port_id][qidx];
> -
> +	/* this callback can't do more than one queue, omit multiqueue logic */
>  	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +		queue_conf->n_empty_polls++;
> +		if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>  			struct rte_power_monitor_cond pmc;
> -			uint16_t ret;
> +			int ret;
>
>  			/* use monitoring condition to sleep */
>  			ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +231,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  			rte_power_monitor(&pmc, UINT64_MAX);
>  		}
>  	} else
> -		q_conf->empty_poll_stats = 0;
> +		queue_conf->n_empty_polls = 0;
>
>  	return nb_rx;
>  }
>
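To check my understanding of the flow: the helpers above would compose inside
the pause/scale callbacks roughly as in this sketch. on_rx_poll() and
do_sleep() are placeholders, not names from the patch, and the fragment
assumes it sits next to the quoted helpers in rte_power_pmd_mgmt.c.

/* placeholder for whichever power-saving action is configured */
static void do_sleep(void);

/* sketch of the per-poll decision flow for one queue on an lcore */
static uint16_t
on_rx_poll(struct pmd_core_cfg *lcore_conf,
		struct queue_list_entry *queue_conf, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		/* empty poll: sleep only once the whole lcore agrees */
		if (queue_can_sleep(lcore_conf, queue_conf) &&
				lcore_can_sleep(lcore_conf))
			do_sleep();
	} else {
		/* traffic: take this queue out of the ready-to-sleep count */
		queue_reset(lcore_conf, queue_conf);
	}
	return nb_rx;
}

If that reading is right, then once the EMPTYPOLL_MAX ramp-up has passed, the
empty poll counters are never reset while traffic stays absent, so each
all-empty pass over the queue list advances sleep_target and sleeps again,
i.e. pattern 1) above. Would be good to have that confirmed.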