> Currently, there is a hard limitation in the PMD power management
> support: it only supports a single queue per lcore. This is not ideal,
> as most DPDK use cases will poll multiple queues per core.
> 
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
> 
> - Replace per-queue structures with per-lcore ones, so that any device
>   polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>   added to the list of queues to poll, so that the callback is aware of
>   other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism are
>   shared between all queues polled on a particular lcore, and are only
>   activated when all queues in the list have been polled and determined
>   to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>   is incapable of monitoring more than one address.
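
The per-lcore bookkeeping described above can be sketched roughly as below. This is an illustrative reconstruction based on the names visible in the diff further down (pmd_core_cfg, queue_list_entry, n_empty_polls, n_sleeps, sleep_target, n_queues_ready_to_sleep) — not necessarily the exact patch definitions:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* one entry per (port, queue) pair polled from this lcore */
struct queue_list_entry {
	TAILQ_ENTRY(queue_list_entry) next;
	uint16_t portid;
	uint16_t qid;
	uint64_t n_empty_polls; /* consecutive empty polls on this queue */
	uint64_t n_sleeps;      /* last sleep iteration this queue joined */
};

/* per-lcore state shared by all queues polled from that lcore */
struct pmd_core_cfg {
	TAILQ_HEAD(queue_lh, queue_list_entry) head;
	size_t n_queues;                /* number of queues in the list */
	size_t n_queues_ready_to_sleep; /* queues with no traffic this round */
	uint64_t sleep_target;          /* sleep iteration queues sync to */
};
```

Any Rx callback invoked on this lcore then gets its own queue_list_entry as the callback argument, while the shared pmd_core_cfg lets it see how the other queues on the same lcore are doing.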
> 
> Also, while we're at it, update and improve the docs.
> 
> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> ---
> 
> Notes:
>     v6:
>     - Track each individual queue sleep status (Konstantin)
>     - Fix segfault (Dave)
> 
>     v5:
>     - Remove the "power save queue" API and replace it with the mechanism
>       suggested by Konstantin
> 
>     v3:
>     - Move the list of supported NICs to NIC feature table
> 
>     v2:
>     - Use a TAILQ for queues instead of a static array
>     - Address feedback from Konstantin
>     - Add additional checks for stopped queues
> 
>  doc/guides/nics/features.rst           |  10 +
>  doc/guides/prog_guide/power_man.rst    |  65 ++--
>  doc/guides/rel_notes/release_21_08.rst |   3 +
>  lib/power/rte_power_pmd_mgmt.c         | 452 +++++++++++++++++++------
>  4 files changed, 394 insertions(+), 136 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>  * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>  * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
> 
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
>  .. _nic_features_other:
> 
>  Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>  Abstract
>  ~~~~~~~~
> 
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> -   This power saving scheme will put the CPU into optimized power state
> -   and use the ``rte_power_monitor()`` function
> -   to monitor the Ethernet PMD RX descriptor address,
> -   and wake the CPU up whenever there's new traffic.
> -
> -Pause
> -   This power saving scheme will avoid busy polling
> -   by either entering power-optimized sleep state
> -   with ``rte_power_pause()`` function,
> -   or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> -   This power saving scheme will use ``librte_power`` library
> -   functionality to scale the core frequency up/down
> -   depending on traffic volume.
> -
> -.. note::
> -
> -   Currently, this power management API is limited to mandatory mapping
> -   of 1 queue to 1 core (multiple queues are supported,
> -   but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of them. The PMD power management API
> +provides a convenient alternative by utilizing Ethernet PMD RX callbacks, and
> +triggering power saving whenever the empty poll count reaches a certain
> +threshold.
> +
> +* Monitor
> +   This power saving scheme will put the CPU into an optimized power state and
> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> +   there's new traffic. Support for this scheme may not be available on all
> +   platforms, and further limitations may apply (see below).
> +
> +* Pause
> +   This power saving scheme will avoid busy polling by either entering a
> +   power-optimized sleep state with the ``rte_power_pause()`` function, or, if
> +   that is not supported by the underlying platform, by using ``rte_pause()``.
> +
> +* Frequency scaling
> +   This power saving scheme will use ``librte_power`` library functionality to
> +   scale the core frequency up/down depending on traffic volume.
> +
> +The "monitor" mode is only supported in the following configurations and
> +scenarios:
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> +  limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to
> +  be monitored from a different lcore).
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that the
> +  ``rte_power_monitor()`` function is not supported, then monitor mode will
> +  not be supported.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> +  platform may support the necessary CPU instructions. Please refer to
> +  :doc:`../nics/overview` for more information.
> +
.... 
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +     const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
> +
> +     /* reset empty poll counter for this queue */
> +     qcfg->n_empty_polls = 0;
> +     /* reset the queue sleep counter as well */
> +     qcfg->n_sleeps = 0;
> +     /* this queue no longer counts towards queues ready to sleep */
> +     if (is_ready_to_sleep)
> +             cfg->n_queues_ready_to_sleep--;
> +     /*
> +      * no need to change the lcore sleep target counter because this queue
> +      * will reach the sleep target anyway, and the other queues are already
> +      * counted, so there's no need to do anything else.
> +      */
> +}
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +     /* this function is called - that means we have an empty poll */
> +     qcfg->n_empty_polls++;
> +
> +     /* if we haven't reached threshold for empty polls, we can't sleep */
> +     if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> +             return false;
> +
> +     /*
> +      * we've reached a point where we are able to sleep, but we still need
> +      * to check if this queue has already been marked for sleeping.
> +      */
> +     if (qcfg->n_sleeps == cfg->sleep_target)
> +             return true;
> +
> +     /* mark this queue as ready for sleep */
> +     qcfg->n_sleeps = cfg->sleep_target;
> +     cfg->n_queues_ready_to_sleep++;

So, assuming there is no incoming traffic, should it be:
1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1);
   sleep; poll_all_queues(times=1); sleep; ...
OR
2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=EMPTYPOLL_MAX);
   sleep; poll_all_queues(times=EMPTYPOLL_MAX); sleep; ...
?

My initial thought was 2), but perhaps the intention is 1)?

> +
> +     return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> +     /* are all queues ready to sleep? */
> +     if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> +             return false;
> +
> +     /* we've reached an iteration where we can sleep, reset sleep counter */
> +     cfg->n_queues_ready_to_sleep = 0;
> +     cfg->sleep_target++;
> +
> +     return true;
> +}
> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -             uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -             void *addr __rte_unused)
> +             uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>  {
> +     struct queue_list_entry *queue_conf = arg;
> 
> -     struct pmd_queue_cfg *q_conf;
> -
> -     q_conf = &port_cfg[port_id][qidx];
> -
> +     /* this callback can't handle more than one queue, so omit multiqueue logic */
>       if (unlikely(nb_rx == 0)) {
> -             q_conf->empty_poll_stats++;
> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +             queue_conf->n_empty_polls++;
> +             if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>                       struct rte_power_monitor_cond pmc;
> -                     uint16_t ret;
> +                     int ret;
> 
>                       /* use monitoring condition to sleep */
>                       ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +231,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>                               rte_power_monitor(&pmc, UINT64_MAX);
>               }
>       } else
> -             q_conf->empty_poll_stats = 0;
> +             queue_conf->n_empty_polls = 0;
> 
>       return nb_rx;
>  }
> 
