> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> >>>>>>>>> The instructions are implemented as raw byte opcodes because
> >>>>>>>>> there is not yet widespread compiler support for these
> >>>>>>>>> instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is reached,
> >>>>>>>>> or optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory location is written to. The monitor function also provides
> >>>>>>>>> an optional comparison, to avoid sleeping when the expected write
> >>>>>>>>> has already happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing is a function to wake up a sleeping
> >>>>>>>> core. If the user can/should use some system call to achieve that,
> >>>>>>>> then at least it has to be clearly documented, even better with some
> >>>>>>>> wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating the intrinsic and its usage, because AFAIK the only
> >>>>>>> way to wake up a sleeping core would be to send some kind of
> >>>>>>> interrupt to the core, or trigger a write to the cache line in
> >>>>>>> question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (off the top of my head, membarrier() does that; there might be some
> >>>>>> other syscalls too), or something hand-made. For hand-made, I wonder
> >>>>>> whether something like this would be safe and sufficient:
> >>>>>>
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>>
> >>>>>> Anyway, one way or another, I think the ability to wake up a core we
> >>>>>> put to sleep has to be an essential part of this feature.
> >>>>>> As I understand, the Linux kernel will limit the max amount of sleep
> >>>>>> time for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that seems too vague to me:
> >>>>>> - the user can adjust that value
> >>>>>> - it wouldn't apply to older kernels and non-Linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't need the value to wait for, you just need an address.
> >>>> And you can make the wakeup function accept an address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, I meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not always the case - with this particular PMD power management
> >>>>> scheme, we get the address from the PMD and it stays inside the
> >>>>> callback.
> >>>>
> >>>> That's fine - you can store the address inside your callback metadata
> >>>> and do the wakeup as part of the _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may have become stale, so I don't see how that would help unless you're
> >>> suggesting having some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should have said that straight away, to avoid further
> >> misunderstanding. Let's say we associate a spin_lock with monitor(), by
> >> analogy with pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, a raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
> >
> > Perhaps we can provide a synchronize-able wrapper around it, to avoid
> > adding overhead to callers that use this function but don't need the sync
> > mechanism?
Yes, there might be two flavours, something like rte_power_monitor() and
rte_power_monitor_sync(), or whatever would be a better name.

>
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>

I thought about something very similar to the cond_wait() working model:

/*
 * Caller has to obtain the lock before calling this function.
 */
static inline int
rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
	const uint64_t value_mask, const uint32_t state,
	const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
	uint64_t rflags;

	/* do whatever preparations are needed */
	....
	umonitor(p);

	if (value_mask != 0 &&
			(*((const uint64_t *)p) & value_mask) == expected_value) {
		return 0;
	}

	/* release the lock and go to sleep */
	rte_spinlock_unlock(lck);
	rflags = umwait();

	/* grab the lock back after wakeup */
	rte_spinlock_lock(lck);

	/* do the rest of the processing */
	....
}

/* similar to cond_signal() */
static inline void
rte_power_monitor_wakeup(volatile void *p)
{
	uint64_t v;

	v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

Now in librte_power:

struct pmd_queue_cfg {
	/* to protect pwr_mgmt_state and wait_addr */
	rte_spinlock_t lck;
	enum pmd_mgmt_state pwr_mgmt_state;
	void *wait_addr;
	/* rest of the fields */
	....
} __rte_cache_aligned;

static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
	struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
	uint16_t max_pkts __rte_unused, void *_ __rte_unused)
{
	struct pmd_queue_cfg *q_conf;

	q_conf = &port_cfg[port_id].queue_cfg[qidx];

	if (unlikely(nb_rx == 0)) {
		q_conf->empty_poll_stats++;
		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
			volatile void *target_addr;
			uint64_t expected, mask;
			uint16_t ret;

			/* grab the lock and check the state */
			rte_spinlock_lock(&q_conf->lck);
			if (q_conf->pwr_mgmt_state == ENABLED) {
				ret = rte_eth_get_wake_addr(port_id, qidx,
					&target_addr, &expected, &mask);
				if (ret == 0) {
					q_conf->wait_addr = target_addr;
					rte_power_monitor_sync(target_addr, ...,
						&q_conf->lck);
				}
				/* reset the wait_addr */
				q_conf->wait_addr = NULL;
			}
			rte_spinlock_unlock(&q_conf->lck);
			....
		}
	}
	....
}

int
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, uint16_t port_id,
	uint16_t queue_id)
{
	...
	/* grab the lock and change the state */
	rte_spinlock_lock(&queue_cfg->lck);
	queue_cfg->pwr_mgmt_state = DISABLED;

	/* wake up the core if necessary */
	if (queue_cfg->wait_addr != NULL)
		rte_power_monitor_wakeup(queue_cfg->wait_addr);
	rte_spinlock_unlock(&queue_cfg->lck);
	...
}
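
To make the intended interaction a bit more concrete, here is a rough usage
sketch of the two sides of this scheme. Only rte_power_pmd_mgmt_queue_disable()
and the Rx callback above come from this thread; everything else (the header
name, keep_running, the Rx loop, port/queue/lcore handling) is assumed purely
for illustration and is not part of the patch:

#include <stdbool.h>

#include <rte_ethdev.h>
#include <rte_launch.h>
#include <rte_power_pmd_mgmt.h>	/* assumed header for the proposed API */

static volatile bool keep_running = true;

/* polling lcore: the Rx callback installed by the power management library
 * may put this lcore to sleep when the queue stays empty for too long */
static int
rx_main(void *arg)
{
	uint16_t port_id = *(uint16_t *)arg;
	struct rte_mbuf *pkts[32];

	while (keep_running)
		rte_eth_rx_burst(port_id, 0, pkts, 32);
	return 0;
}

/* control thread: disabling management takes the queue's lock, flips the
 * state to DISABLED and, if wait_addr is set, calls
 * rte_power_monitor_wakeup(), so a sleeping lcore returns from umwait() */
static void
stop_rx(unsigned int rx_lcore, uint16_t port_id)
{
	keep_running = false;
	rte_power_pmd_mgmt_queue_disable(rx_lcore, port_id, 0);
	rte_eal_wait_lcore(rx_lcore);
}

The important property, same as with pthread_cond_wait(), is that the state
change and the wakeup happen under the same lock the sleeping side releases
just before umwait(), so a wakeup cannot slip in between the check and the
sleep.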