> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> >>>>>>>>> The instructions are implemented as raw byte opcodes because
> >>>>>>>>> there is not yet widespread compiler support for these
> >>>>>>>>> instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is reached,
> >>>>>>>>> or optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory location is written to. The monitor function also provides
> >>>>>>>>> an optional comparison, to avoid sleeping when the expected write
> >>>>>>>>> has already happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing is a function to wake up a sleeping
> >>>>>>>> core. If the user can/should use some system call to achieve that,
> >>>>>>>> then at least it has to be clearly documented, even better with some
> >>>>>>>> wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating the intrinsic and its usage, because AFAIK the only
> >>>>>>> way to wake up a sleeping core would be to send some kind of
> >>>>>>> interrupt to the core, or trigger a write to the cache line in
> >>>>>>> question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (off the top of my head, membarrier() does that; there might be some
> >>>>>> other syscalls too), or something hand-made. For hand-made, I wonder
> >>>>>> whether something like this would be safe and sufficient:
> >>>>>>
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>>
> >>>>>> Anyway, one way or another, I think the ability to wake up a core we
> >>>>>> put to sleep has to be an essential part of this feature.
> >>>>>> As I understand, the Linux kernel will limit the max amount of sleep
> >>>>>> time for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that seems too vague to me:
> >>>>>> - the user can adjust that value
> >>>>>> - it wouldn't apply to older kernels and non-Linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't need the value to wait for, you just need an address.
> >>>> And you can make the wakeup function accept an address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, I meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not always the case - with this particular PMD power management
> >>>>> scheme, we get the address from the PMD and it stays inside the
> >>>>> callback.
> >>>>
> >>>> That's fine - you can store the address inside your callback metadata
> >>>> and do the wakeup as part of the _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may have become stale, so I don't see how that would help unless you're
> >>> suggesting having some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should have said that straight away, to avoid further
> >> misunderstanding. Let's say we associate a spin_lock with monitor(), by
> >> analogy with pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, a raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
> >
> > Perhaps we can provide a synchronize-able wrapper around it, to avoid
> > adding overhead to callers that use this function but don't need the sync
> > mechanism?
Yes, there might be two flavours, something like rte_power_monitor() and
rte_power_monitor_sync(), or whatever would be a better name.

>
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>

I thought about something very similar to the cond_wait() working model:

/*
 * Caller has to obtain the lock before calling this function.
 */
static inline int
rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
	const uint64_t value_mask, const uint32_t state,
	const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
	uint64_t rflags;

	/* do whatever preparations are needed */
	....
	umonitor(p);

	if (value_mask != 0 &&
			(*((const uint64_t *)p) & value_mask) == expected_value) {
		return 0;
	}

	/* release the lock and go to sleep */
	rte_spinlock_unlock(lck);
	rflags = umwait();

	/* grab the lock back after wakeup */
	rte_spinlock_lock(lck);

	/* do the rest of the processing */
	....
}

/* similar to cond_signal() */
static inline void
rte_power_monitor_wakeup(volatile void *p)
{
	uint64_t v;

	v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

Now in librte_power:

struct pmd_queue_cfg {
	/* to protect pwr_mgmt_state and wait_addr */
	rte_spinlock_t lck;
	enum pmd_mgmt_state pwr_mgmt_state;
	void *wait_addr;
	/* rest of the fields */
	....
} __rte_cache_aligned;

static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
	struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
	uint16_t max_pkts __rte_unused, void *_ __rte_unused)
{
	struct pmd_queue_cfg *q_conf;

	q_conf = &port_cfg[port_id].queue_cfg[qidx];

	if (unlikely(nb_rx == 0)) {
		q_conf->empty_poll_stats++;
		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
			volatile void *target_addr;
			uint64_t expected, mask;
			uint16_t ret;

			/* grab the lock and check the state */
			rte_spinlock_lock(&q_conf->lck);
			if (q_conf->pwr_mgmt_state == ENABLED) {
				ret = rte_eth_get_wake_addr(port_id, qidx,
					&target_addr, &expected, &mask);
				if (ret == 0) {
					q_conf->wait_addr = target_addr;
					rte_power_monitor_sync(target_addr, ...,
						&q_conf->lck);
				}
				/* reset the wait_addr */
				q_conf->wait_addr = NULL;
			}
			rte_spinlock_unlock(&q_conf->lck);
			....
		}
	}
	....
}

int
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, uint16_t port_id,
	uint16_t queue_id)
{
	...
	/* grab the lock and change the state */
	rte_spinlock_lock(&queue_cfg->lck);
	queue_cfg->pwr_mgmt_state = DISABLED;

	/* wake up the core if necessary */
	if (queue_cfg->wait_addr != NULL)
		rte_power_monitor_wakeup(queue_cfg->wait_addr);
	rte_spinlock_unlock(&queue_cfg->lck);
	...
}
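
To make the intended interaction a bit more concrete, here is a rough usage
sketch of the two sides of this scheme. Only rte_power_pmd_mgmt_queue_disable()
and the Rx callback above come from this thread; everything else (the header
name, keep_running, the Rx loop, port/queue/lcore handling) is assumed purely
for illustration and is not part of the patch:

#include <stdbool.h>

#include <rte_ethdev.h>
#include <rte_launch.h>
#include <rte_power_pmd_mgmt.h>	/* assumed header for the proposed API */

static volatile bool keep_running = true;

/* polling lcore: the Rx callback installed by the power management library
 * may put this lcore to sleep when the queue stays empty for too long */
static int
rx_main(void *arg)
{
	uint16_t port_id = *(uint16_t *)arg;
	struct rte_mbuf *pkts[32];

	while (keep_running)
		rte_eth_rx_burst(port_id, 0, pkts, 32);
	return 0;
}

/* control thread: disabling management takes the queue's lock, flips the
 * state to DISABLED and, if wait_addr is set, calls
 * rte_power_monitor_wakeup(), so a sleeping lcore returns from umwait() */
static void
stop_rx(unsigned int rx_lcore, uint16_t port_id)
{
	keep_running = false;
	rte_power_pmd_mgmt_queue_disable(rx_lcore, port_id, 0);
	rte_eal_wait_lcore(rx_lcore);
}

The important property, same as with pthread_cond_wait(), is that the state
change and the wakeup happen under the same lock the sleeping side releases
just before umwait(), so a wakeup cannot slip in between the check and the
sleep.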