On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
Add two new power management intrinsics, and provide an implementation in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions are implemented as raw byte opcodes because there is not yet widespread compiler support for these instructions.
The power management instructions provide an architecture-specific function to either wait until a specified TSC timestamp is reached, or optionally wait until either a TSC timestamp is reached or a memory location is written to. The monitor function also provides an optional comparison, to avoid sleeping when the expected write has already happened, and no more writes are expected.
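For context, here is a minimal usage sketch of the proposed monitor intrinsic; the prototype is assumed from the rte_power_monitor_sync() sketch later in this thread, so the header name, parameter names and parameter order may not match the final API:

#include <stdint.h>
#include <rte_cycles.h>              /* rte_rdtsc(), rte_get_tsc_hz() */
#include <rte_power_intrinsics.h>    /* assumed header for the new intrinsics */

/* Hypothetical flag written by another agent (a NIC, another core, ...). */
static volatile uint64_t wake_flag;

static inline void
wait_for_flag(void)
{
    /* sleep for at most roughly one second worth of TSC cycles */
    const uint64_t deadline = rte_rdtsc() + rte_get_tsc_hz();

    /*
     * Arm UMONITOR on &wake_flag and UMWAIT until either the flag is
     * written or the TSC deadline is reached.  With expected_value == 1
     * and a full mask, the call returns immediately if the flag is
     * already set, instead of sleeping.
     */
    rte_power_monitor(&wake_flag, 1 /* expected_value */,
            UINT64_MAX /* value_mask */, 0 /* state */, deadline);
}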
I think what this API is missing is a function to wake up a sleeping core. If the user can/should use some system call to achieve that, then at least it has to be clearly documented, and even better, a wrapper provided.
I don't think it's possible to do that without severely overcomplicating the intrinsic and its usage, because AFAIK the only way to wake up a sleeping core would be to send some kind of interrupt to the core, or trigger a write to the cache line in question.
Yes, I think we either need a syscall that would do an IPI for us (off the top of my head, membarrier() does that; there might be other syscalls too), or something hand-made. For hand-made, I wonder whether something like this would be safe and sufficient:
uint64_t val = atomic_load(addr);
CAS(addr, val, &val);
?
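Spelled out with GCC atomic builtins, that hand-made wakeup might look like the sketch below; whether a CAS that stores the same value back is guaranteed to wake a core sleeping in UMWAIT is exactly the open question, so treat this as an illustration only:

#include <stdint.h>

/*
 * Hand-made wakeup sketch: force a store onto the cache line the sleeping
 * core is monitoring, without changing its contents.  Since the compare
 * value is freshly loaded, the CAS will normally succeed and write the
 * same value back.
 */
static inline void
wakeup_sketch(volatile uint64_t *addr)
{
    uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

    __atomic_compare_exchange_n(addr, &val, val, 0,
            __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}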
Anyway, one way or another, I think the ability to wake up a core we put to sleep has to be an essential part of this feature.
As I understand it, the Linux kernel will limit the maximum sleep time for these instructions:
https://lwn.net/Articles/790920/
But relying just on that seems too fragile to me:
- the user can adjust that value
- it wouldn't apply to older kernels and non-Linux cases
Konstantin
This implies knowing the value the core is sleeping on.
You don't need the value to wait for, you just need an address. And you can make the wakeup function accept an address as a parameter, the same as monitor() does.
Sorry, I meant the address. We don't necessarily know the address we're sleeping on - with this particular PMD power management scheme, we get the address from the PMD and it stays inside the callback.
That's fine - you can store the address inside your callback metadata and do the wakeup as part of the _disable_ function.
The address may be different, and by the time we access the address it may have become stale, so I don't see how that would help unless you're suggesting we have some kind of synchronization mechanism there.
Yes, we'll need something to sync here for sure.
Sorry, I should have said it straight away, to avoid further misunderstanding.
Let's say we associate a spinlock with monitor(), by analogy with pthread_cond_wait().
Konstantin
The idea was to provide an intrinsic-like function - as in, raw
instruction call, without anything extra. We even added the masks/values
etc. only because there's no race-less way to combine UMONITOR/UMWAIT
without those.
Perhaps we can provide a synchronizable wrapper around it, to avoid adding overhead to callers that don't need the sync mechanism?
Yes, there might be two flavours, something like rte_power_monitor() and rte_power_monitor_sync(), or whatever would be a better name.
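To make the split concrete, the two flavours could look roughly like the declarations below; the names and parameters are only a guess based on the sketch that follows, not a final API:

#include <stdint.h>
#include <rte_spinlock.h>

/* Raw flavour: bare UMONITOR/UMWAIT, the caller deals with wakeup races. */
int rte_power_monitor(const volatile void *p, const uint64_t expected_value,
        const uint64_t value_mask, const uint32_t state,
        const uint64_t tsc_timestamp);

/*
 * Sync flavour: same semantics, but releases 'lck' before going to sleep
 * and re-takes it after wakeup, so a writer holding the lock can update
 * the monitored location and call a wakeup helper without racing the
 * sleeper.
 */
int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
        const uint64_t value_mask, const uint32_t state,
        const uint64_t tsc_timestamp, rte_spinlock_t *lck);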
Also, how would having a spinlock help to synchronize? Are you
suggesting we do UMWAIT on a spinlock address, or something to that effect?
I thought about something very similar to the cond_wait() working model:
/*
 * Caller has to obtain the lock before calling this function.
 */
static inline int
rte_power_monitor_sync(const volatile void *p,
    const uint64_t expected_value, const uint64_t value_mask,
    const uint32_t state, const uint64_t tsc_timestamp,
    rte_spinlock_t *lck)
{
    /* do whatever preparations are needed */
    ....
    umonitor(p);

    /* if the expected write already happened, don't sleep at all */
    if (value_mask != 0 &&
            (*((const uint64_t *)p) & value_mask) == expected_value)
        return 0;

    /* release the lock and go to sleep */
    rte_spinlock_unlock(lck);
    rflags = umwait();

    /* grab the lock back after wakeup */
    rte_spinlock_lock(lck);

    /* do the rest of the processing */
    ....
}
/* similar to cond_signal */
static inline void
rte_power_monitor_wakeup(volatile void *p)
{
    uint64_t v;

    /* store the current value back to trigger a write on the monitored line */
    v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
    __atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
            __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
Now in librte_power:
struct pmd_queue_cfg {
    /* protects pwr_mgmt_state and wait_addr */
    rte_spinlock_t lck;
    enum pmd_mgmt_state pwr_mgmt_state;
    void *wait_addr;
    /* rest of the fields */
    ....
} __rte_cache_aligned;
static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
    struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
    uint16_t max_pkts __rte_unused, void *_ __rte_unused)
{
    struct pmd_queue_cfg *q_conf;

    q_conf = &port_cfg[port_id].queue_cfg[qidx];

    if (unlikely(nb_rx == 0)) {
        q_conf->empty_poll_stats++;
        if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
            volatile void *target_addr;
            uint64_t expected, mask;
            uint16_t ret;

            /* grab the lock and check the state */
            rte_spinlock_lock(&q_conf->lck);
            if (q_conf->pwr_mgmt_state == ENABLED) {
                ret = rte_eth_get_wake_addr(port_id, qidx,
                        &target_addr, &expected, &mask);
                if (ret == 0) {
                    q_conf->wait_addr = target_addr;
                    rte_power_monitor_sync(target_addr, ...,
                            &q_conf->lck);
                }
                /* reset the wait_addr */
                q_conf->wait_addr = NULL;
            }
            rte_spinlock_unlock(&q_conf->lck);
            ....
        }
    }
    ....
}
int
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
    uint16_t port_id, uint16_t queue_id)
{
    ...
    /* grab the lock and change the state */
    rte_spinlock_lock(&q_conf->lck);
    q_conf->pwr_mgmt_state = DISABLED;

    /* wake the core up if necessary */
    if (q_conf->wait_addr != NULL)
        rte_power_monitor_wakeup(q_conf->wait_addr);
    rte_spinlock_unlock(&q_conf->lck);
    ...
}
Yeah, it seems that I understood you correctly the first time, then. I'm not completely convinced that this overhead and complexity is worth the trouble, to be honest. I mean, it's not like we're going to sleep indefinitely - this isn't like pthread wait. The biggest sleep time I've seen was around half a second, and I'm not sure there is a use case for enabling/disabling this functionality willy-nilly every 5 seconds.
--
Thanks,
Anatoly