<snip>
> > Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> >
> > The num_mapped_cores and execute_lock are synchronized with the
> > rte_atomic_XX APIs, which are full barriers (DMB) on aarch64. This
> > patch optimizes them with C11 atomic one-way barriers.
> >
> > Signed-off-by: Phil Yang <phil.y...@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > Reviewed-by: Gavin Hu <gavin...@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>
> Based on discussion on-list, it seems the consensus is to not use GCC builtins,
> but instead use C11 APIs "proper"? If my conclusion is correct, the v+1 of this
> patchset would require updates to that style of API.
>
> Inline comments for context below, -Harry
>
> > ---
> >  lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
> >  1 file changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
> > index 0843c3c..c033224 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> >      * running this service callback. When not set, a core may take the
> >      * lock and then run the service callback.
> >      */
> > -   rte_atomic32_t execute_lock;
> > +   uint32_t execute_lock;
> >
> >     /* API set/get-able variables */
> >     int8_t app_runstate;
> > @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> >      * It does not indicate the number of cores the service is running
> >      * on currently.
> >      */
> > -   rte_atomic32_t num_mapped_cores;
> > +   int32_t num_mapped_cores;
>
> Any reason why "int32_t" or "uint32_t" is used over the other?
> execute_lock is a uint32_t above, num_mapped_cores is an int32_t?
>
> >     uint64_t calls;
> >     uint64_t cycles_spent;
> >  } __rte_cache_aligned;
> > @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
> >     rte_smp_rmb();
> >
> >     int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> > -   int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
> > +   int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> > +           __ATOMIC_RELAXED) > 0);
> >
> >     return (s->app_runstate == RUNSTATE_RUNNING) &&
> >             (s->comp_runstate == RUNSTATE_RUNNING) &&
> > @@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> >     cs->service_active_on_lcore[i] = 1;
> >
> >     if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> > -           if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> > +           uint32_t expected = 0;
> > +           /* ACQUIRE ordering here is to prevent the callback
> > +            * function from hoisting up before the execute_lock
> > +            * setting.
> > +            */
> > +           if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
> > +                   0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> >                     return -EBUSY;
>
> Let's try to improve the magic "1" and "0" constants. I believe the "1" here is
> the desired "new value on success", and the 0 is "bool weak", where our 0/false
> constant implies a strongly ordered compare exchange?
> > "Weak is true for weak compare_exchange, which may fail spuriously, and > false for the strong variation, which never fails spuriously.", from > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html > > const uint32_t on_success_value = 1; > const bool weak = 0; > __atomic_compare_exchange_n(&s->execute_lock, &expected, > on_success_value, weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED); > > > Although a bit more verbose, I feel this documents usage a lot better, > particularly for those who aren't as familiar with the C11 function arguments > order. > > Admittedly with the API change to not use __builtins, perhaps this comment is > moot. Suggest changing the execute_lock to rte_spinlock_t and use rte_spinlock_trylock API.
> >             service_runner_do_callback(s, cs, i);
> > -           rte_atomic32_clear(&s->execute_lock);
> > +           /* RELEASE ordering here is used to pair with ACQUIRE
> > +            * above to achieve lock semantic.
> > +            */
> > +           __atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
> >     } else
> >             service_runner_do_callback(s, cs, i);
> >
> > @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
> >     /* Increment num_mapped_cores to indicate that the service
> >      * is running on a core.
> >      */
> > -   rte_atomic32_inc(&s->num_mapped_cores);
> > +   __atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
> >
> >     int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> >
> > -   rte_atomic32_dec(&s->num_mapped_cores);
> > +   __atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
> >
> >     return ret;
> >  }
> > @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
> >
> >     uint64_t sid_mask = UINT64_C(1) << sid;
> >     if (set) {
> > -           uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> > -                   sid_mask;
> > +           /* When multiple threads try to update the same lcore
> > +            * service concurrently, e.g. set lcore map followed
> > +            * by clear lcore map, the unsynchronized service_mask
> > +            * values have issues on the num_mapped_cores value
> > +            * consistency. So we use ACQUIRE ordering to pair with
> > +            * the RELEASE ordering to synchronize the service_mask.
> > +            */
> > +           uint64_t lcore_mapped = __atomic_load_n(
> > +                   &lcore_states[lcore].service_mask,
> > +                   __ATOMIC_ACQUIRE) & sid_mask;
>
> Thanks for the comment - it helps me understand things a bit better.
> Some questions/theories to validate:
>
> 1) The service_mask ACQUIRE avoids other loads being hoisted above it,
> correct?
>
> 2) There are non-atomic stores to service_mask. Is it correct that the stores
> themselves aren't the issue, but rather the relative visibility of
> service_mask stores vs. num_mapped_cores? (Detail in (3) below.)
>
> >             if (*set && !lcore_mapped) {
> >                     lcore_states[lcore].service_mask |= sid_mask;
> > -                   rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> > +                   __atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> > +                           1, __ATOMIC_RELEASE);
> >             }
> >             if (!*set && lcore_mapped) {
> >                     lcore_states[lcore].service_mask &= ~(sid_mask);
> > -                   rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> > +                   __atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> > +                           1, __ATOMIC_RELEASE);
> >             }
>
> 3) Here we update the core-local service_mask, and then update
> num_mapped_cores with an ATOMIC_RELEASE. The RELEASE here ensures
> that the previous store to service_mask is guaranteed to be visible on all
> cores if this store is visible. Why do we care about this property?
> The service_mask is core local anyway.

We are working on concurrency between the reader and the writer: the
service_mask is local to the core, but it is accessed by both a reader and a
writer. I think we should wait to conclude on the meaning of
'num_mapped_cores'; that will dictate what the ordering should be. For
example, if it is just for statistics purposes, then we could use RELAXED
memory order, and the ordering for service_mask would change as well.

> 4) Even with the load-ACQ of service_mask and the REL store to
> num_mapped_cores, is there not still a race condition possible where 2 lcores
> simultaneously load-ACQ the service_mask, and then both do an atomic
> add/sub_fetch with REL?
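To make (4) concrete, here is a small stand-alone demo; the two globals are
simplified stand-ins for lcore_states[].service_mask and
rte_services[].num_mapped_cores (not the real structs), and both threads
deliberately target the same lcore/service pair. The ACQUIRE/RELEASE pairing
orders the accesses, but it does not make the load plus read-modify-write pair
atomic, so two writers can both pass the "not mapped" check and the counter
drifts away from 0.

/* Compile: gcc -O2 -pthread two_writer_race.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t service_mask;    /* stand-in: which services this lcore runs */
static int32_t num_mapped_cores; /* stand-in: how many lcores map the service */

static void
map_then_unmap(uint64_t sid_mask)
{
        /* "set" path, mirroring service_update(): load-ACQ, then RMW-REL */
        uint64_t mapped = __atomic_load_n(&service_mask,
                        __ATOMIC_ACQUIRE) & sid_mask;
        if (!mapped) {
                service_mask |= sid_mask;       /* non-atomic, as in the patch */
                __atomic_add_fetch(&num_mapped_cores, 1, __ATOMIC_RELEASE);
        }

        /* "clear" path */
        mapped = __atomic_load_n(&service_mask, __ATOMIC_ACQUIRE) & sid_mask;
        if (mapped) {
                service_mask &= ~sid_mask;
                __atomic_sub_fetch(&num_mapped_cores, 1, __ATOMIC_RELEASE);
        }
}

static void *
worker(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++)
                map_then_unmap(UINT64_C(1) << 3);
        return NULL;
}

int
main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* A single writer always ends at 0; two concurrent writers can both
         * observe "not mapped" (or both "mapped") and double-count.
         */
        printf("num_mapped_cores = %d\n", (int)num_mapped_cores);
        return 0;
}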
> 5) Assuming the race in (4) above is real, it raises the real question: the
> service-cores control APIs are not designed to be multi-thread-safe.
> Orchestration of service/lcore mappings is not meant to be done by multiple
> threads at the same time. Documenting this loudly may help; I'm happy to send
> a patch to do so if we're agreed on the above?

I completely agree here. Writer-writer concurrency is another topic, and we
should (for now at least) say that the control-plane APIs are not thread-safe.

> >
> >     }
> >
> >     if (enabled)
> >             *enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
> >
> > -   rte_smp_wmb();
> > -
> >     return 0;
> >  }
> >
> > @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
> >             }
> >     }
> >     for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> > -           rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> > +           __atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> > +                   __ATOMIC_RELAXED);
> >
> >     rte_smp_wmb();
> >
> > @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
> >             int32_t enabled = service_mask & (UINT64_C(1) << i);
> >             int32_t service_running = rte_service_runstate_get(i);
> >             int32_t only_core = (1 ==
> > -                   rte_atomic32_read(&rte_services[i].num_mapped_cores));
> > +                   __atomic_load_n(&rte_services[i].num_mapped_cores,
> > +                           __ATOMIC_RELAXED));
> >
> >             /* if the core is mapped, and the service is running, and this
> >              * is the only core that is mapped, the service would cease to
> > --
> > 2.7.4
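As a purely illustrative footnote to the agreement above, and not something
that exists in DPDK (the names control_guard, control_enter and control_exit
are made up), the "control-plane APIs are not thread-safe" contract could be
backed in debug builds by a guard that turns accidental concurrent
orchestration into an assertion rather than a silently corrupted
num_mapped_cores:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t control_guard;  /* 0 = free, 1 = a control-plane call in flight */

static inline void
control_enter(void)
{
        uint32_t expected = 0;
        const bool weak = false;        /* strong CAS: no spurious failures */

        /* ACQUIRE on success; failure means two threads are orchestrating
         * service/lcore mappings at the same time, which is a caller bug.
         */
        bool ok = __atomic_compare_exchange_n(&control_guard, &expected, 1,
                        weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
        assert(ok);
        (void)ok;       /* keep NDEBUG builds warning-free */
}

static inline void
control_exit(void)
{
        /* RELEASE pairs with the ACQUIRE in control_enter(). */
        __atomic_store_n(&control_guard, 0, __ATOMIC_RELEASE);
}

/* Usage sketch: wrap a control-plane entry point, e.g. a mapping update. */
int
service_map_lcore_guarded(uint32_t sid, uint32_t lcore, bool set)
{
        control_enter();
        /* ... the real work, e.g. the service_update() path above ... */
        (void)sid; (void)lcore; (void)set;
        control_exit();
        return 0;
}

A documentation note alone, as offered in (5), is probably the right first
step; a guard like this would only be a belt-and-braces aid for debug builds.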