> -----Original Message-----
> From: Phil Yang <phil.y...@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: tho...@monjalon.net; Van Haaren, Harry <harry.van.haa...@intel.com>;
> Ananyev, Konstantin <konstantin.anan...@intel.com>;
> step...@networkplumber.org; maxime.coque...@redhat.com; dev@dpdk.org
> Cc: david.march...@redhat.com; jer...@marvell.com; hemant.agra...@nxp.com;
> honnappa.nagaraha...@arm.com; gavin...@arm.com; ruifeng.w...@arm.com;
> joyce.k...@arm.com; n...@arm.com
> Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
>
> The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
> APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
> c11 atomic one-way barrier.
>
> Signed-off-by: Phil Yang <phil.y...@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> Reviewed-by: Gavin Hu <gavin...@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Based on discussion on-list, it seems the consensus is to not use the GCC
builtins, but instead use the C11 APIs "proper"? If my conclusion is correct,
the v+1 of this patchset would require updates to that style of API.

Inline comments for context below, -Harry

> ---
>  lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++----------
> -
>  1 file changed, 35 insertions(+), 15 deletions(-)
>
> diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
> index 0843c3c..c033224 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> 	 * running this service callback. When not set, a core may take the
> 	 * lock and then run the service callback.
> 	 */
> -	rte_atomic32_t execute_lock;
> +	uint32_t execute_lock;
>
> 	/* API set/get-able variables */
> 	int8_t app_runstate;
> @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> 	 * It does not indicate the number of cores the service is running
> 	 * on currently.
> 	 */
> -	rte_atomic32_t num_mapped_cores;
> +	int32_t num_mapped_cores;

Any reason why "int32_t" is used for one and "uint32_t" for the other?
execute_lock is a uint32_t above, while num_mapped_cores is an int32_t.

> 	uint64_t calls;
> 	uint64_t cycles_spent;
>  } __rte_cache_aligned;
> @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
> 	rte_smp_rmb();
>
> 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> -	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
> +	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> +		__ATOMIC_RELAXED) > 0);
>
> 	return (s->app_runstate == RUNSTATE_RUNNING) &&
> 		(s->comp_runstate == RUNSTATE_RUNNING) &&
> @@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> 	cs->service_active_on_lcore[i] = 1;
>
> 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> -		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> +		uint32_t expected = 0;
> +		/* ACQUIRE ordering here is to prevent the callback
> +		 * function from hoisting up before the execute_lock
> +		 * setting.
> +		 */
> +		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
> +			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> 			return -EBUSY;

Let's try to improve the magic "1" and "0" constants. I believe the "1" here
is the desired "new value on success", and the "0" is the "bool weak" argument,
where our 0/false constant selects the strongly ordered compare-exchange?

"Weak is true for weak compare_exchange, which may fail spuriously, and false
for the strong variation, which never fails spuriously.", from
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

const uint32_t on_success_value = 1;
const bool weak = 0;
__atomic_compare_exchange_n(&s->execute_lock, &expected, on_success_value,
			    weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

Although a bit more verbose, I feel this documents usage a lot better,
particularly for those who aren't as familiar with the order of the C11
function arguments.

Admittedly, with the API change away from the __builtins, perhaps this
comment is moot.
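
For reference, a minimal standalone sketch of the try-lock pattern under
discussion, with the constants named as above. The helper names and layout
here are illustrative only (they are not part of the patch or of the
rte_service API), and it uses the __atomic builtins as the patch currently
stands:

#include <stdbool.h>
#include <stdint.h>

/* Try to take the per-service execute lock; returns true on success.
 * ACQUIRE on success keeps the callback from being hoisted above the
 * lock acquisition; RELAXED on failure is enough because the caller
 * only returns -EBUSY without touching data protected by the lock.
 */
static inline bool
service_try_execute_lock(uint32_t *execute_lock)
{
	uint32_t expected = 0;               /* lock must currently be free */
	const uint32_t on_success_value = 1; /* value written when we win */
	const bool weak = false;             /* strong CAS, no spurious failure */

	return __atomic_compare_exchange_n(execute_lock, &expected,
			on_success_value, weak,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Drop the lock; RELEASE pairs with the ACQUIRE above so that stores made
 * while holding the lock are visible before the lock reads as free.
 */
static inline void
service_release_execute_lock(uint32_t *execute_lock)
{
	__atomic_store_n(execute_lock, 0, __ATOMIC_RELEASE);
}
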
>
> 		service_runner_do_callback(s, cs, i);
> -		rte_atomic32_clear(&s->execute_lock);
> +		/* RELEASE ordering here is used to pair with ACQUIRE
> +		 * above to achieve lock semantic.
> +		 */
> +		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
> 	} else
> 		service_runner_do_callback(s, cs, i);
>
> @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
> 	/* Increment num_mapped_cores to indicate that the service
> 	 * is running on a core.
> 	 */
> -	rte_atomic32_inc(&s->num_mapped_cores);
> +	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
>
> 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
>
> -	rte_atomic32_dec(&s->num_mapped_cores);
> +	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
>
> 	return ret;
>  }
> @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
>
> 	uint64_t sid_mask = UINT64_C(1) << sid;
> 	if (set) {
> -		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> -			sid_mask;
> +		/* When multiple threads try to update the same lcore
> +		 * service concurrently, e.g. set lcore map followed
> +		 * by clear lcore map, the unsynchronized service_mask
> +		 * values have issues on the num_mapped_cores value
> +		 * consistency. So we use ACQUIRE ordering to pair with
> +		 * the RELEASE ordering to synchronize the service_mask.
> +		 */
> +		uint64_t lcore_mapped = __atomic_load_n(
> +			&lcore_states[lcore].service_mask,
> +			__ATOMIC_ACQUIRE) & sid_mask;

Thanks for the comment - it helps me understand things a bit better.
Some questions/theories to validate:

1) The service_mask ACQUIRE avoids other loads being hoisted above it, correct?

2) There are non-atomic stores to service_mask. Is it correct that the stores
themselves aren't the issue, but rather the relative visibility of the
service_mask stores vs num_mapped_cores? (Detail in (3) below.)

> 		if (*set && !lcore_mapped) {
> 			lcore_states[lcore].service_mask |= sid_mask;
> -			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> +			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> +				1, __ATOMIC_RELEASE);
> 		}
> 		if (!*set && lcore_mapped) {
> 			lcore_states[lcore].service_mask &= ~(sid_mask);
> -			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> +			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> +				1, __ATOMIC_RELEASE);
> 		}

3) Here we update the core-local service_mask, and then update
num_mapped_cores with an __ATOMIC_RELEASE. The RELEASE here ensures that the
previous store to service_mask is guaranteed to be visible on all cores if
this store is visible. Why do we care about this property? The service_mask
is core-local anyway.

4) Even with the load-ACQ of service_mask and the REL store to
num_mapped_cores, is there not still a possible race condition where two
lcores simultaneously load-ACQ the service_mask, and then both do an atomic
add/sub_fetch with REL?

5) Assuming the race in (4) is real, it raises the real question: the
service-cores control APIs are not designed to be multi-thread-safe.
Orchestration of service/lcore mappings is not meant to be done by multiple
threads at the same time. Documenting this loudly may help - I'm happy to
send a patch to do so if we're agreed on the above?
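
To make the race in (4) concrete, below is a minimal standalone sketch of the
kind of interleaving in question. The globals and the thread function are
simplified stand-ins for service_mask / num_mapped_cores and the set path of
service_update(), not the actual rte_service code, and two threads concurrently
setting the same mapping stand in for the concurrent control-plane calls
(compile with -pthread):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the fields discussed above. */
static uint64_t service_mask;
static int32_t num_mapped_cores;

static void *
set_mapping(void *arg)
{
	const uint64_t sid_mask = UINT64_C(1) << 0;
	(void)arg;

	/* Both threads can load-ACQ the mask before either has set the
	 * bit, so both observe "not mapped" ...
	 */
	uint64_t lcore_mapped = __atomic_load_n(&service_mask,
			__ATOMIC_ACQUIRE) & sid_mask;
	if (!lcore_mapped) {
		/* ... and both take this branch, incrementing the counter
		 * twice for what is logically a single new mapping.
		 */
		service_mask |= sid_mask;
		__atomic_add_fetch(&num_mapped_cores, 1, __ATOMIC_RELEASE);
	}
	return NULL;
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, set_mapping, NULL);
	pthread_create(&t2, NULL, set_mapping, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* With an unlucky interleaving this prints 2, even though only one
	 * mapping bit is set - the inconsistency described in (4).
	 */
	printf("num_mapped_cores = %d\n", (int)num_mapped_cores);
	return 0;
}
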
> 	}
>
> 	if (enabled)
> 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
>
> -	rte_smp_wmb();
> -
> 	return 0;
>  }
>
> @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
> 		}
> 	}
> 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> -		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> +		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> +			__ATOMIC_RELAXED);
>
> 	rte_smp_wmb();
>
> @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
> 		int32_t enabled = service_mask & (UINT64_C(1) << i);
> 		int32_t service_running = rte_service_runstate_get(i);
> 		int32_t only_core = (1 ==
> -			rte_atomic32_read(&rte_services[i].num_mapped_cores));
> +			__atomic_load_n(&rte_services[i].num_mapped_cores,
> +				__ATOMIC_RELAXED));
>
> 		/* if the core is mapped, and the service is running, and this
> 		 * is the only core that is mapped, the service would cease to
> --
> 2.7.4