<snip>
Just to get us on the same page on 'techniques to communicate data from a 
writer to a reader' (apologies if this is too trivial).

Let us say that the writer has 512B of data (the key point being that it cannot 
be written atomically) that needs to be communicated to the reader.

Since the data cannot be written atomically, we need a guard variable that can 
be written atomically (it can be a flag or a pointer to the data). The writer 
stores the 512B non-atomically and then writes the guard variable with RELEASE 
memory order. So, if the guard variable is valid (set in the case of a flag, 
non-NULL in the case of a pointer), it is guaranteed that the 512B has been 
written.

The reader reads the guard variable with ACQUIRE memory order and reads the 
512B of data only if the guard variable is valid. The ACQUIRE memory order on 
the guard variable guarantees that the load of the 512B does not happen before 
the guard variable is loaded. The validity check on the guard variable 
guarantees that the 512B was written before it is read.

The store(guard_variable, RELEASE) in the writer and the load(guard_variable, 
ACQUIRE) in the reader are said to synchronize with each other.

(the guard variable technique applies even if we are not using C11 atomics)
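A minimal sketch of the guard-variable technique using GCC's __atomic builtins 
(the names payload, guard, writer_fn, reader_fn and run_demo are hypothetical, 
for illustration only):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

static uint8_t payload[512]; /* 512B - cannot be written atomically */
static int guard;            /* guard flag - can be written atomically */

static void *writer_fn(void *arg)
{
	(void)arg;
	/* non-atomic store of the 512B */
	memset(payload, 0xab, sizeof(payload));
	/* RELEASE: the payload stores cannot be reordered after this store */
	__atomic_store_n(&guard, 1, __ATOMIC_RELEASE);
	return NULL;
}

static void *reader_fn(void *arg)
{
	(void)arg;
	/* ACQUIRE: the payload loads cannot be reordered before this load */
	while (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 0)
		; /* spin until the guard variable is valid */
	/* the validity of 'guard' guarantees the 512B is fully written */
	for (size_t i = 0; i < sizeof(payload); i++)
		assert(payload[i] == 0xab);
	return NULL;
}

int run_demo(void)
{
	pthread_t w, r;

	pthread_create(&r, NULL, reader_fn, NULL);
	pthread_create(&w, NULL, writer_fn, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}
```

The store(guard, RELEASE) and load(guard, ACQUIRE) above synchronize with each 
other, which is exactly the pattern the patch applies to the runstate fields.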

Now let us say that the writer has 4B of data (the key point being that it can 
be written atomically) that needs to be communicated to the reader. The writer 
is free to write this atomically with no constraints on memory ordering, as 
long as the data is not acting as a guard variable for any other data.
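For the 4B case, a sketch of what "atomic but unordered" looks like (the names 
write_value and read_value are hypothetical):

```c
#include <stdint.h>

uint32_t value; /* 4B - can be written atomically */

/* Writer: the store itself is atomic; RELAXED is sufficient as long as
 * 'value' is not guarding any other data. */
void write_value(uint32_t v)
{
	__atomic_store_n(&value, v, __ATOMIC_RELAXED);
}

/* Reader: sees either the old or the new 4B value, never a torn mix. */
uint32_t read_value(void)
{
	return __atomic_load_n(&value, __ATOMIC_RELAXED);
}
```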

In my understanding, the sequence of API calls to start a service (writer side) 
is as follows:
1) rte_service_init
2) rte_service_component_register
3) <possible configuration of the service>
4) rte_service_component_runstate_set (the reader is allowed at this point to 
read the information about the service - written by 
rte_service_component_register API. This API should not be called before 
rte_service_component_register)
5) <possible configuration of the service>
6) rte_service_runstate_set (the reader is allowed at this point to read the 
information about the service - written by rte_service_component_register API 
and run the service. This API can be called anytime. But, the reader should not 
attempt to run the service before this API is called)
7) rte_service_lcore_add (probably multiple of these; they can be called 
earlier in this sequence, but not later)
8) rte_service_map_lcore_set (this can be called anytime. Can be called even if 
the service is not registered)
9) rte_service_lcore_start (again, this can be called anytime, even before the 
service is registered)

So, there are 2 guard variables: 'comp_runstate' and 'app_runstate'. Only these 
two need RELEASE ordering in the writer and ACQUIRE ordering in the reader.

We can write test cases with different orders of these API calls to prove that 
the memory orders we use are sufficient.

A few comments are inline based on this assessment.

> Subject: RE: [PATCH v3 12/12] service: relax barriers with C11 atomic
> operations
> 
> > From: Phil Yang <phil.y...@arm.com>
> > Sent: Tuesday, March 17, 2020 1:18 AM
> > To: tho...@monjalon.net; Van Haaren, Harry
> > <harry.van.haa...@intel.com>; Ananyev, Konstantin
> > <konstantin.anan...@intel.com>; step...@networkplumber.org;
> > maxime.coque...@redhat.com; dev@dpdk.org
> > Cc: david.march...@redhat.com; jer...@marvell.com;
> > hemant.agra...@nxp.com; honnappa.nagaraha...@arm.com;
> > gavin...@arm.com; ruifeng.w...@arm.com; joyce.k...@arm.com;
> n...@arm.com
> > Subject: [PATCH v3 12/12] service: relax barriers with C11 atomic
> > operations
> >
> > To guarantee the inter-threads visibility of the shareable domain, it
> > uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
> > these barriers for service by using c11 atomic one-way barrier operations.
> >
> > Signed-off-by: Phil Yang <phil.y...@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > Reviewed-by: Gavin Hu <gavin...@arm.com>
> > ---
> >  lib/librte_eal/common/rte_service.c | 45
> > ++++++++++++++++++++----------------
> > -
> >  1 file changed, 25 insertions(+), 20 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index c033224..d31663e 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -179,9 +179,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t
> > enabled)
> >     SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> >
> >     if (enabled)
> > -           s->internal_flags |= SERVICE_F_STATS_ENABLED;
> > +           __atomic_or_fetch(&s->internal_flags,
> SERVICE_F_STATS_ENABLED,
> > +                   __ATOMIC_RELEASE);
> >     else
> > -           s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
> > +           __atomic_and_fetch(&s->internal_flags,
> > +                   ~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
> 
> Not sure why these have to become stores with RELEASE memory ordering?
> (More occurances of same Q below, just answer here?)
Agree. 'internal_flags' is not acting as a guard variable, so this should be 
RELAXED (similarly for the others below). Though I suggest keeping the accesses 
atomic.
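To illustrate the suggestion (keep the read-modify-write atomic, drop the 
ordering), a self-contained sketch; the flag name and set_stats helper are 
hypothetical, not the library code:

```c
#include <stdint.h>

#define F_STATS_ENABLED (1 << 0)

uint8_t flags; /* stands in for s->internal_flags */

/* An atomic RMW avoids losing concurrent updates to other bits, but since
 * 'flags' does not guard any other data, RELAXED ordering is sufficient. */
void set_stats(int enabled)
{
	if (enabled)
		__atomic_or_fetch(&flags, F_STATS_ENABLED, __ATOMIC_RELAXED);
	else
		__atomic_and_fetch(&flags, (uint8_t)~F_STATS_ENABLED,
				   __ATOMIC_RELAXED);
}
```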

> 
> >     return 0;
> >  }
> > @@ -193,9 +195,11 @@ rte_service_set_runstate_mapped_check(uint32_t
> > id, int32_t enabled)
> >     SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> >
> >     if (enabled)
> > -           s->internal_flags |= SERVICE_F_START_CHECK;
> > +           __atomic_or_fetch(&s->internal_flags,
> SERVICE_F_START_CHECK,
> > +                   __ATOMIC_RELEASE);
> >     else
> > -           s->internal_flags &= ~(SERVICE_F_START_CHECK);
> > +           __atomic_and_fetch(&s->internal_flags,
> ~(SERVICE_F_START_CHECK),
> > +                   __ATOMIC_RELEASE);
> 
> Same as above, why do these require RELEASE?
Agree

> 
> 
> Remainder of patch below seems to make sense - there's a wmb() involved
> hence RELEASE m/o.
> 
> >     return 0;
> >  }
> > @@ -264,8 +268,8 @@ rte_service_component_register(const struct
> > rte_service_spec *spec,
> >     s->spec = *spec;
> >     s->internal_flags |= SERVICE_F_REGISTERED |
> SERVICE_F_START_CHECK;
> >
> > -   rte_smp_wmb();
> > -   rte_service_count++;
> > +   /* make sure the counter update after the state change. */
> > +   __atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
> 
> This makes sense to me - the RELEASE ensures that previous stores to the
> s->internal_flags are visible to other cores before rte_service_count
> increments atomically.
rte_service_count is not a guard variable and does not need RELEASE ordering. 
It is also not used by the reader; it looks like it is just a statistic being 
maintained.

> 
> 
> >     if (id_ptr)
> >             *id_ptr = free_slot;
> > @@ -281,9 +285,10 @@ rte_service_component_unregister(uint32_t id)
> >     SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >     rte_service_count--;
> > -   rte_smp_wmb();
> >
> > -   s->internal_flags &= ~(SERVICE_F_REGISTERED);
> > +   /* make sure the counter update before the state change. */
> > +   __atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
> > +                      __ATOMIC_RELEASE);
RELAXED is enough.

> >
> >     /* clear the run-bit in all cores */
> >     for (i = 0; i < RTE_MAX_LCORE; i++)
> > @@ -301,11 +306,12 @@ rte_service_component_runstate_set(uint32_t id,
> > uint32_t
> > runstate)
> >     SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >     if (runstate)
> > -           s->comp_runstate = RUNSTATE_RUNNING;
> > +           __atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
> > +                           __ATOMIC_RELEASE);
> >     else
> > -           s->comp_runstate = RUNSTATE_STOPPED;
> > +           __atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
> > +                           __ATOMIC_RELEASE);
Here we need a thread fence to prevent the memory operations from a subsequent 
call to 'rte_service_component_unregister' from being hoisted above this store. 
The user should be forced to call rte_service_component_runstate_set (with 
RUNSTATE_STOPPED) before calling rte_service_component_unregister. I suggest 
adding a check in rte_service_component_unregister to ensure that the state is 
RUNSTATE_STOPPED. In fact, the user needs to make sure that the service is 
stopped before calling rte_service_component_unregister.
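A sketch of the fence plus the suggested check (the function names and the 
'registered' variable are hypothetical, not the actual library code; a RELEASE 
store only keeps earlier accesses from moving after it, while the fence also 
keeps later accesses from moving before it):

```c
#include <stdint.h>

enum { RUNSTATE_STOPPED, RUNSTATE_RUNNING };

uint32_t comp_runstate = RUNSTATE_RUNNING;
uint32_t registered = 1;

void component_runstate_stop(void)
{
	__atomic_store_n(&comp_runstate, RUNSTATE_STOPPED, __ATOMIC_RELEASE);
	/* Fence: stores in a subsequent unregister call cannot be hoisted
	 * above the runstate store. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

int component_unregister(void)
{
	/* Suggested check: refuse to unregister a running service. */
	if (__atomic_load_n(&comp_runstate, __ATOMIC_ACQUIRE) !=
			RUNSTATE_STOPPED)
		return -1;
	registered = 0;
	return 0;
}
```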

> >
> > -   rte_smp_wmb();
> >     return 0;
> >  }
> >
> >
> > @@ -316,11 +322,12 @@ rte_service_runstate_set(uint32_t id, uint32_t
> runstate)
> >     SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >     if (runstate)
> > -           s->app_runstate = RUNSTATE_RUNNING;
> > +           __atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
> > +                           __ATOMIC_RELEASE);
> >     else
> > -           s->app_runstate = RUNSTATE_STOPPED;
> > +           __atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
> > +                           __ATOMIC_RELEASE);
> >
> > -   rte_smp_wmb();
> >     return 0;
> >  }
> >
> > @@ -442,7 +449,8 @@ service_runner_func(void *arg)
> >     const int lcore = rte_lcore_id();
> >     struct core_state *cs = &lcore_states[lcore];
> >
> > -   while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
> > +   while (__atomic_load_n(&cs->runstate,
> > +               __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
This can be RELAXED; the lcore's runstate is not acting as a guard variable 
here. However, note that the writer thread wants to communicate the 'runstate' 
(4B) to the reader thread, and that ordering needs to be handled in the 
'rte_eal_remote_launch' and 'eal_thread_loop' functions. Also note that in some 
other use case the writer may want to communicate more than 4B to the reader. 
Currently, the 'write' and 'read' system calls may have enough barriers to make 
things work fine, but I suggest using 'lcore_config[lcore].f' as the guard 
variable to make this explicit and not depend on 'write' and 'read'. We can 
take up the EAL changes in a later patch as this does not cause any issues 
right now.

> >             const uint64_t service_mask = cs->service_mask;
> >
> >             for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) { @@ -453,8
> +461,6 @@
> > service_runner_func(void *arg)
> >             }
> >
> >             cs->loops++;
> > -
> > -           rte_smp_rmb();
> >     }
> >
> >     lcore_config[lcore].state = WAIT;
> > @@ -663,9 +669,8 @@ rte_service_lcore_add(uint32_t lcore)
> >
> >     /* ensure that after adding a core the mask and state are defaults */
> >     lcore_states[lcore].service_mask = 0;
> > -   lcore_states[lcore].runstate = RUNSTATE_STOPPED;
> > -
> > -   rte_smp_wmb();
Agree.

> > +   __atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
> > +                   __ATOMIC_RELEASE);
This can be RELAXED.

> >
> >     return rte_eal_wait_lcore(lcore);
> >  }
> > --
> > 2.7.4
