> -----Original Message-----
> From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> Sent: Monday, January 28, 2019 4:36 AM
> To: jer...@marvell.com; mcze...@marvell.com; Eads, Gage
> <gage.e...@intel.com>; dev@dpdk.org
> Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd
> <n...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> arybche...@solarflare.com; Ananyev, Konstantin
> <konstantin.anan...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
>
> On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> >
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org
> > > Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd
> > > <n...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> > > arybche...@solarflare.com; Ananyev, Konstantin
> > > <konstantin.anan...@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > implementation
> > >
> > > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > >
> > > > Hi Ola,
> > > >
> > > > <snip>
> > > >
> > > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > > > rte_ring *r);
> > > > > >  #endif
> > > > > >  #include "rte_ring_generic_64.h"
> > > > > >
> > > > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > > > +struct nb_ring_entry {
> > > > > > +	void *ptr; /**< Data pointer */
> > > > > > +	uint64_t cnt; /**< Modification counter */
> > > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > > also be supported. I think there are some claims that DPDK still
> > > > > supports e.g. ARMv7a and possibly also 32-bit x86?
> > > > I chose a 64-bit modification counter because (practically
> > > > speaking) the ABA problem will not occur with such a large counter
> > > > -- definitely not within my lifetime. See the "Discussion" section
> > > > of the commit message for more information.
> > > >
> > > > With a 32-bit counter, there is a very (very) low likelihood of
> > > > it, but it is possible. Personally, I don't feel comfortable
> > > > providing such code, because a) I doubt all users would understand
> > > > the implementation well enough to do the risk/reward analysis, and
> > > > b) such a bug would be near impossible to reproduce and root-cause
> > > > if it did occur.
> > > With a 64-bit counter (and 32-bit pointer), 32-bit architectures
> > > (e.g. ARMv7a and probably x86 as well) won't be able to support this
> > > as they at best support 64-bit CAS (ARMv7a has LDREXD/STREXD). So
> > > you are essentially putting a 64-bit (and 128-bit CAS) requirement
> > > on the implementation.
> > >
> > Yes, I am. I tried to make that clear in the cover letter.
> >
> > > > > > +};
> > > > > > +
> > > > > > +/* The non-blocking ring algorithm is based on the original
> > > > > > + * rte ring (derived from FreeBSD's bufring.h) and inspired
> > > > > > + * by Michael and Scott's non-blocking concurrent queue.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * @internal
> > > > > > + * Enqueue several objects on the non-blocking ring
> > > > > > + * (single-producer only)
> > > > > > + *
> > > > > > + * @param r
> > > > > > + *   A pointer to the ring structure.
> > > > > > + * @param obj_table
> > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > + * @param n
> > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > + * @param behavior
> > > > > > + *   RTE_RING_QUEUE_FIXED: Enqueue a fixed number of items
> > > > > > + *   to the ring
> > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as
> > > > > > + *   possible to the ring
> > > > > > + * @param free_space
> > > > > > + *   returns the amount of space after the enqueue operation
> > > > > > + *   has finished
> > > > > > + * @return
> > > > > > + *   Actual number of objects enqueued.
> > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > > + */
> > > > > > +static __rte_always_inline unsigned int
> > > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > > > +			    unsigned int n,
> > > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > > +			    unsigned int *free_space)
> > > > > > +{
> > > > > > +	uint32_t free_entries;
> > > > > > +	size_t head, next;
> > > > > > +
> > > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > > +					 &head, &next, &free_entries);
> > > > > > +	if (n == 0)
> > > > > > +		goto end;
> > > > > > +
> > > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > > +
> > > > > > +	r->prod_64.tail += n;
> > > > > Don't we need release order when (or smp_wmb between) writing of
> > > > > the ring pointers and the update of tail? By updating the tail
> > > > > pointer, we are synchronising with a consumer.
> > > > >
> > > > > I prefer using __atomic operations even for load and store. You
> > > > > can see which parts of the code synchronise with each other,
> > > > > e.g. a store-release to some location synchronises with a
> > > > > load-acquire from the same location. If you don't know how
> > > > > different threads synchronise with each other, you are very
> > > > > likely to make mistakes.
> > > > >
> > > > You can tell this code was written when I thought x86-64 was the
> > > > only viable target :). Yes, you are correct.
> > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > a similar approach to the functions duplicated in
> > > > rte_ring_generic.h and rte_ring_c11_mem.h: one version that uses
> > > > rte_atomic functions (and thus stricter memory ordering) and one
> > > > that uses __atomic intrinsics (and thus can benefit from more
> > > > relaxed memory ordering).
> From a code point of view, I strongly prefer the atomic operations to
> be visible in the top-level code, not hidden in subroutines. For
> correctness, it is vital that memory accesses are performed with the
> required ordering and that acquire and release match up. Hiding e.g.
> load-acquire and store-release in subroutines (in a different file!)
> makes this difficult. There have already been such bugs found in
> rte_ring.
>
After working on the acq/rel ordering this weekend, I agree. This'll be
easier/cleaner if we end up only using the C11 version.

> > > What's the advantage of having two different implementations? What
> > > is the disadvantage?
> > >
> > > The existing ring buffer code originally had only the "legacy"
> > > implementation, which was kept when the __atomic implementation was
> > > added. The claimed reason was that some older compilers for x86 do
> > > not support the GCC __atomic builtins. But I thought there was
> > > consensus that new functionality could have only __atomic
> > > implementations.
> > >
> > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > disabled for thunderx[1] for performance reasons. Assuming that
> > hasn't changed, the advantage of having two versions is to best
> > support all of DPDK's platforms. The disadvantage is of course
> > duplicated code and the additional maintenance burden.
> The only way I see that a C11 memory model implementation can be
> slower than using smp_wmb/rmb is if you need to order loads before a
> synchronizing store and there are also outstanding stores which do not
> require ordering. smp_rmb() handles this, while store-release will
> also (unnecessarily) order those outstanding stores. This situation
> occurs e.g. in ring buffer dequeue operations, where ring slots are
> read (and possibly written to thread-private memory) before the ring
> slots are released (e.g. using CAS-release or store-release).
>
> I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also has
> something to do with this problem. If there is a large number of
> stores pending in the load/store unit, store-release might have to
> wait for a long time before the synchronizing store can complete.
>
> > That said, if the thunderx maintainers are ok with it, I'm certainly
> > open to only doing the __atomic version.
> > Note that even in the __atomic version, based on Honnappa's
> > findings[2], using a DPDK-defined rte_atomic128_cmpset() (with
> > additional arguments to support machines with weak consistency)
> > appears to be a better option than __atomic_compare_exchange_16.
> __atomic_compare_exchange_16() is not guaranteed to be lock-free. It
> is not lock-free on ARM/AArch64, and the support in GCC is formally
> broken (cmpxchg16b can't be used to implement __atomic_load_16).
>
> So yes, I think DPDK will have to define and implement the 128-bit
> atomic compare-and-exchange operation (whatever it will be called).
> For compatibility with ARMv8.0, we can't require the "old" value
> returned by a failed compare-exchange operation to be read atomically
> (LDXP does not guarantee atomicity by itself). But this is seldom a
> problem; many designs read the memory location using two separate
> 64-bit loads (so not atomic) anyway. It is a successful atomic
> compare-exchange operation that provides atomicity.
>
Ok. I agree, I don't expect that to be a problem. The 128-bit CAS patch
I just submitted[1] (which was developed before reading this) will have
to be changed.

[1] http://mails.dpdk.org/archives/dev/2019-January/124159.html

Thanks,
Gage