On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
> > Sent: Wednesday, January 23, 2019 4:16 AM
> > To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org
> > Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd <n...@arm.com>;
> > Richardson, Bruce <bruce.richard...@intel.com>; arybche...@solarflare.com;
> > Ananyev, Konstantin <konstantin.anan...@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > implementation
> >
> > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > Hi Ola,
> > >
> > > <snip>
> > >
> > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > > rte_ring *r);
> > > > >  #endif
> > > > >  #include "rte_ring_generic_64.h"
> > > > >
> > > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > > +struct nb_ring_entry {
> > > > > +	void *ptr; /**< Data pointer */
> > > > > +	uint64_t cnt; /**< Modification counter */
> > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will also
> > > > be supported. I think there are some claims that DPDK still supports
> > > > e.g. ARMv7a and possibly also 32-bit x86?
> > > I chose a 64-bit modification counter because (practically speaking)
> > > the ABA problem will not occur with such a large counter -- definitely
> > > not within my lifetime. See the "Discussion" section of the commit
> > > message for more information.
> > >
> > > With a 32-bit counter, there is a very (very) low likelihood of it, but
> > > it is possible. Personally, I don't feel comfortable providing such
> > > code, because a) I doubt all users would understand the implementation
> > > well enough to do the risk/reward analysis, and b) such a bug would be
> > > near impossible to reproduce and root-cause if it did occur.
> > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > ARMv7a and probably x86 as well) won't be able to support this, as they
> > at best support 64-bit CAS (ARMv7a has LDREXD/STREXD). So you are
> > essentially putting a 64-bit (and 128-bit CAS) requirement on the
> > implementation.
> Yes, I am. I tried to make that clear in the cover letter.
>
> > > > > +};
> > > > > +
> > > > > +/* The non-blocking ring algorithm is based on the original rte ring
> > > > > + * (derived from FreeBSD's bufring.h) and inspired by Michael and
> > > > > + * Scott's non-blocking concurrent queue.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * @internal
> > > > > + * Enqueue several objects on the non-blocking ring (single-producer only)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED: Enqueue a fixed number of items to the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > > + * @return
> > > > > + *   Actual number of objects enqueued.
> > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > + */
> > > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > > +			    unsigned int n,
> > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > +			    unsigned int *free_space)
> > > > > +{
> > > > > +	uint32_t free_entries;
> > > > > +	size_t head, next;
> > > > > +
> > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > +					 &head, &next, &free_entries);
> > > > > +	if (n == 0)
> > > > > +		goto end;
> > > > > +
> > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > +
> > > > > +	r->prod_64.tail += n;
> > > > Don't we need release order (or an smp_wmb) between writing the ring
> > > > pointers and the update of tail? By updating the tail pointer, we are
> > > > synchronising with a consumer.
> > > >
> > > > I prefer using __atomic operations even for load and store. You can
> > > > then see which parts of the code synchronise with each other, e.g. a
> > > > store-release to some location synchronises with a load-acquire from
> > > > the same location. If you don't know how different threads synchronise
> > > > with each other, you are very likely to make mistakes.
> > > You can tell this code was written when I thought x86-64 was the only
> > > viable target :). Yes, you are correct.
> > >
> > > With regards to using __atomic intrinsics, I'm planning on taking a
> > > similar approach to the functions duplicated in rte_ring_generic.h and
> > > rte_ring_c11_mem.h: one version that uses rte_atomic functions (and thus
> > > stricter memory ordering) and one that uses __atomic intrinsics (and
> > > thus can benefit from more relaxed memory ordering).
From a code point of view, I strongly prefer the atomic operations to be
visible in the top-level code, not hidden in subroutines. For correctness, it
is vital that memory accesses are performed with the required ordering and
that acquire and release match up. Hiding e.g. load-acquire and store-release
in subroutines (in a different file!) makes this difficult. There have
already been such bugs found in rte_ring.
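
To make that concrete, this is roughly what I mean for the SP enqueue above.
A sketch only, against the v3 names, not compiled or tested -- the point is
that the synchronisation is visible at the top level:

	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
					 &head, &next, &free_entries);
	if (n == 0)
		goto end;

	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);

	/* Single producer, so we own prod_64.tail and a plain read of it is
	 * fine. The store-release orders the slot writes in ENQUEUE_PTRS_NB()
	 * before the new tail value becomes visible, and pairs with a
	 * load-acquire of prod_64.tail on the consumer side.
	 */
	__atomic_store_n(&r->prod_64.tail, r->prod_64.tail + n,
			 __ATOMIC_RELEASE);

Whoever reads this can match the release with the consumer's acquire without
opening another file.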
> > What's the advantage of having two different implementations? What is
> > the disadvantage?
> >
> > The existing ring buffer code originally had only the "legacy"
> > implementation, which was kept when the __atomic implementation was
> > added. The reason claimed was that some older compilers for x86 do not
> > support GCC __atomic builtins. But I thought there was consensus that new
> > functionality could have only __atomic implementations.
> When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> disabled for thunderx[1] for performance reasons. Assuming that hasn't
> changed, the advantage to having two versions is to best support all of
> DPDK's platforms. The disadvantage is of course duplicated code and the
> additional maintenance burden.
The only way I see that a C11 memory model implementation can be slower than
using smp_wmb/rmb is if you need to order loads before a synchronising store
and there are also outstanding stores which do not require ordering.
smp_rmb() handles this, while a store-release will also (unnecessarily) order
those outstanding stores. This situation occurs e.g. in ring buffer dequeue
operations, where ring slots are read (and possibly written to thread-private
memory) before the ring slots are released (e.g. using CAS-release or
store-release).

I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also has
something to do with this problem. If there is a large number of stores
pending in the load/store unit, a store-release might have to wait for a long
time before the synchronising store can complete.

> That said, if the thunderx maintainers are ok with it, I'm certainly open
> to only doing the __atomic version. Note that even in the __atomic
> version, based on Honnappa's findings[2], using a DPDK-defined
> rte_atomic128_cmpset() (with additional arguments to support machines with
> weak consistency) appears to be a better option than
> __atomic_compare_exchange_16.
__atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
lock-free on ARM/AArch64, and the support in GCC is formally broken (you
can't use cmpxchg16b to implement __atomic_load_16). So yes, I think DPDK
will have to define and implement the 128-bit atomic compare and exchange
operation (whatever it will be called). For compatibility with ARMv8.0, we
can't require the "old" value returned by a failed compare-exchange operation
to be read atomically (LDXP does not guarantee atomicity by itself). But this
is seldom a problem; many designs read the memory location using two separate
64-bit loads (so not atomic) anyway, and it is a successful atomic
compare-exchange operation which provides atomicity.

> I couldn't find the discussion about new functionality using __atomic
> going forward -- can you send a link?
>
> [1] https://mails.dpdk.org/archives/dev/2017-December/082853.html
> [2] http://mails.dpdk.org/archives/dev/2019-January/124002.html
>
> > Does the non-blocking ring buffer implementation have to support these
> > older compilers? Will the applications that require these older
> > compilers be updated to utilise the non-blocking ring buffer?
> (See above -- compiler versions weren't a consideration here.)
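
Two sketches to illustrate what I mean. First the dequeue ordering point, in
the single-consumer case (v3 naming assumed, including a DEQUEUE_PTRS_NB
counterpart to ENQUEUE_PTRS_NB and a cons_64 structure; not compiled or
tested):

	/* C11 style: the store-release orders the slot loads in
	 * DEQUEUE_PTRS_NB() before the tail update, but it also orders all
	 * earlier stores, which is not required here.
	 */
	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
	__atomic_store_n(&r->cons_64.tail, r->cons_64.tail + n,
			 __ATOMIC_RELEASE);

	/* Barrier style: order only the slot loads before the tail store */
	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
	rte_smp_rmb();
	r->cons_64.tail += n;

Second, the 128-bit CAS. On x86-64 it comes down to LOCK cmpxchg16b; below is
a sketch of the primitive (the name, struct and signature are placeholders,
not a proposal for the final API; the memory-order arguments Gage mentions
are omitted because LOCK-prefixed instructions are full barriers on x86
anyway -- they matter for the ARMv8.0 LDXP/STXP version):

#include <stdint.h>

struct int128_s {
	uint64_t lo;
	uint64_t hi;
};

/* Compare *dst with *exp; if equal, store *src to *dst and return 1.
 * Otherwise write the observed value back to *exp and return 0.
 * dst must be 16-byte aligned or cmpxchg16b will fault.
 */
static inline int
atomic128_cmpset_sketch(struct int128_s *dst, struct int128_s *exp,
			const struct int128_s *src)
{
	uint8_t res;

	asm volatile ("lock; cmpxchg16b %[dst];"
		      " sete %[res]"
		      : [dst] "+m" (*dst),
			"+a" (exp->lo), /* RDX:RAX holds the expected value */
			"+d" (exp->hi),
			[res] "=r" (res)
		      : "b" (src->lo), /* RCX:RBX holds the desired value */
			"c" (src->hi)
		      : "memory");

	return res;
}

Note that on failure cmpxchg16b writes the observed value to RDX:RAX
atomically -- exactly the guarantee we cannot make on ARMv8.0, hence my point
about not requiring an atomic read-back in the API.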
> > > > > +
> > > > > +end:
> > > > > +	if (free_space != NULL)
> > > > > +		*free_space = free_entries - n;
> > > > > +	return n;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * @internal
> > > > > + * Enqueue several objects on the non-blocking ring (multi-producer safe)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED: Enqueue a fixed number of items to the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > > + * @return
> > > > > + *   Actual number of objects enqueued.
> > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > + */
> > > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > > > > +			    unsigned int n,
> > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > +			    unsigned int *free_space)
> > > > > +{
> > > > > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > > > > +	RTE_SET_USED(r);
> > > > > +	RTE_SET_USED(obj_table);
> > > > > +	RTE_SET_USED(n);
> > > > > +	RTE_SET_USED(behavior);
> > > > > +	RTE_SET_USED(free_space);
> > > > > +#ifndef ALLOW_EXPERIMENTAL_API
> > > > > +	printf("[%s()] RING_F_NB requires an experimental API."
> > > > > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n",
> > > > > +	       __func__);
> > > > > +#endif
> > > > > +	return 0;
> > > > > +#endif
> > > > > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > > > > +	size_t head, next, tail;
> > > > > +	uint32_t free_entries;
> > > > > +	unsigned int i;
> > > > > +
> > > > > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > > > > +					 &head, &next, &free_entries);
> > > > > +	if (n == 0)
> > > > > +		goto end;
> > > > > +
> > > > > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > > > > +		struct nb_ring_entry old_value, new_value;
> > > > > +		struct nb_ring_entry *ring_ptr;
> > > > > +
> > > > > +		/* Enqueue to the tail entry. If another thread wins the
> > > > > +		 * race, retry with the new tail.
> > > > > +		 */
> > > > > +		tail = r->prod_64.tail;
> > > > > +
> > > > > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
> > > > This is an ugly expression and cast. Also I think it is unnecessary.
> > > > What's preventing this from being written without a cast? Perhaps the
> > > > ring array needs to be a union of "void *" and struct nb_ring_entry?
> > > The cast is necessary for the correct pointer arithmetic (let
> > > "uintptr_t base == &r[1]"):
> > Yes I know the C language.
> >
> > > - With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
> > > - W/o cast:  ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);
> > >
> > > FWIW, this is essentially the same as is done with the second argument
> > > (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across
> > > multiple lines of code.
> > > The equivalent here would be:
> > >
> > >   struct nb_ring_entry *ring_base = (struct nb_ring_entry *)&r[1];
> > >   ring_ptr = &ring_base[tail & r->mask];
> > >
> > > Which is more legible, I think.
> > The RTE ring buffer code is not very legible to start with.
> >
> > > There is no ring array structure in which to add a union; the ring
> > > array is a contiguous chunk of memory that immediately follows after
> > > the end of a struct rte_ring. We interpret the memory there according
> > > to the ring entry data type (void * for regular rings and struct
> > > nb_ring_entry for non-blocking rings).
> > My worry is that we are abusing the C language and creating a monster of
> > fragile C code that will be more and more difficult to understand and to
> > maintain. At some point you have to ask yourself the question "Are we
> > doing the right thing?".
> I'm not aware of any fragility/maintainability issues in the ring code
> (though perhaps the maintainers have a different view!), and personally I
> find the code fairly legible. If you have a specific suggestion, I'll look
> into incorporating it.
>
> Thanks,
> Gage
>
> </snip>
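
One concrete suggestion, then: keep the reinterpretation of the ring array in
a single named helper instead of repeating the cast at every access site. A
sketch (the helper name is mine, not from the patch):

	/* The array of 128-bit entries lives immediately after the
	 * rte_ring header; state that fact in exactly one place.
	 */
	static inline struct nb_ring_entry *
	__rte_ring_nb_ring(struct rte_ring *r)
	{
		return (struct nb_ring_entry *)&r[1];
	}

	...
	ring_ptr = &__rte_ring_nb_ring(r)[tail & r->mask];

The compiler inlines this to the same code, but the assumption "the memory
after the header is an array of nb_ring_entry" is written down once instead
of being scattered through the enqueue and dequeue paths.

--
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl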