> -----Original Message-----
> From: Jerin Jacob <jerinjac...@gmail.com>
> Sent: Tuesday, May 12, 2020 2:42 PM
> To: Ruifeng Wang <ruifeng.w...@arm.com>
> Cc: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>;
> dev@dpdk.org; jer...@marvell.com; hemant.agra...@nxp.com; Ajit
> Khaparde (ajit.khapa...@broadcom.com) <ajit.khapa...@broadcom.com>;
> igo...@amazon.com; tho...@monjalon.net; viachesl...@mellanox.com;
> arybche...@solarflare.com; nd <n...@arm.com>
> Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
>
> On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <ruifeng.w...@arm.com>
> wrote:
> >
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > Sent: Tuesday, May 12, 2020 2:07 AM
> > > To: dev@dpdk.org; jer...@marvell.com; hemant.agra...@nxp.com; Ajit
> > > Khaparde (ajit.khapa...@broadcom.com) <ajit.khapa...@broadcom.com>;
> > > igo...@amazon.com; tho...@monjalon.net; viachesl...@mellanox.com;
> > > arybche...@solarflare.com; Honnappa Nagarahalli
> > > <honnappa.nagaraha...@arm.com>
> > > Cc: Ruifeng Wang <ruifeng.w...@arm.com>; nd <n...@arm.com>
> > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > >
> > > Change the barrier APIs for IO to reflect that the Armv8-a memory
> > > model provides other-multi-copy atomicity.
> > >
> > > The Armv8-a memory model has been strengthened to require
> > > other-multi-copy atomicity. This property requires memory accesses
> > > from an observer to become visible to all other observers
> > > simultaneously [3]. This means
> > >
> > > a) A write arriving at an endpoint shared between multiple CPUs is
> > >    visible to all CPUs
> > > b) A write that is visible to all CPUs is also visible to all other
> > >    observers in the shareability domain
> > >
> > > This allows for using cheaper DMB instructions in place of DSB
> > > for devices that are visible to all CPUs (i.e. devices that DPDK
> > > caters to).
> > >
> > > Please refer to [1], [2] and [3] for more information.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > ---
> > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > index 7b7099cdc..e406411bb 100644
> > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > @@ -19,11 +19,11 @@ extern "C" {
> > >  #include <rte_compat.h>
> > >  #include <rte_debug.h>
> > >
> > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > >
> > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > >
> > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > >
> > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > >
> > > @@ -37,9 +37,9 @@ extern "C" {
> > >
> > >  #define rte_io_rmb() rte_rmb()
> > >
> > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > +#define rte_cio_wmb() rte_wmb()
> > >
> > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > +#define rte_cio_rmb() rte_rmb()
> > >
> > >  /*------------------------ 128 bit atomic operations
> > > -------------------------*/
> > >
> > > --
> > > 2.17.1
> >
> > This change showed about 7% performance gain in the testpmd single
> > core NDR test.
>
> I am trying to understand this patch wrt the current DPDK usage model.
>
> 1) Is the performance improvement due to the fact that the PMD you are
> using for testing is supposed to use the existing rte_cio_* but was
> using rte_[rw]mb instead?
This is part of the reason. There are also cases where rte_io_* was used
and can be relaxed, such as: http://patches.dpdk.org/patch/68162/

> 2) In my understanding:
> a) CPU to CPU barrier requirements are addressed by rte_smp_*
> b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> c) CPU to ANY (CPU or Device) barrier requirements are addressed by
>    rte_[rw]mb
>
> If (c) is true then we are violating the DPDK spec with this change.
> Right?

Developers are still required to use the correct barrier APIs for the
different use cases. I think this change mitigates the performance
penalty when a non-optimal barrier is used.

> This change will not be required if the fastpath (CPU to Device) is
> using rte_cio_*. Right?

See 1). Correct usage of rte_cio_* is not the whole story. For some
other use cases, such as barriers between accesses of different memory
types, we can also use the lighter 'dmb' barrier.

> >
> > Tested-by: Ruifeng Wang <ruifeng.w...@arm.com>
> >