On 4/11/20 6:46 AM, Gavin Hu wrote: > Hi Andrew, > >> -----Original Message----- >> From: Andrew Rybchenko <arybche...@solarflare.com> >> Sent: Saturday, April 11, 2020 1:21 AM >> To: Gavin Hu <gavin...@arm.com>; dev@dpdk.org >> Cc: nd <n...@arm.com>; david.march...@redhat.com; >> tho...@monjalon.net; rasl...@mellanox.com; d...@linux.vnet.ibm.com; >> bruce.richard...@intel.com; konstantin.anan...@intel.com; >> ma...@mellanox.com; shah...@mellanox.com; viachesl...@mellanox.com; >> jer...@marvell.com; Honnappa Nagarahalli >> <honnappa.nagaraha...@arm.com>; Ruifeng Wang >> <ruifeng.w...@arm.com>; Phil Yang <phil.y...@arm.com>; Joyce Kong >> <joyce.k...@arm.com>; Steve Capper <steve.cap...@arm.com> >> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and >> use it for mlx5 PMD >> >> On 4/10/20 7:41 PM, Gavin Hu wrote: >>> To order writes to various memory types, 'sfence' is required for x86, >>> and 'dmb oshst' is required for aarch64. >>> >>> But within DPDK, there is no abstracted barriers covers this >>> combination: sfence(x86)/dmb(aarch64). >>> >>> So introduce a new barrier class - rte_dma_*mb for this combination, >>> >>> Doorbell rings are typical use cases of this new barrier class, which >>> requires something ready in the memory before letting HW aware. >>> >>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, >> while >>> rte_wmb is 'dsb' for aarch64. >> >> As far as I can see rte_cio_wmb() is exactly definition of the barrier >> to be used for doorbells. Am I missing something? > > I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work > queues, located in the host memory, but shared between CPU and IO device. > rte_io_wmb is for MMIO regions. > We are missing the barriers for various memory types, eg. Doorbell cases.
When the patch series is applied, we'll have 5 types of memory barriers: regular, smp, cio, io, dma. Do we really need so many? May be we need a table in description which could help to make the right choice. I.e. type of access on both axis and type of barrier to use on intersection. > There is an implication in the definition of rte_cio_wmb, it can not be used > for non-coherent MMIO region(WC?) > http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124 >> May be it is just a bug in rte_cio_wmb() on x86? > rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel > code, 'sfence' is required for various/mixed memory types. > DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of > writes to host memory that shared by IO device. > Strengthening rte_cio_wmb may hurt performance, so a new barrier class is > introduced to optimize for aarch64, in the fast path only, while not > impacting x86. > http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087 May be my problem that I don't fully understand real-life usecases when cio should be used in accordance with its current definition. Does it make sense without doorbell? Does HW polling via DMA? Thanks for explanations, Andrew. >> >>> In the joint preliminary testing between Arm and Ampere, 8%~13% >>> performance boost was measured. >>> >>> As there is no functionality changes, it will not impact x86. >>> >>> Gavin Hu (6): >>> eal: introduce new class of barriers for DMA use cases >>> net/mlx5: dmb for immediate doorbell ring on aarch64 >>> net/mlx5: relax barrier to order UAR writes on aarch64 >>> net/mlx5: relax barrier for aarch64 >>> net/mlx5: add descriptive comment for a barrier >>> doc: clarify one configuration in mlx5 guide >>> >>> Phil Yang (1): >>> net/mlx5: relax ordering for multi-packet RQ buffer refcnt >>> >>> doc/guides/nics/mlx5.rst | 6 ++-- >>> drivers/net/mlx5/mlx5_rxq.c | 2 +- >>> drivers/net/mlx5/mlx5_rxtx.c | 16 ++++++----- >>> drivers/net/mlx5/mlx5_rxtx.h | 14 ++++++---- >>> lib/librte_eal/arm/include/rte_atomic_32.h | 6 ++++ >>> lib/librte_eal/arm/include/rte_atomic_64.h | 6 ++++ >>> lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++ >>> lib/librte_eal/ppc/include/rte_atomic.h | 6 ++++ >>> lib/librte_eal/x86/include/rte_atomic.h | 6 ++++ >>> 9 files changed, 78 insertions(+), 15 deletions(-) >>> >