> -----Original Message-----
> From: Richardson, Bruce <bruce.richard...@intel.com>
> Sent: Tuesday, April 5, 2022 5:38 PM
> To: Ilya Maximets <i.maxim...@ovn.org>; Chengwen Feng <fengcheng...@huawei.com>;
>     Radha Mohan Chintakuntla <rad...@marvell.com>; Veerasenareddy Burru <vbu...@marvell.com>;
>     Gagandeep Singh <g.si...@nxp.com>; Nipun Gupta <nipun.gu...@nxp.com>
> Cc: Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>;
>     Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>;
>     Van Haaren, Harry <harry.van.haa...@intel.com>;
>     Maxime Coquelin (maxime.coque...@redhat.com) <maxime.coque...@redhat.com>;
>     ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>;
>     O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>
> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > On 3/30/22 16:09, Bruce Richardson wrote:
> > > On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> > >>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>> Forking the thread to discuss a memory consistency/ordering model.
> > >>>>>
> > >>>>> AFAICT, dmadev can be anything from part of a CPU to a completely
> > >>>>> separate PCI device. However, I don't see any memory ordering being
> > >>>>> enforced or even described in the dmadev API or documentation.
> > >>>>> Please, point me to the correct documentation, if I somehow missed it.
> > >>>>>
> > >>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> > >>>>> the data and the descriptor info. CPU core (C) is reading the
> > >>>>> descriptor and the data it points to.
> > >>>>>
> > >>>>> A few things about that process:
> > >>>>>
> > >>>>> 1. There is no memory barrier between writes A and B (Did I miss
> > >>>>>    them?). Meaning that those operations can be seen by C in a
> > >>>>>    different order regardless of barriers issued by C and regardless
> > >>>>>    of the nature of devices A and B.
> > >>>>>
> > >>>>> 2. Even if there is a write barrier between A and B, there is
> > >>>>>    no guarantee that C will see these writes in the same order
> > >>>>>    as C doesn't use real memory barriers because vhost
> > >>>>>    advertises
> > >>>>
> > >>>> s/advertises/does not advertise/
> > >>>>
> > >>>>>    VIRTIO_F_ORDER_PLATFORM.
> > >>>>>
> > >>>>> So, I'm getting to the conclusion that there is a missing write
> > >>>>> barrier on the vhost side and vhost itself must not advertise the
> > >>>>
> > >>>> s/must not/must/
> > >>>>
> > >>>> Sorry, I wrote things backwards. :)
> > >>>>
> > >>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> > >>>>> memory barriers.
> > >>>>>
> > >>>>> Would like to hear some thoughts on that topic. Is it a real issue?
> > >>>>> Is it an issue considering all possible CPU architectures and
> > >>>>> DMA HW variants?
> > >>>>>
> > >>>
> > >>> In terms of ordering of operations using dmadev:
> > >>>
> > >>> * Some DMA HW will perform all operations strictly in order, e.g. Intel
> > >>>   IOAT, while other hardware may not guarantee the order of operations
> > >>>   and may do things in parallel, e.g. Intel DSA. Therefore the dmadev
> > >>>   API provides the fence operation which allows the order to be enforced.
> > >>>   The fence can be thought of as a full memory barrier, meaning no
> > >>>   jobs after the barrier can be started until all those before it
> > >>>   have completed. Obviously, for HW where order is always enforced,
> > >>>   this will be a no-op, but for hardware that parallelizes, we want
> > >>>   to reduce the fences to get best performance.
> > >>>
> > >>> * For synchronization between DMA devices and CPUs, where a CPU can
> > >>>   only write after a DMA copy has been done, the CPU must wait for
> > >>>   the dma completion to guarantee ordering. Once the completion has
> > >>>   been returned, the completed operation is globally visible to all
> > >>>   cores.
> > >>
> > >> Thanks for the explanation! Some questions though:
> > >>
> > >> In our case one CPU waits for completion and another CPU is actually
> > >> using the data. IOW, "CPU must wait" is a bit ambiguous. Which CPU
> > >> must wait?
> > >>
> > >> Or should it be "Once the completion is visible on any core, the
> > >> completed operation is globally visible to all cores." ?
> > >>
> > >
> > > The latter. Once the change to memory/cache is visible to any core,
> > > it is visible to all ones. This applies to regular CPU memory writes
> > > too - at least on IA, and I expect on many other architectures - once
> > > the write is visible outside the current core it is visible to every
> > > other core. Once the data hits the L1 or L2 cache of any core, any
> > > subsequent requests for that data from any other core will "snoop"
> > > the latest data from that core's cache, even if it has not made its
> > > way down to a shared cache, e.g. L3 on most IA systems.
> >
> > It sounds like you're referring to the "multicopy atomicity" of the
> > architecture. However, that is not a universally supported thing.
> > AFAICT, POWER and older ARM systems don't support it, so writes
> > performed by one core are not necessarily available to all other cores
> > at the same time. That means that if CPU0 writes the data and the
> > completion flag, and CPU1 reads the completion flag and writes the ring,
> > CPU2 may see the ring write but may still not see the write of the
> > data, even though there was a control dependency on CPU1. There should
> > be a full memory barrier on CPU1 in order to fulfill the memory
> > ordering requirements for CPU2, IIUC.
> >
> > In our scenario CPU0 is a DMA device, which may or may not be part of
> > a CPU and may have different memory consistency/ordering requirements.
> > So, the question is: does the DPDK DMA API guarantee multicopy
> > atomicity between the DMA device and all CPU cores regardless of CPU
> > architecture and the nature of the DMA device?
> >
>
> Right now, it doesn't, because this never came up in discussion. In
> order to be useful, it sounds like it explicitly should do so. At least
> for the Intel ioat and idxd driver cases, this will be supported, so we
> just need to ensure all other drivers currently upstreamed can offer
> this too. If they cannot, we cannot offer it as a global guarantee, and
> we should see about adding a capability flag for this to indicate when
> the guarantee is there or not.
>
> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> document for dmadev that once a DMA operation is completed, the op is
> guaranteed visible to all cores/threads? If not, any thoughts on what
> guarantees we can provide in this regard, or what capabilities should be
> exposed?
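
To make the question concrete, here is a rough sketch of the vhost-side
pattern being discussed, using the public dmadev calls (rte_dma_copy,
rte_dma_submit, rte_dma_completed). The function and variable names are
made up for illustration, and the rte_smp_wmb() before publishing is
exactly the barrier whose necessity (or redundancy) is in question:

/*
 * Illustrative sketch only: copy a payload via dmadev, then publish a
 * descriptor flag for another core to consume. Error handling is minimal.
 */
#include <stdbool.h>

#include <rte_atomic.h>
#include <rte_dmadev.h>

static void
dma_copy_then_publish(int16_t dma_id, uint16_t vchan,
                      rte_iova_t src, rte_iova_t dst, uint32_t len,
                      volatile uint16_t *desc_flag, uint16_t flag_val)
{
        uint16_t last_idx;
        bool has_error = false;

        /* (A) the DMA device writes the payload into the destination buffer */
        if (rte_dma_copy(dma_id, vchan, src, dst, len, 0) < 0)
                return;  /* ring full: real code would retry or fall back */
        rte_dma_submit(dma_id, vchan);

        /* (B) this core polls until the dmadev reports the copy completed */
        while (rte_dma_completed(dma_id, vchan, 1, &last_idx, &has_error) == 0)
                ;
        if (has_error)
                return;

        /*
         * The open question: is the completed copy already visible to *all*
         * cores at this point, or only to the polling core?  If the
         * guarantee is not global for every driver and architecture, a
         * write barrier (or having the guest driver use real barriers via
         * VIRTIO_F_ORDER_PLATFORM) is needed before publishing.
         */
        rte_smp_wmb();

        /* core (B) publishes the descriptor flag; core (C) reads it and the data */
        *desc_flag = flag_val;
}
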
Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru,
@Gagandeep Singh, @Nipun Gupta,

Requesting your valuable opinions on the queries raised in this thread.

>
> /Bruce
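
On the capability-flag idea above, one possible shape (the flag name below
is hypothetical - no such bit exists in dmadev today) would be to report it
through rte_dma_info_get() alongside the existing RTE_DMA_CAPA_* bits, so
that vhost could insert the extra barrier only when the guarantee is absent:

#include <stdbool.h>

#include <rte_dmadev.h>

/* Hypothetical capability bit - not part of the current dmadev API. */
#define RTE_DMA_CAPA_COMPLETION_GLOBALLY_VISIBLE (1ULL << 33)

static bool
dma_completion_visible_to_all_cores(int16_t dev_id)
{
        struct rte_dma_info info;

        if (rte_dma_info_get(dev_id, &info) != 0)
                return false;

        /* dev_capa carries the RTE_DMA_CAPA_* flags reported by the driver */
        return (info.dev_capa & RTE_DMA_CAPA_COMPLETION_GLOBALLY_VISIBLE) != 0;
}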