On 3/30/22 13:12, Bruce Richardson wrote:
> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>> On 3/30/22 12:41, Ilya Maximets wrote:
>>> Forking the thread to discuss a memory consistency/ordering model.
>>>
>>> AFAICT, dmadev can be anything from part of a CPU to a completely
>>> separate PCI device.  However, I don't see any memory ordering being
>>> enforced or even described in the dmadev API or documentation.
>>> Please, point me to the correct documentation, if I somehow missed it.
>>>
>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>> the data and the descriptor info.  CPU core (C) is reading the
>>> descriptor and the data it points to.
>>>
>>> A few things about that process:
>>>
>>> 1. There is no memory barrier between writes A and B (Did I miss
>>>    them?).  Meaning that those operations can be seen by C in a
>>>    different order regardless of barriers issued by C and regardless
>>>    of the nature of devices A and B.
>>>
>>> 2. Even if there is a write barrier between A and B, there is
>>>    no guarantee that C will see these writes in the same order
>>>    as C doesn't use real memory barriers because vhost advertises
>>
>> s/advertises/does not advertise/
>>
>>> VIRTIO_F_ORDER_PLATFORM.
>>>
>>> So, I'm getting to the conclusion that there is a missing write
>>> barrier on the vhost side and vhost itself must not advertise the
>>
>> s/must not/must/
>>
>> Sorry, I wrote things backwards. :)
>>
>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory
>>> barriers.
>>>
>>> Would like to hear some thoughts on that topic.  Is it a real issue?
>>> Is it an issue considering all possible CPU architectures and DMA
>>> HW variants?
>>>
>
> In terms of ordering of operations using dmadev:
>
> * Some DMA HW will perform all operations strictly in order, e.g. Intel
>   IOAT, while other hardware may not guarantee order of operations/may
>   do things in parallel, e.g. Intel DSA.  Therefore the dmadev API
>   provides the fence operation, which allows the order to be enforced.
>   The fence can be thought of as a full memory barrier, meaning no jobs
>   after the barrier can be started until all those before it have
>   completed.  Obviously, for HW where order is always enforced, this
>   will be a no-op, but for hardware that parallelizes, we want to reduce
>   the fences to get best performance.
>
> * For synchronization between DMA devices and CPUs, where a CPU can
>   only write after a DMA copy has been done, the CPU must wait for the
>   dma completion to guarantee ordering.  Once the completion has been
>   returned, the completed operation is globally visible to all cores.
Thanks for the explanation!  Some questions, though:

In our case, one CPU waits for the completion and another CPU is
actually using the data.  IOW, "the CPU must wait" is a bit ambiguous.
Which CPU must wait?  Or should it be "Once the completion is visible
on any core, the completed operation is globally visible to all
cores."?

And the main question: are these synchronization claims documented
somewhere?

Best regards, Ilya Maximets.
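
P.S. To make the scenario concrete, here is a rough sketch of the
pattern I'm describing, in dmadev terms.  This is not the actual vhost
code: dev_id/vchan are assumed to be an already configured device and
virtual channel, and mark_desc_used() is just a placeholder for the
descriptor update (write B) done by core B.

#include <stdbool.h>
#include <rte_dmadev.h>

/*
 * Sketch only.  Core B enqueues the payload copy (write A), polls for
 * its completion, and only then publishes the descriptor (write B).
 * Core C reads the descriptor first and the data second.
 */
static void
copy_then_publish(int16_t dev_id, uint16_t vchan,
		  rte_iova_t src, rte_iova_t dst, uint32_t len,
		  void (*mark_desc_used)(void))
{
	uint16_t last_idx;
	bool has_error;

	/* Write A: enqueue the payload copy and ring the doorbell. */
	if (rte_dma_copy(dev_id, vchan, src, dst, len,
			 RTE_DMA_OP_FLAG_SUBMIT) < 0)
		return;

	/*
	 * If several copies were enqueued and their relative order
	 * mattered on HW that parallelizes (e.g. DSA), the later jobs
	 * would carry RTE_DMA_OP_FLAG_FENCE.
	 */

	/* Core B polls until the copy is reported as completed. */
	while (rte_dma_completed(dev_id, vchan, 1, &last_idx,
				 &has_error) == 0)
		;

	/*
	 * Write B: publish the descriptor.  The question is whether a
	 * write barrier is needed here so that core C, which reads the
	 * descriptor and then the data, can never observe the two
	 * writes reordered.
	 */
	mark_desc_used();
}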