On Wed, Jan 07, 2026 at 08:47:36AM +1300, Barry Song wrote:
> On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <[email protected]> wrote:
> >
> > On 2026-01-06 6:41 pm, Barry Song wrote:
> > > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <[email protected]> wrote:
> > >>
> > >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> > >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <[email protected]> wrote:
> > >>>>
> > >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > >>>>> From: Barry Song <[email protected]>
> > >>>>>
> > >>>>> Instead of performing a flush per SG entry, issue all cache
> > >>>>> operations first and then flush once. This ultimately benefits
> > >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >>>>>
> > >>>>> Cc: Leon Romanovsky <[email protected]>
> > >>>>> Cc: Catalin Marinas <[email protected]>
> > >>>>> Cc: Will Deacon <[email protected]>
> > >>>>> Cc: Marek Szyprowski <[email protected]>
> > >>>>> Cc: Robin Murphy <[email protected]>
> > >>>>> Cc: Ada Couprie Diaz <[email protected]>
> > >>>>> Cc: Ard Biesheuvel <[email protected]>
> > >>>>> Cc: Marc Zyngier <[email protected]>
> > >>>>> Cc: Anshuman Khandual <[email protected]>
> > >>>>> Cc: Ryan Roberts <[email protected]>
> > >>>>> Cc: Suren Baghdasaryan <[email protected]>
> > >>>>> Cc: Tangquan Zheng <[email protected]>
> > >>>>> Signed-off-by: Barry Song <[email protected]>
> > >>>>> ---
> > >>>>>  kernel/dma/direct.c | 14 +++++++-------
> > >>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> <...>
> > >>>>
> > >>>>> -	if (!dev_is_dma_coherent(dev)) {
> > >>>>> +	if (!dev_is_dma_coherent(dev))
> > >>>>>  		arch_sync_dma_for_device(paddr, sg->length,
> > >>>>>  				dir);
> > >>>>> -		arch_sync_dma_flush();
> > >>>>> -	}
> > >>>>>  	}
> > >>>>> +	if (!dev_is_dma_coherent(dev))
> > >>>>> +		arch_sync_dma_flush();
> > >>>>
> > >>>> This patch should be squashed into the previous one. You introduced
> > >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> > >>>
> > >>> Hi Leon,
> > >>>
> > >>> The previous patch replaces all arch_sync_dma_for_* calls with
> > >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> > >>> functional change. The subsequent patches then implement the
> > >>> actual batching. I feel this is a better approach for reviewing
> > >>> each change independently. Otherwise, the previous patch would
> > >>> be too large.
> > >>
> > >> Don't worry about it. Your patches are small enough.
> > >
> > > My hardware does not require a bounce buffer, but I am concerned that
> > > this patch may be incorrect for systems that do require one.
> > >
> > > Now it is:
> > >
> > > void dma_direct_sync_sg_for_cpu(struct device *dev,
> > >		struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > > {
> > >	struct scatterlist *sg;
> > >	int i;
> > >
> > >	for_each_sg(sgl, sg, nents, i) {
> > >		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> > >
> > >		if (!dev_is_dma_coherent(dev))
> > >			arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > >
> > >		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> > >
> > >		if (dir == DMA_FROM_DEVICE)
> > >			arch_dma_mark_clean(paddr, sg->length);
> > >	}
> > >
> > >	if (!dev_is_dma_coherent(dev)) {
> > >		arch_sync_dma_flush();
> > >		arch_sync_dma_for_cpu_all();
> > >	}
> > > }
> > >
> > > Should we call swiotlb_sync_single_for_cpu() and
> > > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > > latest data and that the memcpy is correct?
> > > I mean:
> >
> > Yes, this and the equivalents in the later patches are broken for all
> > the sync_for_cpu and unmap paths which may end up bouncing (beware some
> > of them get a bit fiddly) - any cache maintenance *must* be completed
> > before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> > and appears to be entirely dead now.
>
> Thanks, Robin. Personally, I would prefer an approach like the one below,
> i.e. not optimizing the bounce-buffer cases: they are already slow due to
> the hardware limitation of needing a memcpy, and optimizing them would
> make the code quite messy.
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>  	for_each_sg(sgl, sg, nents, i) {
>  		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
>
> -		if (!dev_is_dma_coherent(dev))
> +		if (!dev_is_dma_coherent(dev)) {
>  			arch_sync_dma_for_cpu(paddr, sg->length, dir);
> +			if (unlikely(dev->dma_io_tlb_mem))
> +				arch_sync_dma_flush();
> +		}
>
>  		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
>
> I'd like to check with you, Leon, and Marek on your views about this.
I agree with your point that the non-SWIOTLB path is the performant one
and should be preferred. My concern is that you are accessing the
dma_io_tlb_mem variable directly from direct.c, which looks like a layer
violation. You likely need to introduce an is_swiotlb_something() helper
for this; a rough sketch of what I have in mind is at the bottom of this
mail.

BTW, please send a v3 instead of posting incremental follow-ups. It's
hard to track the changes across multiple small additions.

Thanks.

>
> Thanks
> Barry
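To be concrete, here is a minimal, untested sketch of the kind of helper
I mean. The name dev_uses_swiotlb() is only a placeholder, and I've added
the nslabs check that is_swiotlb_active() uses on top of your open-coded
dev->dma_io_tlb_mem test, so treat the exact condition as an assumption:

/* hypothetical addition to include/linux/swiotlb.h */
static inline bool dev_uses_swiotlb(struct device *dev)
{
#ifdef CONFIG_SWIOTLB
	/*
	 * Same idea as the open-coded dev->dma_io_tlb_mem check in your
	 * diff; also checking nslabs, as is_swiotlb_active() does, avoids
	 * per-entry flushes for a registered-but-empty pool.
	 */
	return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->nslabs;
#else
	return false;
#endif
}

dma_direct_sync_sg_for_cpu() would then read:

	if (!dev_is_dma_coherent(dev)) {
		arch_sync_dma_for_cpu(paddr, sg->length, dir);
		/* cache maintenance must complete before any bounce memcpy */
		if (unlikely(dev_uses_swiotlb(dev)))
			arch_sync_dma_flush();
	}

That keeps the SWIOTLB internals behind swiotlb.h, while the common
non-bouncing path still gets a single batched flush per SG list.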
