On 2026-01-06 7:47 pm, Barry Song wrote:
On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <[email protected]> wrote:
On 2026-01-06 6:41 pm, Barry Song wrote:
On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <[email protected]> wrote:
On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <[email protected]> wrote:
On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
From: Barry Song <[email protected]>
Instead of performing a flush per SG entry, issue all cache
operations first and then flush once. This ultimately benefits
__dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
Cc: Leon Romanovsky <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Ada Couprie Diaz <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tangquan Zheng <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
kernel/dma/direct.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
<...>
-		if (!dev_is_dma_coherent(dev)) {
+		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_device(paddr, sg->length,
 					dir);
-			arch_sync_dma_flush();
-		}
 	}
+	if (!dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
This patch should be squashed into the previous one. You introduced
arch_sync_dma_flush() there, and now you are placing it elsewhere.
Hi Leon,
The previous patch replaces all arch_sync_dma_for_* calls with
arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
functional change. The subsequent patches then implement the
actual batching. I feel this makes it easier to review each
change independently; otherwise, the previous patch would be
too large.
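To make this concrete, the non-coherent part of dma_direct_sync_sg_for_device()
roughly goes through these two steps (a simplified sketch, not the exact code,
and ignoring the swiotlb call):

	/* after the previous patch (no functional change): one barrier per entry */
	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

		if (!dev_is_dma_coherent(dev)) {
			arch_sync_dma_for_device(paddr, sg->length, dir);
			arch_sync_dma_flush();
		}
	}

	/* after this patch: batch the cache ops, one barrier for the whole list */
	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

		if (!dev_is_dma_coherent(dev))
			arch_sync_dma_for_device(paddr, sg->length, dir);
	}
	if (!dev_is_dma_coherent(dev))
		arch_sync_dma_flush();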
Don't worry about it. Your patches are small enough.
My hardware does not require a bounce buffer, but I am concerned that
this patch may be incorrect for systems that do require one.
Now it is:
void dma_direct_sync_sg_for_cpu(struct device *dev,
		struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

		if (!dev_is_dma_coherent(dev))
			arch_sync_dma_for_cpu(paddr, sg->length, dir);

		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

		if (dir == DMA_FROM_DEVICE)
			arch_dma_mark_clean(paddr, sg->length);
	}

	if (!dev_is_dma_coherent(dev)) {
		arch_sync_dma_flush();
		arch_sync_dma_for_cpu_all();
	}
}
Should we call swiotlb_sync_single_for_cpu() and
arch_dma_mark_clean() after the flush, to ensure that the CPU sees
the latest data and that the bounce memcpy is correct? I mean:
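Something like this, perhaps (an untested sketch just to illustrate the
ordering I have in mind, not code from the series):

	/* pass 1: all cache maintenance, then a single flush */
	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

		if (!dev_is_dma_coherent(dev))
			arch_sync_dma_for_cpu(paddr, sg->length, dir);
	}

	if (!dev_is_dma_coherent(dev)) {
		arch_sync_dma_flush();	/* complete the invalidate before any bouncing */
		arch_sync_dma_for_cpu_all();
	}

	/* pass 2: bounce and mark clean only after the flush */
	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
		if (dir == DMA_FROM_DEVICE)
			arch_dma_mark_clean(paddr, sg->length);
	}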
Yes, this and the equivalents in the later patches are broken for all
the sync_for_cpu and unmap paths which may end up bouncing (beware some
of them get a bit fiddly) - any cache maintenance *must* be completed
before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
and appears to be entirely dead now.
Thanks, Robin. Personally, I would prefer an approach like the one below:
that is, not optimizing the bounce-buffer cases, since they are already
slow due to the memcpy forced by hardware limitations, and optimizing
them would make the code quite messy.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 550a1a13148d..a4840f7e8722 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 	for_each_sg(sgl, sg, nents, i) {
 		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-		if (!dev_is_dma_coherent(dev))
+		if (!dev_is_dma_coherent(dev)) {
 			arch_sync_dma_for_cpu(paddr, sg->length, dir);
+			if (unlikely(dev->dma_io_tlb_mem))
+				arch_sync_dma_flush();
+		}
 
 		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
I'd like to hear what you, Leon, and Marek think about this.
That doesn't work, since dma_io_tlb_mem is always initialised if a
SWIOTLB buffer exists at all. Similarly I think the existing
dma_need_sync tracking is also too coarse, as that's also always going
to be true for a non-coherent device.
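IIRC the setup path ends up doing roughly this for every device, so once a
SWIOTLB buffer exists at all, the dev->dma_io_tlb_mem check is true regardless
of whether the device will ever actually bounce (sketch from memory, not the
exact code):

	/* kernel/dma/swiotlb.c, called from device_initialize() */
	void swiotlb_dev_init(struct device *dev)
	{
		dev->dma_io_tlb_mem = &io_tlb_default_mem;
		...
	}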
Really this flush wants to be after the swiotlb_find_pool() check in the
swiotlb_tbl_unmap_single()/__swiotlb_sync_single_for_cpu() paths, as
that's the only point we know for sure it's definitely needed for the
given address. It would then be rather fiddly to avoid
potentially-redundant flushes for the non-sg cases (and the final
segment of an sg), but as you already mentioned, if it's limited to
cases when we *are* already paying the cost of bouncing anyway, perhaps
one extra DSB isn't *too* bad if it means zero impact to the
non-bouncing paths.
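i.e. very roughly something like this (completely untested, ignoring the
coherent-device case and the header untangling, and assuming
arch_sync_dma_flush() grows a no-op fallback for other architectures):

	/* include/linux/swiotlb.h - sketch */
	static inline void swiotlb_sync_single_for_cpu(struct device *dev,
			phys_addr_t paddr, size_t size, enum dma_data_direction dir)
	{
		struct io_tlb_pool *pool = swiotlb_find_pool(dev, paddr);

		if (pool) {
			/* outstanding cache maintenance must complete before we bounce */
			arch_sync_dma_flush();
			__swiotlb_sync_single_for_cpu(dev, paddr, size, dir, pool);
		}
	}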
Thanks,
Robin.