On 09/12/2025 11:37, Barry Song wrote:
> On Mon, Dec 8, 2025 at 6:38 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 08/12/2025 09:52, Barry Song wrote:
>>> On Mon, Dec 8, 2025 at 5:41 PM gao xu <[email protected]> wrote:
>>>>
>>>> commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
>>>> instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
>>>> allocates pages of order 4 and 8 that meet the alignment requirements for
>>>> PTE_CONT, enabling PTE_CONT for larger contiguous mappings.
>>>
>>> Unfortunately, we don't have pte_cont for architectures other than
>>> AArch64. On the other hand, AArch64 isn't automatically mapping
>>> cont_pte for mmap. It might be better if this were done
>>> automatically by the ARM code.
>>
>> Yes indeed; CONT_PTE_MASK and PTE_CONT are arm64-specific macros that cannot
>> be used outside of the arm64 arch code.
>>
>>>
>>> Ryan(Cced) is the expert on automatically setting cont_pte for
>>> contiguous mapping, so let's ask for some advice from Ryan.
>>
>> arm64 arch code will automatically and transparently apply PTE_CONT whenever
>> it detects suitable conditions. Those suitable conditions include:
>>
>> - physically contiguous block of 64K, aligned to 64K
>> - virtually contiguous block of 64K, aligned to 64K
>> - 64K block has the same access permissions
>> - 64K block all belongs to the same folio
>> - not a special mapping
>>
>> The last 2 requirements are the tricky ones here: We require that every page
>> in the block belongs to the same folio because a contiguous mapping only
>> maintains a single access and dirty bit for the whole 64K block, so we are
>> losing fidelity vs per-page mappings. But the kernel tracks access/dirty per
>> folio, so the extra fidelity we get for per-page mappings is ignored by the
>> kernel anyway if the contiguous mapping only maps pages from a single folio.
>> We reject special mappings because they are not backed by a folio at all.
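
As an illustration of the five conditions listed above, here is a minimal
userspace sketch. It is not kernel code: every name in it (sketch_pte,
sketch_block_can_be_cont(), sketch_fill_block() and the SKETCH_* constants) is
hypothetical, and it assumes 4K pages so that 16 entries make up the 64K block.
Virtual contiguity is modelled by passing 16 consecutive entries plus the
virtual address of the first one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch only: 4K pages, 16 entries per 64K block. */
#define SKETCH_PAGE_SIZE ((uint64_t)4096)
#define SKETCH_CONT_PTES 16u
#define SKETCH_CONT_SIZE (SKETCH_CONT_PTES * SKETCH_PAGE_SIZE)

/* Hypothetical, simplified stand-in for one page table entry. */
struct sketch_pte {
	uint64_t phys;     /* physical address mapped by this entry */
	uint32_t prot;     /* access permissions */
	int      folio_id; /* owning folio, or -1 if not folio backed */
	bool     special;  /* true for special (e.g. pfn-mapped) entries */
};

/* Fill 16 consecutive entries that map one physically contiguous run. */
static void sketch_fill_block(struct sketch_pte *ptes, uint64_t phys,
			      uint32_t prot, int folio_id, bool special)
{
	for (size_t i = 0; i < SKETCH_CONT_PTES; i++) {
		ptes[i].phys = phys + i * SKETCH_PAGE_SIZE;
		ptes[i].prot = prot;
		ptes[i].folio_id = folio_id;
		ptes[i].special = special;
	}
}

/*
 * Check the conditions from the list above for the 16 consecutive entries
 * whose first entry maps virtual address 'virt'.
 */
static bool sketch_block_can_be_cont(uint64_t virt,
				     const struct sketch_pte *ptes)
{
	/* Virtual and physical start must both be 64K aligned. */
	if (virt & (SKETCH_CONT_SIZE - 1))
		return false;
	if (ptes[0].phys & (SKETCH_CONT_SIZE - 1))
		return false;

	for (size_t i = 0; i < SKETCH_CONT_PTES; i++) {
		/* Special mappings have no folio, so they are rejected. */
		if (ptes[i].special || ptes[i].folio_id < 0)
			return false;
		/* Physically contiguous. */
		if (ptes[i].phys != ptes[0].phys + i * SKETCH_PAGE_SIZE)
			return false;
		/* Same permissions and same folio across the block. */
		if (ptes[i].prot != ptes[0].prot)
			return false;
		if (ptes[i].folio_id != ptes[0].folio_id)
			return false;
	}
	return true;
}
```

In this model a block filled with special = true is always rejected, however
well aligned it is. That matches the situation described in this thread for
mappings created by remap_pfn_range().
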
>>
>> For your case, remap_pfn_range() will create special mappings so we will
>> never set the PTE_CONT bit.
>>
>> Likely we are being a bit too conservative here and we may be able to relax
>> this requirement if we know that nothing will ever consume the access/dirty
>> information for special mappings? I'm not sure if that is the case in general
>> though - it would need some investigation.
>>
>> With that issue resolved, there is still a second issue; there are 2 ways the
>> arm64 arch code detects suitable contiguous mappings. The primary way is via
>> a call to set_ptes(). This is part of the "PTE batching" API and explicitly
>> tells the implementation that all the conditions are met (including the
>> memory being backed by a folio). This is the most efficient approach. See
>> contpte_set_ptes().
>>
>> There is a second (hacky) approach which attempts to recognise when the last
>> PTE of a contiguous block is set and automatically "fold" the mapping. See
>> contpte_try_fold(). This approach has a cost because (for systems without
>> BBML2_NOABORT) we have to issue a TLBI when we fold the range.
>>
>> For remap_pfn_range(), we would be relying on the second approach since it is
>> not currently batched (and could not use set_ptes() as currently spec'ed due
>> to there being no folio). If we are going to add support for contiguous
>> pfn-mapped PTEs, it would be preferable to add equivalent batching APIs (or
>> relax set_ptes()).
>>
>
> Thanks a lot, Ryan. It seems quite tricky to support automatic cont_pte.
>
>> I think this would be a useful improvement, but it's not as straightforward
>> as adding PTE_CONT in system_heap_mmap().
>
> Since it's just a driver, I'm not sure if it's acceptable to use CONFIG_ARM64.
> However, I can find many instances of it in drivers.
> drivers % git grep CONFIG_ARM64 | wc -l
> 127
>
> On the other hand, a corner case is when the dma-buf is partially unmapped.
> I assume cont_pte can still be automatically unfolded, even for
> special mappings?

I think unfolding will probably happen to work, but you're definitely in the
neighbourhood of "horrible hack that may not work as intended in some corner
cases".

I think it would be much better to support batching for pfn-mapped ptes. That
would generalize to many more users. (and I might be interested in taking a
look at some point next year if nobody else gets to it). We deliberately
didn't want to expose the idea of a single, specific contiguous size to the
generic code so that the arch could make more fine-grained decisions. :)

Thanks,
Ryan

>
> Thanks
> Barry
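
To make the difference between the two detection paths discussed in this
thread concrete, here is a toy userspace model. It is only a sketch and does
not reflect the real contpte_set_ptes() or contpte_try_fold() code: all toy_*
names are hypothetical, a block is assumed to be 16 entries, the "cont" flag
stands in for PTE_CONT, and a counter stands in for the TLBI issued on fold.
Callers are assumed to stay within the 64-entry table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CONT_PTES 16u /* assumed: 16 x 4K entries = one 64K block */

struct toy_pte {
	bool     valid;
	bool     cont; /* stands in for PTE_CONT */
	uint64_t pfn;
	uint32_t prot;
};

struct toy_table {
	struct toy_pte pte[64];
	unsigned int   tlbi_count; /* simulated TLB invalidations */
};

/* True if the 16 entries starting at 'first' form one foldable block. */
static bool toy_block_foldable(const struct toy_table *t, size_t first)
{
	const struct toy_pte *p = &t->pte[first];

	if (!p[0].valid || p[0].pfn % TOY_CONT_PTES != 0)
		return false;
	for (size_t i = 1; i < TOY_CONT_PTES; i++) {
		if (!p[i].valid || p[i].pfn != p[0].pfn + i ||
		    p[i].prot != p[0].prot)
			return false;
	}
	return true;
}

/*
 * Per-entry path: write one entry, and only when it is the last slot of a
 * block try to fold the whole block. Folding rewrites live entries, which
 * this model charges as one invalidation.
 */
static void toy_set_pte(struct toy_table *t, size_t idx, uint64_t pfn,
			uint32_t prot)
{
	t->pte[idx] = (struct toy_pte){ .valid = true, .pfn = pfn, .prot = prot };

	if (idx % TOY_CONT_PTES != TOY_CONT_PTES - 1)
		return;

	size_t first = idx - (TOY_CONT_PTES - 1);

	if (!toy_block_foldable(t, first))
		return;

	t->tlbi_count++;
	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		t->pte[first + i].cont = true;
}

/*
 * Batched path: the caller promises 'nr' entries with contiguous pfns and
 * identical permissions, so fully covered, aligned blocks can be marked
 * contiguous as they are written, with no invalidation in this model.
 */
static void toy_set_ptes(struct toy_table *t, size_t idx, uint64_t pfn,
			 uint32_t prot, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		t->pte[idx + i] = (struct toy_pte){ .valid = true,
						    .pfn = pfn + i,
						    .prot = prot };

	/* First block boundary at or after idx. */
	size_t b = (idx + TOY_CONT_PTES - 1) / TOY_CONT_PTES * TOY_CONT_PTES;

	for (; b + TOY_CONT_PTES <= idx + nr; b += TOY_CONT_PTES) {
		if (t->pte[b].pfn % TOY_CONT_PTES != 0)
			continue;
		for (size_t i = 0; i < TOY_CONT_PTES; i++)
			t->pte[b + i].cont = true;
	}
}
```

In this model, filling one block with 16 toy_set_pte() calls ends with the
block folded and one simulated invalidation, while a single toy_set_ptes()
call over the same range reaches the same state with none. That is the cost
difference behind the preference for a batching API for pfn-mapped PTEs.
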
