Hi, Christian,

On Tue, 2025-08-26 at 11:56 +0200, Christian König wrote:
> On 26.08.25 11:17, David Hildenbrand wrote:
> > On 26.08.25 11:00, Christian König wrote:
> > > On 26.08.25 10:46, David Hildenbrand wrote:
> > > > > > So my assumption would be that that is missing for the
> > > > > > drivers here?
> > > > > 
> > > > > Well yes and no.
> > > > > 
> > > > > See, the PAT is optimized for applying specific caching
> > > > > attributes to ranges [A..B] (e.g. it uses an R/B tree). But
> > > > > what drivers do here is that they have single pages (usually
> > > > > from get_free_page() or similar) and want to apply a certain
> > > > > caching attribute to them.
> > > > > 
> > > > > So what would happen is that we completely clutter the R/B
> > > > > tree used by the PAT with thousands if not millions of
> > > > > entries.
> > > > 
> > > > Hm, above you're saying that there is no direct map, but now
> > > > you are saying that the pages were obtained through
> > > > get_free_page()?
> > > 
> > > The problem only happens with highmem pages on 32bit kernels.
> > > Those pages are not in the linear mapping.
> > 
> > Right, in the common case there is a direct map.
> > 
> > > > I agree that what you describe here sounds suboptimal. But if
> > > > the pages were obtained from the buddy, there surely is a
> > > > direct map -- unless we explicitly remove it :(
> > > > 
> > > > If we're talking about individual pages without a direct map,
> > > > I would wonder if they are actually part of a bigger memory
> > > > region that can just be reserved in one go (similar to how
> > > > remap_pfn_range() would handle it).
> > > > 
> > > > Can you briefly describe how your use case obtains these PFNs,
> > > > and how scattered they and their caching attributes might be?
> > > 
> > > What drivers do is to call get_free_page() or alloc_pages_node()
> > > with the GFP_HIGHUSER flag set.
> > > 
> > > For non-highmem pages drivers then call set_pages_wc/uc(), which
> > > changes the caching of the linear mapping, but for highmem pages
> > > there is no linear mapping, so set_pages_wc() or set_pages_uc()
> > > doesn't work and drivers avoid calling it.
> > > 
> > > Those are basically just random system memory pages. So they are
> > > potentially scattered over the whole memory address space.
> > 
> > Thanks, that's valuable information.
> > 
> > So essentially these drivers maintain their own consistency and
> > PAT is not aware of that.
> > 
> > And the real problem is ordinary system RAM.
> > 
> > There are various ways forward.
> > 
> > 1) We use another interface that consumes pages instead of PFNs,
> > like a vm_insert_pages_pgprot() we would be adding.
> > 
> > Is there any strong requirement for inserting non-refcounted
> > PFNs?
> 
> Yes, there is a strong requirement to insert non-refcounted PFNs.
> 
> We had a lot of trouble with KVM people trying to grab a reference
> to those pages even if the VMA had the VM_PFNMAP flag set.
> 
> > 2) We add another interface that consumes PFNs, but explicitly
> > states that it is only for ordinary system RAM, and that the user
> > is required to update the direct map.
> > 
> > We could sanity-check the direct map in debug kernels.
> 
> I would rather like to see vmf_insert_pfn_prot() fixed instead.
> 
> That function was explicitly added to insert the PFN with the given
> attributes, and as far as I can see all users of that function
> expect exactly that.
> 
> > 3) We teach PAT code in pfnmap_setup_cachemode_pfn() about
> > treating this system RAM differently.
> > 
> > There is also the option for a mixture between 1 and 2, where we
> > get pages, but we map them non-refcounted in a VM_PFNMAP.
> > 
> > In general, having pages makes it easier to assert that they are
> > likely ordinary system RAM pages, and that the interface is not
> > getting abused for something else.
> 
> Well, exactly that's the use case here, and it is not abusive at
> all as far as I can see.
> 
> What drivers want is to insert a PFN with a certain set of caching
> attributes, regardless of whether it's system memory or iomem.
> That's why vmf_insert_pfn_prot() was created in the first place.
> 
> That drivers need to call set_pages_wc/uc() for the linear mapping
> on x86 manually is correct, and checking that in debug kernels is
> clearly a good idea.
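For reference, the driver pattern described in the thread could be
sketched roughly as below. This is an illustrative, non-compilable
outline, not code from any particular driver: the helper names
(my_alloc_uc_page, my_fault) are made up, and only alloc_page(),
PageHighMem(), set_pages_uc(), page_to_pfn() and
vmf_insert_pfn_prot() are real kernel APIs mentioned or implied
above.

```c
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <asm/set_memory.h>

/* Hypothetical helper: allocate one page intended for an uncached
 * userspace mapping. */
static struct page *my_alloc_uc_page(void)
{
	struct page *page = alloc_page(GFP_HIGHUSER);

	if (!page)
		return NULL;

	/*
	 * For lowmem pages the linear (direct) mapping must be
	 * switched to the same caching mode to avoid an alias with
	 * conflicting attributes. Highmem pages (32bit kernels) have
	 * no linear mapping, so there is nothing to change there.
	 */
	if (!PageHighMem(page))
		set_pages_uc(page, 1);

	return page;
}

/* Hypothetical fault handler: insert the page as a raw,
 * non-refcounted PFN into a VM_PFNMAP VMA with the desired caching
 * attribute. */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct page *page = vmf->vma->vm_private_data;

	return vmf_insert_pfn_prot(vmf->vma, vmf->address,
				   page_to_pfn(page),
				   pgprot_noncached(vmf->vma->vm_page_prot));
}
```

The point of contention above is the vmf_insert_pfn_prot() step:
whether PAT should honor the caller-supplied pgprot for ordinary
system RAM, given that the driver already adjusted the direct map
itself.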
So where is this trending? Is the current suggestion to continue
disallowing aliased mappings with conflicting caching modes and to
enforce checks in debug kernels?

/Thomas

> > We could also perform the set_pages_wc/uc() from inside that
> > function, but maybe it depends on the use case whether we want to
> > do that whenever we map them into a process?
> 
> It sounds like a good idea in theory, but I think it is potentially
> too much overhead to be applicable.
> 
> Thanks,
> Christian.