Hi, Christian,

On Tue, 2025-08-26 at 11:56 +0200, Christian König wrote:
> On 26.08.25 11:17, David Hildenbrand wrote:
> > On 26.08.25 11:00, Christian König wrote:
> > > On 26.08.25 10:46, David Hildenbrand wrote:
> > > > > > So my assumption would be that that is missing for the
> > > > > > drivers here?
> > > > > 
> > > > > Well yes and no.
> > > > > 
> > > > > See, PAT is optimized for applying specific caching
> > > > > attributes to ranges [A..B] (e.g. it uses an R/B tree). But
> > > > > what drivers do here is take single pages (usually from
> > > > > get_free_page() or similar) and apply a certain caching
> > > > > attribute to each of them.
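> > > > > 
> > > > > For contrast, a minimal sketch of the range-oriented usage
> > > > > PAT is built for, via the existing io memtype helpers (as
> > > > > drivers use for e.g. a whole VRAM aperture; bar_base and
> > > > > bar_size are placeholder names):
> > > > > 
> > > > >     #include <linux/io.h>
> > > > > 
> > > > >     /* One call, one PAT R/B-tree entry for the whole range. */
> > > > >     static int drv_reserve_aperture_wc(resource_size_t bar_base,
> > > > >                                        resource_size_t bar_size)
> > > > >     {
> > > > >         return arch_io_reserve_memtype_wc(bar_base, bar_size);
> > > > >     }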
> > > > > 
> > > > > So what would happen is that we completely clutter the R/B
> > > > > tree used by the PAT with thousands if not millions of
> > > > > entries.
> > > > > 
> > > > 
> > > > Hm, above you're saying that there is no direct map, but now
> > > > you are saying that the pages were obtained through
> > > > get_free_page()?
> > > 
> > > The problem only happens with highmem pages on 32bit kernels.
> > > Those pages are not in the linear mapping.
> > 
> > Right, in the common case there is a direct map.
> > 
> > > 
> > > > I agree that what you describe here sounds suboptimal. But if
> > > > the pages were obtained from the buddy, there surely is a
> > > > direct map -- unless we explicitly remove it :(
> > > > 
> > > > If we're talking about individual pages without a direct map, I
> > > > would wonder if they are actually part of a bigger memory
> > > > region that can just be reserved in one go (similar to how
> > > > remap_pfn_range() would handle it).
> > > > 
> > > > Can you briefly describe how your use case obtains these PFNs,
> > > > and how scattered they + their caching attributes might be?
> > > 
> > > What drivers do is call get_free_page() or alloc_pages_node()
> > > with the GFP_HIGHUSER flag set.
> > > 
> > > For non-highmem pages drivers then call set_pages_wc/uc(), which
> > > changes the caching of the linear mapping. But for highmem pages
> > > there is no linear mapping, so set_pages_wc() or set_pages_uc()
> > > doesn't work and drivers avoid calling them.
> > > 
> > > Those are basically just random system memory pages. So they are
> > > potentially scattered over the whole memory address space.
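> > > 
> > > A minimal sketch of that allocation pattern, assuming the x86
> > > helpers from <asm/set_memory.h> (set_pages_uc() only touches the
> > > linear mapping, hence the highmem special case; the function name
> > > is hypothetical):
> > > 
> > >     #include <linux/gfp.h>
> > >     #include <linux/highmem.h>
> > >     #include <asm/set_memory.h>
> > > 
> > >     static struct page *drv_alloc_uncached_page(void)
> > >     {
> > >         struct page *page = alloc_page(GFP_HIGHUSER);
> > > 
> > >         if (!page)
> > >             return NULL;
> > > 
> > >         /*
> > >          * Highmem pages have no linear mapping, so there is
> > >          * nothing to update here; caching attributes are only
> > >          * applied when the page is mapped into userspace.
> > >          */
> > >         if (!PageHighMem(page))
> > >             set_pages_uc(page, 1);
> > > 
> > >         return page;
> > >     }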
> > 
> > Thanks, that's valuable information.
> > 
> > So essentially these drivers maintain their own consistency and PAT
> > is not aware of that.
> > 
> > And the real problem is ordinary system RAM.
> > 
> > There are various ways forward.
> > 
> > 1) We use another interface that consumes pages instead of PFNs,
> >    like a vm_insert_pages_pgprot() we would be adding (sketched
> >    below).
> > 
> >    Is there any strong requirement for inserting non-refcounted
> >    PFNs?
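> > 
> >    A hypothetical signature for it, modeled on the existing
> >    vm_insert_pages():
> > 
> >        /*
> >         * Like vm_insert_pages(), but apply the given pgprot
> >         * instead of vma->vm_page_prot. Pages are refcounted as
> >         * usual.
> >         */
> >        int vm_insert_pages_pgprot(struct vm_area_struct *vma,
> >                                   unsigned long addr,
> >                                   struct page **pages,
> >                                   unsigned long *num,
> >                                   pgprot_t pgprot);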
> 
> Yes, there is a strong requirement to insert non-refcounted PFNs.
> 
> We had a lot of trouble with KVM people trying to grab a reference to
> those pages even if the VMA had the VM_PFNMAP flag set.
> 
> > 2) We add another interface that consumes PFNs, but explicitly
> >    states that it is only for ordinary system RAM, and that the
> >    user is responsible for updating the direct map.
> > 
> >    We could sanity-check the direct map in debug kernels.
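> > 
> >    On x86, such a sanity check might look like this (a sketch;
> >    check_directmap_cachemode() is a hypothetical name, and only 4K
> >    linear-map entries are handled, since large-page entries encode
> >    the PAT bit differently):
> > 
> >        #include <linux/mm.h>
> >        #include <linux/highmem.h>
> >        #include <asm/pgtable_types.h>
> > 
> >        #ifdef CONFIG_DEBUG_VM
> >        static void check_directmap_cachemode(unsigned long pfn,
> >                                              pgprot_t prot)
> >        {
> >            unsigned int level;
> >            pte_t *kpte;
> > 
> >            if (!pfn_valid(pfn) || PageHighMem(pfn_to_page(pfn)))
> >                return;
> > 
> >            kpte = lookup_address((unsigned long)
> >                                  page_to_virt(pfn_to_page(pfn)),
> >                                  &level);
> >            if (!kpte || level != PG_LEVEL_4K)
> >                return;
> > 
> >            /* Linear map and new mapping should agree on caching. */
> >            WARN_ON_ONCE((pte_val(*kpte) & _PAGE_CACHE_MASK) !=
> >                         (pgprot_val(prot) & _PAGE_CACHE_MASK));
> >        }
> >        #endif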
> 
> I would rather like to see vmf_insert_pfn_prot() fixed instead.
> 
> That function was explicitly added to insert the PFN with the given
> attributes and as far as I can see all users of that function expect
> exactly that.
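> 
> As a sketch of what those users expect (drv_object and drv_fault are
> hypothetical names; the VMA is assumed to be set up with VM_PFNMAP):
> 
>     #include <linux/mm.h>
> 
>     struct drv_object {
>         struct page **pages; /* driver-managed backing pages */
>     };
> 
>     static vm_fault_t drv_fault(struct vm_fault *vmf)
>     {
>         struct drv_object *obj = vmf->vma->vm_private_data;
>         unsigned long pfn = page_to_pfn(obj->pages[vmf->pgoff]);
> 
>         /* Non-refcounted insert with the driver's chosen caching. */
>         return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn,
>                 pgprot_writecombine(vmf->vma->vm_page_prot));
>     }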
> 
> > 
> > 3) We teach the PAT code in pfnmap_setup_cachemode_pfn() about
> >    treating this system RAM differently.
> > 
> > 
> > There is also the option of a mixture between 1 and 2, where we
> > get pages, but map them non-refcounted in a VM_PFNMAP.
> > 
> > In general, having pages makes it easier to assert that they are
> > likely ordinary system RAM pages, and that the interface is not
> > getting abused for something else.
> 
> Well, that's exactly the use case here, and it is not abusive at all
> as far as I can see.
> 
> What drivers want is to insert a PFN with a certain set of caching
> attributes, regardless of whether it's system memory or iomem. That's
> why vmf_insert_pfn_prot() was created in the first place.
> 
> That drivers need to call set_pages_wc/uc() manually for the linear
> mapping on x86 is correct, and checking that is clearly a good idea
> for debug kernels.

So where is this trending? Is the current suggestion to continue
disallowing aliased mappings with conflicting caching modes and enforce
checks in debug kernels?

/Thomas


> 
> > We could also perform the set_pages_wc/uc() from inside that
> > function, but maybe it depends on the use case whether we want to
> > do that whenever we map them into a process?
> 
> It sounds like a good idea in theory, but I think it is potentially
> too much overhead to be applicable.
> 
> Thanks,
> Christian.
