From: David Hildenbrand <da...@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM
> 
> On 03.06.25 03:49, Michael Kelley wrote:
> > From: David Hildenbrand <da...@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
> >>
> >> On 23.05.25 18:15, mhkelle...@gmail.com wrote:
> >>> From: Michael Kelley <mhkli...@outlook.com>
> >>>
> >>> Current defio code works only for framebuffer memory that is allocated
> >>> with vmalloc(). The code assumes that the underlying page refcount can
> >>> be used by the mm subsystem to manage each framebuffer page's lifecycle,
> >>> including freeing the page if the refcount goes to 0. This approach is
> >>> consistent with vmalloc'ed memory, but not with contiguous kernel memory
> >>> allocated via alloc_pages() or similar. Pages allocated the latter way
> >>> usually have a refcount of 0, and would be incorrectly freed
> >>> page-by-page if used with defio. That freeing corrupts the memory
> >>> free lists and Linux eventually panics. Simply bumping the refcount after
> >>> allocation doesn’t work because when the framebuffer memory is freed,
> >>> __free_pages() complains about non-zero refcounts.
> >>>
> >>> Commit 37b4837959cb ("video: deferred io with physically contiguous
> >>> memory") from the year 2008 purported to add support for contiguous
> >>> kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
> >>> dma_alloc_coherent() to allocate framebuffer memory, which is likely to
> >>> use alloc_pages(). It's unclear to me how this commit actually worked at
> >>> the time, unless dma_alloc_coherent() was pulling from a CMA pool instead
> >>> of alloc_pages(). Or perhaps alloc_pages() worked differently at the
> >>> time, or differently on the arm32 architecture on which
> >>> sh_mobile_lcdcfb is used.
> >>>
> >>> In any case, for x86 and arm64 today, commit 37b4837959cb is not
> >>> sufficient to support contiguous kernel memory framebuffers. The problem
> >>> can be seen with the hyperv_fb driver, which may allocate the framebuffer
> >>> memory using vmalloc() or alloc_pages(), depending on the configuration
> >>> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.
> >>>
> >>> Fix this limitation by adding defio support for contiguous kernel memory
> >>> framebuffers. A driver with a framebuffer allocated from contiguous
> >>> kernel memory must set the FBINFO_KMEMFB flag to indicate such.
> >>>
> >>> Tested with the hyperv_fb driver in both configurations -- with a vmalloc()
> >>> framebuffer and with an alloc_pages() framebuffer on x86. Also verified a
> >>> vmalloc() framebuffer on arm64. Hardware is not available to me to verify
> >>> that the older arm32 devices still work correctly, but the path for
> >>> vmalloc() framebuffers is essentially unchanged.
> >>>
> >>> Even with these changes, defio does not support framebuffers in MMIO
> >>> space, as defio code depends on framebuffer memory pages having
> >>> corresponding 'struct page's.
> >>>
> >>> Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV Gen 1 VMs.")
> >>> Signed-off-by: Michael Kelley <mhkli...@outlook.com>
> >>> ---
> >>> Changes in v3:
> >>> * Moved definition of FBINFO_KMEMFB flag to a separate patch
> >>>     preceding this one in the patch set [Helge Deller]
> >>> Changes in v2:
> >>> * Tweaked code comments regarding framebuffers allocated with
> >>>     dma_alloc_coherent() [Christoph Hellwig]
> >>>
> >>>    drivers/video/fbdev/core/fb_defio.c | 128 +++++++++++++++++++++++-----
> >>>    1 file changed, 108 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
> >>> index 4fc93f253e06..f8ae91a1c4df 100644
> >>> --- a/drivers/video/fbdev/core/fb_defio.c
> >>> +++ b/drivers/video/fbdev/core/fb_defio.c
> >>> @@ -8,11 +8,40 @@
> >>>     * for more details.
> >>>     */
> >>>
> >>> +/*
> >>> + * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space
> >>> + * to batch user space writes into periodic updates to the underlying
> >>> + * framebuffer hardware or other implementation (such as with a virtualized
> >>> + * framebuffer in a VM). At each batch interval, a callback is invoked in the
> >>> + * framebuffer's kernel driver, and the callback is supplied with a list of
> >>> + * pages that have been modified in the preceding interval. The callback can
> >>> + * use this information to update the framebuffer hardware as necessary. The
> >>> + * batching can improve performance and reduce the overhead of updating the
> >>> + * hardware.
> >>> + *
> >>> + * Defio is supported on framebuffers allocated using vmalloc() and allocated
> >>> + * as contiguous kernel memory using alloc_pages() or kmalloc(). These
> >>> + * memory allocations all have corresponding "struct page"s. Framebuffers
> >>> + * allocated using dma_alloc_coherent() should not be used with defio.
> >>> + * Such allocations should be treated as a black box owned by the DMA
> >>> + * layer, and should not be deconstructed into individual pages as defio
> >>> + * does. Framebuffers in MMIO space are *not* supported because MMIO space
> >>> + * does not have corresponding "struct page"s.
> >>> + *
> >>> + * For framebuffers allocated using vmalloc(), struct fb_info must have
> >>> + * "screen_buffer" set to the vmalloc address of the framebuffer. For
> >>> + * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must
> >>> + * be set, and "fix.smem_start" must be set to the physical address of the
> >>> + * frame buffer. In both cases, "fix.smem_len" must be set to the framebuffer
> >>> + * size in bytes.
> >>> + */
> >>> +
> >>>    #include <linux/module.h>
> >>>    #include <linux/kernel.h>
> >>>    #include <linux/errno.h>
> >>>    #include <linux/string.h>
> >>>    #include <linux/mm.h>
> >>> +#include <linux/pfn_t.h>
> >>>    #include <linux/vmalloc.h>
> >>>    #include <linux/delay.h>
> >>>    #include <linux/interrupt.h>
> >>> @@ -37,7 +66,7 @@ static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long
> >>>           else if (info->fix.smem_start)
> >>>                   page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);
> >>>
> >>> - if (page)
> >>> + if (page && !(info->flags & FBINFO_KMEMFB))
> >>>                   get_page(page);
> >>>
> >>>           return page;
> >>> @@ -137,6 +166,15 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
> >>>
> >>>           BUG_ON(!info->fbdefio->mapping);
> >>>
> >>> + if (info->flags & FBINFO_KMEMFB)
> >>> +         /*
> >>> +          * In this path, the VMA is marked VM_PFNMAP, so mm assumes
> >>> +          * there is no struct page associated with the page. The
> >>> +          * PFN must be directly inserted and the created PTE will be
> >>> +          * marked "special".
> >>> +          */
> >>> +         return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
> >>> +
> >>>           vmf->page = page;
> >>>           return 0;
> >>>    }
> >>> @@ -163,13 +201,14 @@ EXPORT_SYMBOL_GPL(fb_deferred_io_fsync);
> >>>
> >>>    /*
> >>>     * Adds a page to the dirty list. Call this from struct
> >>> - * vm_operations_struct.page_mkwrite.
> >>> + * vm_operations_struct.page_mkwrite or .pfn_mkwrite.
> >>>     */
> >>> -static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset,
> >>> +static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf,
> >>>                                               struct page *page)
> >>>    {
> >>>           struct fb_deferred_io *fbdefio = info->fbdefio;
> >>>           struct fb_deferred_io_pageref *pageref;
> >>> + unsigned long offset = vmf->pgoff << PAGE_SHIFT;
> >>>           vm_fault_t ret;
> >>>
> >>>           /* protect against the workqueue changing the page list */
> >>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
> >>>           }
> >>>
> >>>           /*
> >>> -  * We want the page to remain locked from ->page_mkwrite until
> >>> -  * the PTE is marked dirty to avoid mapping_wrprotect_range()
> >>> -  * being called before the PTE is updated, which would leave
> >>> -  * the page ignored by defio.
> >>> -  * Do this by locking the page here and informing the caller
> >>> -  * about it with VM_FAULT_LOCKED.
> >>> +  * The PTE must be marked writable before the defio deferred work runs
> >>> +  * again and potentially marks the PTE write-protected. If the order
> >>> +  * were reversed, the PTE would become writable without defio
> >>> +  * tracking the page, leaving the page forever ignored by defio.
> >>> +  *
> >>> +  * For vmalloc() framebuffers, the associated struct page is locked
> >>> +  * before releasing the defio lock. mm will later mark the PTE writable
> >>> +  * and release the struct page lock. The struct page lock prevents
> >>> +  * the page from being prematurely marked write-protected.
> >>> +  *
> >>> +  * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
> >>> +  * so the PTE must be marked writable while the defio lock is held.
> >>>            */
> >>> - lock_page(pageref->page);
> >>> + if (info->flags & FBINFO_KMEMFB) {
> >>> +         unsigned long pfn = page_to_pfn(pageref->page);
> >>> +
> >>> +         ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
> >>> +                                        __pfn_to_pfn_t(pfn, PFN_SPECIAL));
> >>
> >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> >> horrible hack.
> >>
> >> In another thread, you mention that you use PFN_SPECIAL to bypass the
> >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> >
> > The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
> > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> > a wrong impression.
> 
> VM_PFNMAP: nothing is refcounted except anon pages
> 
> VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted
> 
> pte_special() is a way for GUP-fast to distinguish these refcounted (can
> GUP) from non-refcounted (cannot GUP) pages mapped by PTEs without any
> locks or the VMA being available.
> 
> Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page"
> (pfn_valid()) is likely very bogus.

OK, good to know.

> 
> > vm_mixed_ok() does a thorough job of validating the
> > use of __vm_insert_mixed(), and since what I did was allowed, I thought
> > perhaps it was OK. Your feedback has set me straight, and that's what I
> > needed. :-)
> 
> What exactly are you trying to achieve? :)
> 
> If it's mapping a page with a "struct page" and *not* refcounting it,
> then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP
> mapping. It will set pte_special() automatically for you.
> 

Yes, that's what I'm using to initially create the special PTE in the
.fault callback.
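
Concretely, the sequence is (a condensed sketch, not the literal patch;
the mmap hunk isn't quoted above):

	/* at mmap time: tell mm not to refcount pages in this VMA */
	vm_flags_set(vma, VM_PFNMAP);

	/*
	 * in .fault: install a read-only, pte_special() mapping. Since
	 * vm_page_prot is write-protected for writenotify, the first
	 * write to the page then faults into .pfn_mkwrite.
	 */
	return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));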

> >
> > But the whole approach is moot with Alistair Popple's patch set that
> > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> > it. If there's not one, I'll take a crack at adding it in the next
> > version of my patch set.
> 
> I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably
> vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe
> :) )

Ok, I'll look at that more closely. The sequence is that the special
PTE gets created with vmf_insert_pfn(). Then when the page is first
written to, the .pfn_mkwrite callback is invoked by mm. The question
is the best way for that callback to mark the existing PTE as writable.
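To make that concrete, what I need is something like this hypothetical
sketch (function name invented; whether vmf_insert_pfn_prot() will
upgrade an already-present read-only PTE rather than just bailing out
is exactly what I need to verify):

	/* .pfn_mkwrite: the page was already mapped read-only by .fault */
	static vm_fault_t kmemfb_pfn_mkwrite(struct vm_fault *vmf)
	{
		struct fb_info *info = vmf->vma->vm_private_data;
		unsigned long pfn = (info->fix.smem_start >> PAGE_SHIFT) +
				    vmf->pgoff;

		/*
		 * ... add the page to the defio dirty list under the
		 * defio lock, as fb_deferred_io_track_page() does ...
		 */

		/*
		 * vm_page_prot is write-protected for writenotify, so
		 * take the writable pgprot from the VMA flags instead.
		 */
		return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn,
					   vm_get_page_prot(vmf->vma->vm_flags));
	}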

Thanks,

Michael
