From: David Hildenbrand <da...@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM
>
> On 03.06.25 03:49, Michael Kelley wrote:
> > From: David Hildenbrand <da...@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
> >>
> >> On 23.05.25 18:15, mhkelle...@gmail.com wrote:
> >>> From: Michael Kelley <mhkli...@outlook.com>
> >>>
> >>> Current defio code works only for framebuffer memory that is allocated
> >>> with vmalloc(). The code assumes that the underlying page refcount can
> >>> be used by the mm subsystem to manage each framebuffer page's lifecycle,
> >>> including freeing the page if the refcount goes to 0. This approach is
> >>> consistent with vmalloc'ed memory, but not with contiguous kernel memory
> >>> allocated via alloc_pages() or similar. Such memory pages usually have
> >>> a refcount of 0 when allocated, and would be incorrectly freed
> >>> page-by-page if used with defio. That freeing corrupts the memory free
> >>> lists and Linux eventually panics. Simply bumping the refcount after
> >>> allocation doesn't work because when the framebuffer memory is freed,
> >>> __free_pages() complains about non-zero refcounts.
> >>>
> >>> Commit 37b4837959cb ("video: deferred io with physically contiguous
> >>> memory") from 2008 purported to add support for contiguous kernel
> >>> memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
> >>> dma_alloc_coherent() to allocate framebuffer memory, which is likely to
> >>> use alloc_pages(). It's unclear to me how this commit actually worked
> >>> at the time, unless dma_alloc_coherent() was pulling from a CMA pool
> >>> instead of alloc_pages(), or perhaps alloc_pages() worked differently
> >>> on the arm32 architecture on which sh_mobile_lcdcfb is used.
> >>>
> >>> In any case, for x86 and arm64 today, commit 37b4837959cb is not
> >>> sufficient to support contiguous kernel memory framebuffers. The problem
> >>> can be seen with the hyperv_fb driver, which may allocate the framebuffer
> >>> memory using vmalloc() or alloc_pages(), depending on the configuration
> >>> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.
> >>>
> >>> Fix this limitation by adding defio support for contiguous kernel memory
> >>> framebuffers. A driver with a framebuffer allocated from contiguous
> >>> kernel memory must set the FBINFO_KMEMFB flag to indicate such.
> >>>
> >>> Tested with the hyperv_fb driver in both configurations -- with a vmalloc()
> >>> framebuffer and with an alloc_pages() framebuffer on x86. Also verified a
> >>> vmalloc() framebuffer on arm64. Hardware is not available to me to verify
> >>> that the older arm32 devices still work correctly, but the path for
> >>> vmalloc() framebuffers is essentially unchanged.
> >>>
> >>> Even with these changes, defio does not support framebuffers in MMIO
> >>> space, as defio code depends on framebuffer memory pages having
> >>> corresponding 'struct page's.
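
(As an aside for anyone following along: the failure mode is roughly the
loose sketch below. This is not code from the patch, and the order-4
allocation is just an example.)

        /*
         * A non-compound higher-order allocation has a refcount only on
         * its first page; the other pages start at refcount 0.
         */
        struct page *pages = alloc_pages(GFP_KERNEL, 4); /* 16 contiguous pages */

        get_page(&pages[1]);    /* defio fault path: refcount 0 -> 1 */
        put_page(&pages[1]);    /* unmap: refcount 1 -> 0, so this one page
                                 * is returned to the buddy free lists even
                                 * though it still belongs to the order-4
                                 * block; the driver's eventual
                                 * __free_pages() of the whole block then
                                 * corrupts the free lists */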
> >>>
> >>> Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for
> >>> fb on HyperV Gen 1 VMs.")
> >>> Signed-off-by: Michael Kelley <mhkli...@outlook.com>
> >>> ---
> >>> Changes in v3:
> >>> * Moved definition of FBINFO_KMEMFB flag to a separate patch
> >>>   preceding this one in the patch set [Helge Deller]
> >>> Changes in v2:
> >>> * Tweaked code comments regarding framebuffers allocated with
> >>>   dma_alloc_coherent() [Christoph Hellwig]
> >>>
> >>>  drivers/video/fbdev/core/fb_defio.c | 128 +++++++++++++++++++++++-----
> >>>  1 file changed, 108 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
> >>> index 4fc93f253e06..f8ae91a1c4df 100644
> >>> --- a/drivers/video/fbdev/core/fb_defio.c
> >>> +++ b/drivers/video/fbdev/core/fb_defio.c
> >>> @@ -8,11 +8,40 @@
> >>>   * for more details.
> >>>   */
> >>>
> >>> +/*
> >>> + * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space
> >>> + * to batch user space writes into periodic updates to the underlying
> >>> + * framebuffer hardware or other implementation (such as with a virtualized
> >>> + * framebuffer in a VM). At each batch interval, a callback is invoked in the
> >>> + * framebuffer's kernel driver, and the callback is supplied with a list of
> >>> + * pages that have been modified in the preceding interval. The callback can
> >>> + * use this information to update the framebuffer hardware as necessary. The
> >>> + * batching can improve performance and reduce the overhead of updating the
> >>> + * hardware.
> >>> + *
> >>> + * Defio is supported on framebuffers allocated using vmalloc() and allocated
> >>> + * as contiguous kernel memory using alloc_pages() or kmalloc(). These
> >>> + * memory allocations all have corresponding "struct page"s. Framebuffers
> >>> + * allocated using dma_alloc_coherent() should not be used with defio.
> >>> + * Such allocations should be treated as a black box owned by the DMA
> >>> + * layer, and should not be deconstructed into individual pages as defio
> >>> + * does. Framebuffers in MMIO space are *not* supported because MMIO space
> >>> + * does not have corresponding "struct page"s.
> >>> + *
> >>> + * For framebuffers allocated using vmalloc(), struct fb_info must have
> >>> + * "screen_buffer" set to the vmalloc address of the framebuffer. For
> >>> + * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must
> >>> + * be set, and "fix.smem_start" must be set to the physical address of the
> >>> + * frame buffer. In both cases, "fix.smem_len" must be set to the framebuffer
> >>> + * size in bytes.
> >>> + */
> >>> +
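
(To make the two supported configurations concrete, a driver would set up
struct fb_info roughly as follows. This is only a sketch: "info" is the
driver's fb_info, "page" is a struct page * from alloc_pages(), "fb_size"
is the framebuffer size in bytes, and error handling is omitted.)

        /* Case 1: vmalloc() framebuffer */
        info->screen_buffer = vmalloc(fb_size);    /* kernel virtual address */
        info->fix.smem_len = fb_size;

        /* Case 2: contiguous kernel memory framebuffer */
        page = alloc_pages(GFP_KERNEL, get_order(fb_size));
        info->flags |= FBINFO_KMEMFB;
        info->fix.smem_start = page_to_phys(page); /* physical address */
        info->fix.smem_len = fb_size;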
> >>>  #include <linux/module.h>
> >>>  #include <linux/kernel.h>
> >>>  #include <linux/errno.h>
> >>>  #include <linux/string.h>
> >>>  #include <linux/mm.h>
> >>> +#include <linux/pfn_t.h>
> >>>  #include <linux/vmalloc.h>
> >>>  #include <linux/delay.h>
> >>>  #include <linux/interrupt.h>
> >>> @@ -37,7 +66,7 @@ static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long
> >>>          else if (info->fix.smem_start)
> >>>                  page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);
> >>>
> >>> -        if (page)
> >>> +        if (page && !(info->flags & FBINFO_KMEMFB))
> >>>                  get_page(page);
> >>>
> >>>          return page;
> >>> @@ -137,6 +166,15 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
> >>>
> >>>          BUG_ON(!info->fbdefio->mapping);
> >>>
> >>> +        if (info->flags & FBINFO_KMEMFB)
> >>> +                /*
> >>> +                 * In this path, the VMA is marked VM_PFNMAP, so mm assumes
> >>> +                 * there is no struct page associated with the page. The
> >>> +                 * PFN must be directly inserted and the created PTE will be
> >>> +                 * marked "special".
> >>> +                 */
> >>> +                return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
> >>> +
> >>>          vmf->page = page;
> >>>          return 0;
> >>>  }
> >>> @@ -163,13 +201,14 @@ EXPORT_SYMBOL_GPL(fb_deferred_io_fsync);
> >>>
> >>>  /*
> >>>   * Adds a page to the dirty list. Call this from struct
> >>> - * vm_operations_struct.page_mkwrite.
> >>> + * vm_operations_struct.page_mkwrite or .pfn_mkwrite.
> >>>   */
> >>> -static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset,
> >>> +static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf,
> >>>                                              struct page *page)
> >>>  {
> >>>          struct fb_deferred_io *fbdefio = info->fbdefio;
> >>>          struct fb_deferred_io_pageref *pageref;
> >>> +        unsigned long offset = vmf->pgoff << PAGE_SHIFT;
> >>>          vm_fault_t ret;
> >>>
> >>>          /* protect against the workqueue changing the page list */
> >>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
> >>>          }
> >>>
> >>>          /*
> >>> -         * We want the page to remain locked from ->page_mkwrite until
> >>> -         * the PTE is marked dirty to avoid mapping_wrprotect_range()
> >>> -         * being called before the PTE is updated, which would leave
> >>> -         * the page ignored by defio.
> >>> -         * Do this by locking the page here and informing the caller
> >>> -         * about it with VM_FAULT_LOCKED.
> >>> +         * The PTE must be marked writable before the defio deferred work runs
> >>> +         * again and potentially marks the PTE write-protected. If the order
> >>> +         * were switched, the PTE would become writable without defio
> >>> +         * tracking the page, leaving the page forever ignored by defio.
> >>> +         *
> >>> +         * For vmalloc() framebuffers, the associated struct page is locked
> >>> +         * before releasing the defio lock. mm will later mark the PTE writable
> >>> +         * and release the struct page lock. The struct page lock prevents
> >>> +         * the page from being prematurely marked write-protected.
> >>> +         *
> >>> +         * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
> >>> +         * so the PTE must be marked writable while the defio lock is held.
> >>>          */
> >>> -        lock_page(pageref->page);
> >>> +        if (info->flags & FBINFO_KMEMFB) {
> >>> +                unsigned long pfn = page_to_pfn(pageref->page);
> >>> +
> >>> +                ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
> >>> +                                               __pfn_to_pfn_t(pfn, PFN_SPECIAL));
> >>
> >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> >> horrible hack.
> >>
> >> In another thread, you mention that you use PFN_SPECIAL to bypass the
> >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> >
> > The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like
> > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> > a wrong impression.
>
> VM_PFNMAP: nothing is refcounted except anon pages
>
> VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted
>
> pte_special() is a way for GUP-fast to distinguish these refcounted (can
> GUP) from non-refcounted (cannot GUP) pages mapped by PTEs without any
> locks or the VMA being available.
>
> Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page"
> (pfn_valid()) is likely very bogus.

OK, good to know.
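
For reference, the way the VMA ends up with VM_PFNMAP is roughly the
following. This is a simplified sketch of the defio mmap path, not the
literal diff from my patch; fb_deferred_io_vm_ops is the existing defio
vm_operations_struct:

        static int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma)
        {
                vma->vm_ops = &fb_deferred_io_vm_ops;
                vma->vm_private_data = info;
                if (info->flags & FBINFO_KMEMFB)
                        /*
                         * No refcounting of framebuffer pages: .fault
                         * inserts the raw PFN with vmf_insert_pfn(), which
                         * requires VM_PFNMAP and creates a "special" PTE.
                         */
                        vm_flags_set(vma, VM_PFNMAP);
                return 0;
        }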
> > vm_mixed_ok() does a thorough job of validating the
> > use of __vm_insert_mixed(), and since what I did was allowed, I thought
> > perhaps it was OK. Your feedback has set me straight, and that's what I
> > needed. :-)
>
> What exactly are you trying to achieve? :)
>
> If it's mapping a page with a "struct page" and *not* refcounting it,
> then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP
> mapping. It will set pte_special() automatically for you.

Yes, that's what I'm using to initially create the special PTE in the
.fault callback.

> > But the whole approach is moot with Alistair Popple's patch set that
> > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> > it. If there's not one, I'll take a crack at adding it in the next
> > version of my patch set.
>
> I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably
> vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot.
> (maybe :) )

Ok, I'll look at that more closely. The sequence is that the special PTE
gets created with vmf_insert_pfn(). Then when the page is first written
to, the .pfn_mkwrite callback is invoked by mm. The question is the best
way for that callback to mark the existing PTE as writable.
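To make the question concrete, the caller side I have in mind looks
something like the sketch below. vmf_insert_pfn_mkwrite() is hypothetical
here -- it's the helper that would have to be added -- and the dirty-page
tracking is elided:

        static vm_fault_t fb_deferred_io_pfn_mkwrite(struct vm_fault *vmf)
        {
                struct fb_info *info = vmf->vma->vm_private_data;
                unsigned long pfn = (info->fix.smem_start >> PAGE_SHIFT) + vmf->pgoff;

                /*
                 * ... add the page to the defio dirty list while holding
                 * the defio lock, per the ordering described above ...
                 */

                /* upgrade the existing special read-only PTE to writable */
                return vmf_insert_pfn_mkwrite(vmf->vma, vmf->address, pfn);
        }

Thanks,

Michael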