On Tue, Jan 27, 2026 at 09:29:19PM +0200, Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" <[email protected]> > > Hi, > > These patches enable support for userfaultfd in guest_memfd. > They are quite different from the latest posting [1] so I'm restarting the > versioning. As there was a lot of tension around the topic, this is an RFC > to get some feedback and see how we can move forward. > > As the ground work I refactored userfaultfd handling of PTE-based memory types > (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a > folio or getting an existing folio from the page cache. shmem also implements > callbacks that add a folio to the page cache after the data passed in > UFFDIO_COPY was copied and remove the folio from the page cache if page table > update fails. > > In order for guest_memfd to notify userspace about page faults, there are new > VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can > return to inform the page fault handler that it needs to call > handle_userfault() to complete the fault. > > Nikita helped to plumb these new goodies into guest_memfd and provided basic > tests to verify that guest_memfd works with userfaultfd. > > I deliberately left hugetlb out, at least for the most part. > hugetlb handles acquisition of VMA and more importantly establishing of parent > page table entry differently than PTE-based memory types. This is a different > abstraction level than what vm_uffd_ops provides and people objected to > exposing such low level APIs as a part of VMA operations. > > Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I > prefer to delay it until the dust settles after the changes in this set. > > [1] https://lore.kernel.org/all/[email protected] > > Mike Rapoport (Microsoft) (12): > userfaultfd: introduce mfill_copy_folio_locked() helper > userfaultfd: introduce struct mfill_state > userfaultfd: introduce mfill_get_pmd() helper. > userfaultfd: introduce mfill_get_vma() and mfill_put_vma() > userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() > userfaultfd: move vma_can_userfault out of line > userfaultfd: introduce vm_uffd_ops > userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE > userfaultfd: introduce vm_uffd_ops->alloc_folio() > shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops > userfaultfd: mfill_atomic() remove retry logic > mm: introduce VM_FAULT_UFFD_MINOR fault reason > > Nikita Kalyazin (5): > mm: introduce VM_FAULT_UFFD_MISSING fault reason > KVM: guest_memfd: implement userfaultfd minor mode > KVM: guest_memfd: implement userfaultfd missing mode > KVM: selftests: test userfaultfd minor for guest_memfd > KVM: selftests: test userfaultfd missing for guest_memfd > > include/linux/mm.h | 5 + > include/linux/mm_types.h | 15 +- > include/linux/shmem_fs.h | 14 - > include/linux/userfaultfd_k.h | 74 +- > mm/hugetlb.c | 21 + > mm/memory.c | 8 +- > mm/shmem.c | 188 +++-- > mm/userfaultfd.c | 671 ++++++++++-------- > .../testing/selftests/kvm/guest_memfd_test.c | 191 +++++ > virt/kvm/guest_memfd.c | 134 +++- > 10 files changed, 871 insertions(+), 450 deletions(-)
Mike, The idea looks good to me, thanks for this work! Your process on UFFDIO_COPY over anon/shmem is nice to me. If you remember, I used to raise a concern on introducing two new fault retvals only for userfaultfd: https://lore.kernel.org/all/[email protected]/ IMHO they're not only unnecessarily leaking userfaultfd information into fault core definitions, but also cause code duplications. I still think we should avoid them. This time, I've attached a smoke tested patch removing both of them. It's pretty small and it runs all fine with all old/new userfaultfd tests (including gmem ones). Feel free to have a look at the end. I understand you want to avoid adding mnore complexity to this series, if you want I can also prepare such a patch after this series landed to remove the two retvals. I'd still would like to know how you think about it, though, let me know if you have any comments. Note that it may indeed need some perf tests to make sure there's zero overhead after this change. Currently there's still some trivial overheads (e.g. unnecessary folio locks), but IIUC we can even avoid that. Thanks, ===8<=== >From 5379d084494b17281f3e5365104a7edbdbe53759 Mon Sep 17 00:00:00 2001 From: Peter Xu <[email protected]> Date: Tue, 3 Feb 2026 15:07:58 -0500 Subject: [PATCH] mm/userfaultfd: Remove two userfaultfd fault retvals They're not needed when with vm_uffd_ops. We can remove both of them. Actually, another side benefit is drivers do not need to process userfaultfd missing / minor faults anymore in the main fault handler. This patch will make get_folio_noalloc() required for either MISSING or MINOR fault, but that's not a problem, as it should be lightweight and the current only outside-mm user (gmem) will support both anyway. Signed-off-by: Peter Xu <[email protected]> --- include/linux/mm_types.h | 15 +----------- include/linux/userfaultfd_k.h | 2 +- mm/memory.c | 45 +++++++++++++++++++++++++++++------ mm/shmem.c | 12 ---------- virt/kvm/guest_memfd.c | 20 ---------------- 5 files changed, 40 insertions(+), 54 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index a6d32470a78a3..3cc8ae7228860 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1612,10 +1612,6 @@ typedef __bitwise unsigned int vm_fault_t; * fsync() to complete (for synchronous page faults * in DAX) * @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released - * @VM_FAULT_UFFD_MINOR: ->fault did not modify page tables and needs - * handle_userfault(VM_UFFD_MINOR) to complete - * @VM_FAULT_UFFD_MISSING: ->fault did not modify page tables and needs - * handle_userfault(VM_UFFD_MISSING) to complete * @VM_FAULT_HINDEX_MASK: mask HINDEX value * */ @@ -1633,13 +1629,6 @@ enum vm_fault_reason { VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000, VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000, VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000, -#ifdef CONFIG_USERFAULTFD - VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x008000, - VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x010000, -#else - VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x000000, - VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x000000, -#endif VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000, }; @@ -1664,9 +1653,7 @@ enum vm_fault_reason { { VM_FAULT_FALLBACK, "FALLBACK" }, \ { VM_FAULT_DONE_COW, "DONE_COW" }, \ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \ - { VM_FAULT_COMPLETED, "COMPLETED" }, \ - { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }, \ - { VM_FAULT_UFFD_MISSING, "UFFD_MISSING" } + { VM_FAULT_COMPLETED, "COMPLETED" } struct vm_special_mapping { const char *name; /* The name, e.g. "[vdso]". */ diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 75d5b09f2560c..5923e32de53b5 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -85,7 +85,7 @@ struct vm_uffd_ops { /* Checks if a VMA can support userfaultfd */ bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags); /* - * Called to resolve UFFDIO_CONTINUE request. + * Required by any uffd driver for either MISSING or MINOR fault. * Should return the folio found at pgoff in the VMA's pagecache if it * exists or ERR_PTR otherwise. * The returned folio is locked and with reference held. diff --git a/mm/memory.c b/mm/memory.c index 456344938c72b..098febb761acc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5338,6 +5338,33 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return VM_FAULT_OOM; } +static vm_fault_t fault_process_userfaultfd(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = file_inode(vma->vm_file); + /* + * NOTE: we could double check this hook present when + * UFFDIO_REGISTER on MISSING or MINOR for a file driver. + */ + struct folio *folio = + vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff); + + if (!IS_ERR_OR_NULL(folio)) { + /* + * TODO: provide a flag for get_folio_noalloc() to avoid + * locking (or even the extra reference?) + */ + folio_unlock(folio); + folio_put(folio); + if (userfaultfd_minor(vma)) + return handle_userfault(vmf, VM_UFFD_MINOR); + } else { + return handle_userfault(vmf, VM_UFFD_MISSING); + } + + return 0; +} + /* * The mmap_lock must have been held on entry, and may have been * released depending on flags and vma->vm_ops->fault() return value. @@ -5370,16 +5397,20 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return VM_FAULT_OOM; } + /* + * If this is an userfaultfd trap, process it in advance before + * triggering the genuine fault handler. + */ + if (userfaultfd_missing(vma) || userfaultfd_minor(vma)) { + ret = fault_process_userfaultfd(vmf); + if (ret) + return ret; + } + ret = vma->vm_ops->fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY | - VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR | - VM_FAULT_UFFD_MISSING))) { - if (ret & VM_FAULT_UFFD_MINOR) - return handle_userfault(vmf, VM_UFFD_MINOR); - if (ret & VM_FAULT_UFFD_MISSING) - return handle_userfault(vmf, VM_UFFD_MISSING); + VM_FAULT_DONE_COW))) return ret; - } folio = page_folio(vmf->page); if (unlikely(PageHWPoison(vmf->page))) { diff --git a/mm/shmem.c b/mm/shmem.c index eafd7986fc2ec..5286f28b3e443 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2484,13 +2484,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, fault_mm = vma ? vma->vm_mm : NULL; folio = filemap_get_entry(inode->i_mapping, index); - if (folio && vma && userfaultfd_minor(vma)) { - if (!xa_is_value(folio)) - folio_put(folio); - *fault_type = VM_FAULT_UFFD_MINOR; - return 0; - } - if (xa_is_value(folio)) { error = shmem_swapin_folio(inode, index, &folio, sgp, gfp, vma, fault_type); @@ -2535,11 +2528,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, * Fast cache lookup and swap lookup did not find it: allocate. */ - if (vma && userfaultfd_missing(vma)) { - *fault_type = VM_FAULT_UFFD_MISSING; - return 0; - } - /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */ orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false); if (orders > 0) { diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 14cca057fc0ec..bd0de685f42f8 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -421,26 +421,6 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf) folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff, FGP_LOCK | FGP_ACCESSED, 0); - if (userfaultfd_armed(vmf->vma)) { - /* - * If userfaultfd is registered in minor mode and a folio - * exists, return VM_FAULT_UFFD_MINOR to trigger the - * userfaultfd handler. - */ - if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) { - ret = VM_FAULT_UFFD_MINOR; - goto out_folio; - } - - /* - * Check if userfaultfd is registered in missing mode. If so, - * check if a folio exists in the page cache. If not, return - * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler. - */ - if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio)) - return VM_FAULT_UFFD_MISSING; - } - /* folio not in the pagecache, try to allocate */ if (IS_ERR(folio)) folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff); -- 2.50.1 -- Peter Xu

