From: "Kiryl Shutsemau (Meta)" <[email protected]>

This series adds userfaultfd support for tracking the working set of VM
guest memory, so a VMM can identify hot pages and reclaim cold ones to
tiered or remote storage.

v1: https://lore.kernel.org/all/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v3: https://lore.kernel.org/all/[email protected]/
v4: https://lore.kernel.org/all/[email protected]/
v5: https://lore.kernel.org/all/[email protected]/
v6: https://lore.kernel.org/all/[email protected]/

== Changes since v6 ==

 - Rebased onto v7.2-rc1. v6 was stacked on the separate
   "userfaultfd/pagemap: pre-existing fixes" series; those fixes have
   since landed, so v7 applies directly on v7.2-rc1 with no out-of-tree
   dependency. The only rebase adaptation is that the accessor-rename
   patch now also covers two call sites in remove_migration_pmd() that
   appeared in v7.2-rc1.

 - Addressed Lorenzo Stoakes' review: the two VM_UFFD_RWP single-flag
   checks -- userfaultfd_rwp() and gup_can_follow_protnone() -- use
   vma_test_single_mask() instead of vma_test_any_mask().

 - Reworked the working-set documentation and the PAGEMAP_SCAN selftest
   around hot-page detection (PAGE_IS_ACCESSED, non-inverted) rather
   than a cold scan. A cold scan (inverted PAGE_IS_ACCESSED) cannot see
   file pages that are present in the page cache but not mapped, so it
   misreports never-faulted regions of a pre-populated file as cold.
   Tracking the hot set and reclaiming everything else from the backing
   file is the correct model and is now what the docs and tests show.

 - Documented a related file-THP limitation: RWP state is PTE-granular,
   but a file mapping faulted in as a PMD-level THP loses its RWP marks
   when the PMD is split -- split_huge_pmd() clears the file PMD via
   pmdp_huge_clear_flush() without redistributing the uffd marker to
   the PTEs (there is no PMD-level RWP marker), so the range silently
   reverts to untracked. Cold-page tracking over a transparently-huge
   file mapping is therefore unreliable; the selftest opts out of THP
   (MADV_NOHUGEPAGE) on the non-hugetlb backings, and the documented
   hot-set-plus-file-reclaim model sidesteps it. hugetlb is unaffected
   (its mappings are not split by split_huge_pmd()).

 - Collected review tags picked up during v6.

113/113 of tools/testing/selftests/mm/uffd-unit-tests pass on v7.2-rc1
(46 RWP cases plus the existing UFFD groups, no regressions).

== Problem ==

A VMM managing guest memory needs to:

  1. detect which pages are still being touched (working-set tracking);
  2. safely reclaim cold pages to slower tiered or remote storage;
  3. fetch them back on demand when accessed again.

== Approach ==

UFFDIO_REGISTER_MODE_RWP is a new userfaultfd registration mode, in
parallel with the existing MODE_MISSING / MODE_WP / MODE_MINOR. It uses
the same mechanism on every backing -- anon, shmem, hugetlbfs:

 - PAGE_NONE on the PTE (the same primitive NUMA balancing uses) makes
   the page inaccessible while keeping it resident;

 - the uffd PTE bit (the one MODE_WP already owns) marks the entry as
   "userfaultfd-tracked" so the protnone fault path can tell an RWP
   fault apart from an mprotect(PROT_NONE) or NUMA hinting fault.

VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA, so the same
PTE bit safely carries both meanings depending on the registered VMA
flag.

In sync mode, the kernel delivers a UFFD_PAGEFAULT_FLAG_RWP message to
the registered handler, and the handler resolves the fault with
UFFDIO_RWPROTECT clearing MODE_RWP.

In async mode (UFFD_FEATURE_RWP_ASYNC), the fault is auto-resolved
in-place: the kernel restores the original PTE permissions and the
faulting thread continues without a userfaultfd message ever being
delivered. Userspace then learns which pages were touched during the
cycle by reading PAGE_IS_ACCESSED out of PAGEMAP_SCAN -- that set is
the working set; everything else is a reclaim candidate.

UFFDIO_RWPROTECT is the protect/unprotect ioctl, mirroring
UFFDIO_WRITEPROTECT. UFFDIO_SET_MODE flips RWP_ASYNC <-> sync at
runtime under mmap_write_lock() + vma_start_write(), so a VMM can run
in async mode for detection and switch to sync for race-free reclaim
without re-registering the userfaultfd.
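
As an illustration of "everything else is a reclaim candidate", the
sketch below derives the cold ranges as the complement of the hot set
within a tracked window. It is not code from this series: struct
addr_range and cold_ranges() are hypothetical names, and the sketch
assumes the hot set has already been collected into ranges sorted by
start address and non-overlapping (for instance, converted from the
regions a PAGEMAP_SCAN call returned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a half-open byte range [start, end). */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Compute the parts of [start, end) that are not covered by 'hot'.
 *
 * Assumption: 'hot' is sorted by start address and its entries do not
 * overlap. Entries outside the window are clipped or skipped.
 *
 * Returns the number of cold ranges found; only the first 'max_cold'
 * are stored in 'cold', so a return value larger than 'max_cold' means
 * the output buffer was too small.
 */
size_t cold_ranges(uint64_t start, uint64_t end,
		   const struct addr_range *hot, size_t nr_hot,
		   struct addr_range *cold, size_t max_cold)
{
	uint64_t cursor = start;	/* first byte not yet classified */
	size_t n = 0;

	assert(start <= end);

	for (size_t i = 0; i < nr_hot; i++) {
		/* Clip the hot range to the window. */
		uint64_t hs = hot[i].start < start ? start : hot[i].start;
		uint64_t he = hot[i].end > end ? end : hot[i].end;

		if (hs >= he)
			continue;	/* nothing left after clipping */

		/* The gap between the cursor and this hot range is cold. */
		if (hs > cursor) {
			if (n < max_cold)
				cold[n] = (struct addr_range){ cursor, hs };
			n++;
		}
		if (he > cursor)
			cursor = he;
	}

	/* Tail of the window after the last hot range. */
	if (cursor < end) {
		if (n < max_cold)
			cold[n] = (struct addr_range){ cursor, end };
		n++;
	}
	return n;
}
```

With a window of [0x1000, 0x8000) and hot ranges [0x2000, 0x3000) and
[0x5000, 0x7000), this yields three cold ranges: [0x1000, 0x2000),
[0x3000, 0x5000) and [0x7000, 0x8000). Each would then be handed to
whatever reclaim primitive the VMM uses for that backing.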

== Typical VMM workflow ==

  /* arm */
  UFFDIO_API(features = RWP | RWP_ASYNC)
  UFFDIO_REGISTER(MODE_RWP)

  /* detection cycle (async) */
  UFFDIO_RWPROTECT(range, RWP)
  sleep(interval)

  /* freeze the snapshot before scanning */
  UFFDIO_SET_MODE(disable = RWP_ASYNC)        /* sync */
  PAGEMAP_SCAN(PAGE_IS_ACCESSED)              -> hot pages (working set)

  /* reclaim everything not in the hot set from the backing file */
  fallocate(FALLOC_FL_PUNCH_HOLE, non-hot)    /* or pwrite to remote */

  UFFDIO_SET_MODE(enable = RWP_ASYNC)         /* resume */

== Series layout ==

Patches 1 to 3 are preparatory:

  1:   decouple protnone helpers from CONFIG_NUMA_BALANCING.
  2-3: rename _PAGE_BIT_UFFD_WP, pte_uffd_wp() and friends to drop the
       _WP suffix, since the bit now carries WP and RWP meaning
       depending on the VMA flag. The SCAN_PTE_UFFD enum's ftrace
       output string is intentionally kept as "pte_uffd_wp" so
       trace-based tooling does not silently break.

Patch 4 switches the uffd VMA-flag helpers to the vma_flags_t accessors
(vma_test_*_mask), so the VMA_UFFD_* masks are the single place that
knows which modes the build offers.

Patches 5 to 8 add the in-kernel mechanism:

  5: VM_UFFD_RWP VMA flag (aliased to VM_NONE until patch 9 introduces
     CONFIG_USERFAULTFD_RWP together with the UAPI).
  6: MM_CP_UFFD_RWP change_protection() primitive (PAGE_NONE + uffd
     bit, plus a RESOLVE counterpart).
  7: marker preservation across swap, device-exclusive, migration,
     fork, mremap, UFFDIO_MOVE, hugetlb copy, and mprotect().
  8: handle VM_UFFD_RWP in khugepaged, rmap, and GUP.

Patches 9 to 13 wire the userspace surface:

   9: UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
      (introduces CONFIG_USERFAULTFD_RWP).
  10: RWP fault delivery and exposure of UFFDIO_REGISTER_MODE_RWP.
  11: PAGE_IS_ACCESSED in PAGEMAP_SCAN.
  12: UFFD_FEATURE_RWP_ASYNC for async fault resolution.
  13: UFFDIO_SET_MODE for runtime sync/async toggle.

Patches 14 and 15 are kernel tests and Documentation/.

The matching man-pages series is already upstream.

Kiryl Shutsemau (Meta) (15):
  mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
  mm: rename uffd-wp PTE bit macros to uffd
  mm: rename uffd-wp PTE accessors to uffd
  userfaultfd: test uffd VMA flags through the vma_flags_t API
  mm: add VM_UFFD_RWP VMA flag
  mm: add MM_CP_UFFD_RWP change_protection() flag
  mm: preserve RWP marker across PTE rewrites
  mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP
  userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing
  mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP
  mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
  userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution
  userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
  selftests/mm: add userfaultfd RWP tests
  Documentation/userfaultfd: document RWP working set tracking

 Documentation/admin-guide/mm/pagemap.rst     |  13 +-
 Documentation/admin-guide/mm/userfaultfd.rst | 269 ++++++-
 Documentation/filesystems/proc.rst           |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/include/asm/pgtable-prot.h        |   8 +-
 arch/arm64/include/asm/pgtable.h             |  47 +-
 arch/loongarch/Kconfig                       |   1 +
 arch/loongarch/include/asm/pgtable.h         |   4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |   8 +-
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 arch/riscv/Kconfig                           |   1 +
 arch/riscv/include/asm/pgtable-bits.h        |  12 +-
 arch/riscv/include/asm/pgtable.h             |  59 +-
 arch/s390/Kconfig                            |   1 +
 arch/s390/include/asm/hugetlb.h              |  12 +-
 arch/s390/include/asm/pgtable.h              |   4 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  56 +-
 arch/x86/include/asm/pgtable_types.h         |  16 +-
 fs/proc/task_mmu.c                           |  98 ++-
 include/asm-generic/hugetlb.h                |  18 +-
 include/asm-generic/pgtable_uffd.h           |  32 +-
 include/linux/huge_mm.h                      |   7 +
 include/linux/leafops.h                      |   4 +-
 include/linux/mm.h                           |  65 +-
 include/linux/mm_inline.h                    |   4 +-
 include/linux/pgtable.h                      |  32 +-
 include/linux/swapops.h                      |   4 +-
 include/linux/userfaultfd_k.h                |  89 ++-
 include/trace/events/huge_memory.h           |   2 +-
 include/trace/events/mmflags.h               |   7 +
 include/uapi/linux/fs.h                      |   1 +
 include/uapi/linux/userfaultfd.h             |  54 +-
 init/Kconfig                                 |   8 +
 mm/Kconfig                                   |   9 +
 mm/debug_vm_pgtable.c                        |   4 +-
 mm/huge_memory.c                             | 161 ++--
 mm/hugetlb.c                                 | 158 +++-
 mm/internal.h                                |   4 +-
 mm/khugepaged.c                              |  40 +-
 mm/memory.c                                  | 135 +++-
 mm/migrate.c                                 |  20 +-
 mm/migrate_device.c                          |   8 +-
 mm/mprotect.c                                |  70 +-
 mm/mremap.c                                  |  17 +-
 mm/page_table_check.c                        |   8 +-
 mm/rmap.c                                    |  18 +-
 mm/swapfile.c                                |   9 +-
 mm/userfaultfd.c                             | 387 ++++++++-
 tools/include/uapi/linux/fs.h                |   1 +
 tools/testing/selftests/mm/uffd-unit-tests.c | 781 +++++++++++++++++++
 51 files changed, 2333 insertions(+), 437 deletions(-)

base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
-- 
2.54.0

