On 6/5/26 18:14, Nico Pache wrote:
> Pass an order to collapse_huge_page to support collapsing anon memory to
> arbitrary orders within a PMD. order indicates what mTHP size we are
> attempting to collapse to.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> To properly establish the page table hierarchy without violating any
> expectations from certain architectures (e.g. MIPS), we must make sure to
> have the PMD reinstalled before the PTEs, and hold both PTE/PMD locks
> before calling update_mmu_cache_range() (if they are distinct locks).
>
> Signed-off-by: Nico Pache <[email protected]>
> ---
[...]
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + VM_WARN_ON_ONCE(!pmd_none(*pmd));
> + if (is_pmd_order(order)) {
> + pgtable = pmd_pgtable(_pmd);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> + } else {
> + /*
> + * Some architectures (e.g. MIPS) walk the live page table in
> + * their implementation. update_mmu_cache_range() must be called
> + * with a valid page table hierarchy and the PTE lock held.
> + * Acquire it nested inside pmd_ptl when they are distinct
> locks.
> + */
> + if (pte_ptl != pmd_ptl)
> + spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING);
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> + /*uffd_wp=*/ false);
> + if (pte_ptl != pmd_ptl)
> + spin_unlock(pte_ptl);
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
> + if (pte)
> + pte_unmap(pte);
We re-enable some page table walkers before we unmap the PTE.
We still hold the mmap lock in write mode, so nothing would currently try
reclaiming the page table concurrently.
So I guess this works right now, but we should likely rework that code later to
either revert both statements. Or maybe we can simply unmap like we did, and
simply remap before we call map_anon_folio_pte_nopf()? Remapping should not
fail.
Alternatively to an unmap+remap, I think we could also unmap earlier for PMD
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6de935e76ceb..ba2a2508dda6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1378,6 +1378,8 @@ static enum scan_result collapse_huge_page(struct
mm_struct *mm, unsigned long s
if (is_pmd_order(order)) {
anon_vma_unlock_write(vma->anon_vma);
anon_vma_locked = false;
+ pte_unmap(pte);
+ pte = NULL;
}
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
But this can also be handled later.
We now hold an anon_vma lock a bit longer for !pmd-collapse. But there is also
less to copy. If that bites us, we can try optimizing later.
So after another skim, I think this patch is ready for primetime. We can address
the things mentioned above later ... and any fallout can be fixed later, if any.
Acked-by: David Hildenbrand (Arm) <[email protected]>
--
Cheers,
David