On Wed, Apr 29, 2026 at 11:29:12AM -0400, Zi Yan wrote:
>This check ensures the correctness of read-only PMD folio collapse
>after it is enabled for all FSes supporting PMD pagecache folios and
>replaces READ_ONLY_THP_FOR_FS.
>
>READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
>and inode->i_writecount to prevent any write to read-only to-be-collapsed
>folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
>aforementioned mechanism will go away too. To ensure khugepaged functions
>as expected after the changes, skip the collapse if any folio is dirty
>after try_to_unmap(), since a dirty folio at that point means the
>read-only folio could still receive writes between try_to_unmap() and
>try_to_unmap_flush() via cached TLB entries, and khugepaged does not yet
>support collapsing writable pagecache folios.
>
>Signed-off-by: Zi Yan <[email protected]>
>Reviewed-by: Baolin Wang <[email protected]>
>Acked-by: David Hildenbrand (Arm) <[email protected]>
>---
> mm/khugepaged.c | 28 ++++++++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 6808f2b48d864..71209a72195ab 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -2327,8 +2327,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>                               }
>                       } else if (folio_test_dirty(folio)) {
>                               /*
>-                               * khugepaged only works on read-only fd,
>-                               * so this page is dirty because it hasn't
>+                               * This page is dirty because it hasn't
>                                * been flushed since first write. There
>                                * won't be new dirty pages.
>                                *
>@@ -2386,8 +2385,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>               if (!is_shmem && (folio_test_dirty(folio) ||
>                                 folio_test_writeback(folio))) {
>                       /*
>-                       * khugepaged only works on read-only fd, so this
>-                       * folio is dirty because it hasn't been flushed
>+                       * khugepaged only works on clean file-backed folios,
>+                       * so this folio is dirty because it hasn't been flushed
>                        * since first write.
>                        */
>                       result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>@@ -2431,6 +2430,27 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>                       goto out_unlock;
>               }
> 
>+              /*
>+               * At this point, the folio is locked and unmapped. If the PTE
>+               * was dirty, try_to_unmap() has transferred the dirty bit to
>+               * the folio and we must not collapse it into a clean
>+               * file-backed folio.
>+               *
>+               * If the folio is clean here, no one can write it until we
>+               * drop the folio lock. A write through a stale TLB entry came
>+               * from a clean PTE and must fault because the PTE has been
>+               * cleared; the fault path has to take the folio lock before

Yeah, try_to_unmap_one() already documents the arch guarantee required
for a clean cached TLB entry after the PTE is cleared:

                        /*
                         * We clear the PTE but do not flush so potentially
                         * a remote CPU could still be writing to the folio.
                         * If the entry was previously clean then the
                         * architecture must guarantee that a clear->dirty
                         * transition on a cached TLB entry is written through
                         * and traps if the PTE is unmapped.
                         */

Lesson learned :)

>+               * installing a writable mapping. Buffered write paths also
>+               * have to take the folio lock before modifying file contents
>+               * without a mapping, typically via write_begin_get_folio().
>+               */
>+              if (!is_shmem && folio_test_dirty(folio)) {
>+                      result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>+                      xas_unlock_irq(&xas);
>+                      folio_putback_lru(folio);
>+                      goto out_unlock;
>+              }

LGTM.
Reviewed-by: Lance Yang <[email protected]>
