From: Yijia Wang <[email protected]>

check_file_mmap faults a page in the middle of a file mapping and
expects the readahead window to have populated further pages *after*
it. With a large base page size this is wrong and the test fails
reliably, e.g. on arm64 with 64K pages:

  # mincore_selftest.c:260:check_file_mmap:Expected ra_pages (0) > 0 (0)
  # mincore_selftest.c:261:check_file_mmap:No read-ahead pages found in memory
  not ok 4 global.check_file_mmap

The read-around window is read_ahead_kb / page_size pages wide and is
centred on the faulting page (mm/filemap.c):

  ra->start = max(0, pgoff - ra_pages / 2)

With the default 128K budget that is 32 pages on 4K but only 2 pages on
64K, so the window becomes [pgoff-1, pgoff] and the single extra page
lands *before* the faulted page. The forward-only scan finds nothing.
This is not a huge-page (THP) effect; it is just a byte-sized readahead
budget divided by a large page size plus a fault-centred window.

Count readahead pages on both sides of the faulted page, and skip the
readahead check up front when there is no usable read-around window:
either it is narrower than two pages (readahead disabled, or the base
page size so large that read_ahead_kb covers a single page), or the file
has no backing block device whose read_ahead_kb can be read (e.g. on
tmpfs). When the window is wide enough the EXPECT_GT(ra_pages, 0)
assertion is kept, so a kernel that should read ahead but does not is
still caught.

Signed-off-by: Yijia Wang <[email protected]>
---
How the problem was tracked down, step by step:

 1. On arm64 with 64K base pages, check_file_mmap fails:

      # mincore_selftest.c:260:check_file_mmap:Expected ra_pages (0) > 0 (0)
      # mincore_selftest.c:261:check_file_mmap:No read-ahead pages found in 
memory
      not ok 4 global.check_file_mmap

 2. The test faults the middle page and scans only *forward* for the
    readahead pages, so "ra_pages == 0" means nothing was read in after
    the faulted page.

 3. Readahead is a byte budget (read_ahead_kb, default 128K) turned into
    a page count: ra_pages = read_ahead_kb / page_size. That is 32 pages
    on 4K but only 2 pages on 64K.

 4. The mmap read-around window is centred on the fault:
    ra->start = max(0, pgoff - ra_pages / 2). With 2 pages the window is
    [pgoff-1, pgoff] - the one extra page lands *before* the fault.

 5. Measured directly on the 64K box, varying read_ahead_kb, faulting
    page 32:

      read_ahead_kb=64  (1 page):  resident: 32          (no neighbour)
      read_ahead_kb=128 (2 pages): resident: 31 32        (neighbour before)
      read_ahead_kb=192 (3 pages): resident: 31 32 33     (before and after)

    So the neighbour really is before the fault, and a 1-page window
    brings in no neighbour at all. This is not a THP effect.

 6. A file with no backing block device (e.g. tmpfs) has no
    read_ahead_kb to read and no block-device readahead at all; faulting
    the middle page brings in only that page (before=0 after=0). The fix
    therefore skips when the window cannot be determined, not just when
    it is too narrow.

Open question for the list: which fix do you prefer?

  (a) Move the fault towards the start of the mapping so the centred
      window always leaves a page *after* it, keeping the original
      forward-only scan. Minimal, but the page count after the fault is
      ceil(ra_pages/2) - 1, which is 0 whenever ra_pages <= 2; only a
      fault within the first ra_pages/2 pages (e.g. page 0) has a page
      after it. That changes the test's intent - it would exercise
      readahead at the boundary, not in the interior - and still fails
      when read_ahead_kb yields a 1-page window or on tmpfs.

  (b) Scan both sides of the faulted page (this patch), and skip up front
      when there is no usable read-around window. A little more code, but
      page-size-agnostic and it keeps the assertion whenever a neighbour
      is actually expected.

I went with (b); happy to switch to (a) if that is preferred.

An earlier posting of mine changed check_huge_pages instead - that was a
misdiagnosis (on the 64K box that subtest merely SKIPs and was never the
failure). Please disregard it; this version supersedes it.
 .../selftests/mincore/mincore_selftest.c      | 82 ++++++++++++++++++-
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mincore/mincore_selftest.c 
b/tools/testing/selftests/mincore/mincore_selftest.c
index cdd022c1c..960851286 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -12,6 +12,8 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
 #include <string.h>
 #include <fcntl.h>
 
@@ -174,6 +176,48 @@ TEST(check_huge_pages)
 }
 
 
+/*
+ * Return the size of the mmap read-around window, in pages, for the block
+ * device backing the file referred to by @fd, or -1 if it cannot be
+ * determined. The window size is the device's read_ahead_kb divided by the
+ * page size; the kernel centres this window on the faulting page, so it must
+ * be at least two pages wide for any neighbouring page to be read in.
+ */
+static long readahead_window_pages(int fd, long page_size)
+{
+       char path[64];
+       struct stat st;
+       long ra_kb;
+       FILE *f;
+
+       if (fstat(fd, &st))
+               return -1;
+
+       /*
+        * read_ahead_kb lives in the owning disk's queue/ directory. For a
+        * whole-disk device that is the device's own queue/; for a partition
+        * it is one level up ("..") at the parent disk.
+        */
+       snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/queue/read_ahead_kb",
+                major(st.st_dev), minor(st.st_dev));
+       f = fopen(path, "r");
+       if (!f) {
+               snprintf(path, sizeof(path),
+                        "/sys/dev/block/%u:%u/../queue/read_ahead_kb",
+                        major(st.st_dev), minor(st.st_dev));
+               f = fopen(path, "r");
+               if (!f)
+                       return -1;
+       }
+       if (fscanf(f, "%ld", &ra_kb) != 1) {
+               fclose(f);
+               return -1;
+       }
+       fclose(f);
+
+       return (ra_kb << 10) / page_size;
+}
+
 /*
  * Test mincore() behavior on a file-backed page.
  * No pages should be loaded into memory right after the mapping. Then,
@@ -194,6 +238,7 @@ TEST(check_file_mmap)
        int fd;
        int i;
        int ra_pages = 0;
+       long ra_window;
 
        page_size = sysconf(_SC_PAGESIZE);
        vec_size = FILE_SIZE / page_size;
@@ -224,6 +269,22 @@ TEST(check_file_mmap)
                SKIP(goto out_close, "fallocate not supported by filesystem.");
        }
 
+       /*
+        * mmap read-around brings in a window of pages centred on the
+        * faulting page. Its width is the backing device's read_ahead_kb
+        * divided by the page size. If that window is narrower than two
+        * pages - because readahead is disabled, or the base page size is so
+        * large that read_ahead_kb covers a single page - then no
+        * neighbouring page can ever be read in and the readahead part of
+        * this test does not apply, so skip it. The same goes for a file with
+        * no backing block device (e.g. on tmpfs), where the window cannot be
+        * determined and there is no block-device readahead to exercise.
+        */
+       ra_window = readahead_window_pages(fd, page_size);
+       if (ra_window < 2)
+               SKIP(goto out_close,
+                    "no usable readahead window for this configuration.");
+
        /*
         * Map the whole file, the pages shouldn't be fetched yet.
         */
@@ -242,8 +303,11 @@ TEST(check_file_mmap)
        }
 
        /*
-        * Touch a page in the middle of the mapping. We expect the next
-        * few pages (the readahead window) to be populated too.
+        * Touch a page in the middle of the mapping. We expect the
+        * surrounding pages (the readahead window) to be populated too.
+        * The kernel centres the mmap read-around window on the faulting
+        * page, so with a large base page size the readahead pages may
+        * land before the touched page rather than after it.
         */
        addr[FILE_SIZE / 2] = 1;
        retval = mincore(addr, FILE_SIZE, vec);
@@ -252,11 +316,25 @@ TEST(check_file_mmap)
                TH_LOG("Page not found in memory after use");
        }
 
+       /* Count readahead pages that landed before the touched page. */
+       i = FILE_SIZE / 2 / page_size - 1;
+       while (i >= 0 && vec[i]) {
+               ra_pages++;
+               i--;
+       }
+
+       /* Count readahead pages that landed after the touched page. */
        i = FILE_SIZE / 2 / page_size + 1;
        while (i < vec_size && vec[i]) {
                ra_pages++;
                i++;
        }
+
+       /*
+        * The readahead window is at least two pages wide here (narrow
+        * windows were skipped above), so the kernel must have brought in at
+        * least one neighbouring page on one side of the faulted page.
+        */
        EXPECT_GT(ra_pages, 0) {
                TH_LOG("No read-ahead pages found in memory");
        }
-- 
2.43.0

Reply via email to