The template fast path still leaves the actual copy sequence up to the compiler. Use the streaming-copy helpers introduced in the previous patches for the ZONE_DEVICE template-copy path so common mm code can request a write-once copy primitive without embedding arch-specific store layout in the generic layer.
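The shape of such a write-once copy primitive can be sketched in plain C. This is only an illustrative userspace model: SKETCH_ARCH_HAS_STREAMING_COPY and every sketch_* name are hypothetical, and the real memcpy_streaming()/memcpy_streaming_drain() helpers from the earlier patches may be structured differently. The sketch only shows how generic code can call one entry point and get a plain memcpy() when no architecture backend is present.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical arch hooks: a real backend would define
 * SKETCH_ARCH_HAS_STREAMING_COPY and provide these two functions.
 */
#ifdef SKETCH_ARCH_HAS_STREAMING_COPY
void sketch_arch_memcpy_streaming(void *dst, const void *src, size_t len);
void sketch_arch_memcpy_streaming_drain(void);
#endif

/* Generic entry point: streaming copy if available, else cached memcpy(). */
static inline void sketch_memcpy_streaming(void *dst, const void *src,
					   size_t len)
{
	assert(dst != NULL && src != NULL);
#ifdef SKETCH_ARCH_HAS_STREAMING_COPY
	sketch_arch_memcpy_streaming(dst, src, len);
#else
	memcpy(dst, src, len);
#endif
}

/* Orders earlier streaming stores; nothing to do for the cached fallback. */
static inline void sketch_memcpy_streaming_drain(void)
{
#ifdef SKETCH_ARCH_HAS_STREAMING_COPY
	sketch_arch_memcpy_streaming_drain();
#endif
}
```

Callers in this model never see the store layout; they only pair each batch of copies with a drain before reading or modifying the destination.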
ZONE_DEVICE memmap initialization is a write-once path: each struct page is populated once and is not expected to be reused from cache immediately afterwards. A regular cached copy can therefore incur write-allocate traffic and pollute the cache without much benefit. Using memcpy_streaming() lets this path use an architecture-optimized streaming copy where available, while still degrading to memcpy() on architectures that do not provide a specialized implementation.

Keep pageblock-aligned PFNs on memcpy() so pageblock initialization can immediately read back page metadata without introducing a read-after-streaming dependency. For the remaining PFNs, use memcpy_streaming() so the hot path can avoid write-allocate traffic while still leaving unsupported or unsuitable cases to the fallback implementation.

When the streaming backend uses non-temporal stores, order them before entering memmap_init_compound(), before prep_compound_head() updates the overlapping compound metadata, and before returning from memmap_init_zone_device().

Keep sanitized builds on the slow path so KASAN/KMSAN retain their instrumented stores.

Tested in a VM with a 100 GB fsdax namespace device configured with map=dev and a 100 GB devdax namespace (align=2097152) on an Intel Ice Lake server.

Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times and collect the memmap initialization time from the pr_debug() output of memmap_init_zone_device().

Base (v7.1-rc6):
First binding for nd_pmem driver: 1466 ms
Average of subsequent rebinds: 262.12 ms
First binding for dax_pmem driver: 1430 ms
Average of subsequent rebinds: 229.12 ms

With this series:
First binding for nd_pmem driver: 1359 ms
Average of subsequent rebinds: 108.36 ms
First binding for dax_pmem driver: 1273 ms
Average of subsequent rebinds: 100.17 ms

This reduces the average rebind time by about 58.6% for nd_pmem and 56.3% for dax_pmem.
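The copy selection and ordering described above can be modelled in userspace C. This is a rough sketch under stated assumptions: struct sketch_page, SKETCH_PAGEBLOCK_NR_PAGES (512 is an assumed value, not taken from this patch) and all sketch_* functions are hypothetical stand-ins, and the streaming copy is modelled by its cached fallback rather than by real non-temporal stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed pageblock size in pages; must be a power of two. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL

/* Stand-in for struct page; only pfn_tag varies per PFN here. */
struct sketch_page {
	unsigned long flags;
	unsigned long pfn_tag;
	unsigned char pad[48];
};

struct sketch_stats {
	unsigned long cached;	/* copies done with plain memcpy() */
	unsigned long streamed;	/* copies done via the streaming helper */
};

static bool sketch_pageblock_aligned(unsigned long pfn)
{
	return (pfn & (SKETCH_PAGEBLOCK_NR_PAGES - 1)) == 0;
}

/* Cached fallback standing in for a non-temporal copy. */
static void sketch_streaming_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* A real backend would fence here; the fallback needs no ordering. */
static void sketch_streaming_drain(void)
{
}

static void sketch_init_range(struct sketch_page *map, unsigned long start_pfn,
			      unsigned long nr, struct sketch_page *tmpl,
			      struct sketch_stats *st)
{
	assert(map != NULL && tmpl != NULL && st != NULL);

	for (unsigned long i = 0; i < nr; i++) {
		unsigned long pfn = start_pfn + i;

		/* Update the PFN-dependent field before copying. */
		tmpl->pfn_tag = pfn;

		if (sketch_pageblock_aligned(pfn)) {
			/* Read back right away, so keep it cached. */
			memcpy(&map[i], tmpl, sizeof(*tmpl));
			st->cached++;
		} else {
			sketch_streaming_copy(&map[i], tmpl, sizeof(*tmpl));
			st->streamed++;
		}
	}

	/* Order all streaming copies before the caller uses the map. */
	sketch_streaming_drain();
}
```

With the assumed pageblock size, only one PFN in 512 takes the cached branch, which is why the aligned case can be treated as rare.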
Signed-off-by: Li Zhe <[email protected]>
---
 mm/mm_init.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index ad078ee354fb..fbc873284fb8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1070,11 +1070,21 @@ static void __ref zone_device_page_init_slow(struct page *page,
 
 static inline bool zone_device_page_init_optimization_enabled(void)
 {
+	/*
+	 * Keep sanitized builds on the slow path so their stores stay
+	 * instrumented.
+	 */
+	if (IS_ENABLED(CONFIG_KASAN) || IS_ENABLED(CONFIG_KMSAN))
+		return false;
+
 	/*
 	 * The template fast path copies a preinitialized struct page image.
 	 * Skip it when the page_ref_set tracepoint is enabled.
 	 */
-	return !page_ref_tracepoint_active(page_ref_set);
+	if (page_ref_tracepoint_active(page_ref_set))
+		return false;
+
+	return true;
 }
 
 static inline void zone_device_template_page_init(struct page *template,
@@ -1117,9 +1127,19 @@ static void zone_device_page_init_from_template(struct page *page,
 	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
 	 * page. Update the PFN-dependent fields in place before copying it
 	 * to the destination page.
+	 *
+	 * pageblock-aligned pages immediately feed
+	 * init_pageblock_migratetype(), which reads back page metadata via
+	 * helpers like page_zone(page). Avoid a read-after-streaming
+	 * dependency for these rare pages by using regular cached stores
+	 * instead of non-temporal ones.
 	 */
 	zone_device_page_update_template(template, pfn);
-	memcpy(page, template, sizeof(*page));
+	if (unlikely(pageblock_aligned(pfn)))
+		memcpy(page, template, sizeof(*page));
+	else
+		memcpy_streaming(page, template, sizeof(*page));
+
 	zone_device_page_init_pageblock(page, pfn);
 }
 
@@ -1184,6 +1204,15 @@ static void __ref memmap_init_compound(struct page *head,
 		zone_device_tail_page_init(page, pfn, zone_idx, nid, pgmap,
 					   head, order);
 	}
+
+	/*
+	 * When the template path is enabled, order the preceding tail-page copies
+	 * before prep_compound_head() updates the overlapping compound metadata
+	 * in the first tail-page descriptors. If memcpy_streaming() fell back to
+	 * regular cached stores, memcpy_streaming_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_streaming_drain();
 	prep_compound_head(head, order);
 }
 
@@ -1248,10 +1277,26 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		if (pfns_per_compound == 1)
 			continue;
 
+		/*
+		 * When the template path is enabled, order the preceding head-page copy
+		 * before memmap_init_compound(), which immediately updates compound-head
+		 * metadata. If memcpy_streaming() fell back to regular cached stores,
+		 * memcpy_streaming_drain() may be a no-op.
+		 */
+		if (use_template)
+			memcpy_streaming_drain();
+
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
 				     compound_nr_pages(altmap, pgmap),
 				     use_template);
 	}
 
+	/*
+	 * Ensure any prior template copies are ordered before returning.
+	 * On architectures where memcpy_streaming() used regular cached stores,
+	 * memcpy_streaming_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_streaming_drain();
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		 nr_pages, jiffies_to_msecs(jiffies - start));
 
-- 
2.20.1

