On 14 May 2025, at 7:15, David Hildenbrand wrote: > A long-term goal is supporting frozen PageOffline pages, and later > PageOffline pages that don't have a refcount at all. Some more work for > that is needed -- in particular around non-folio page migration and > memory ballooning drivers -- but let's start by handling PageOffline pages > that can be skipped during memory offlining differently. > > Let's introduce a PageOffline specific page flag (PG_offline_skippable) > that for now reuses PG_owner_2. In the memdesc future, it will be one of > a small number of per-memdesc flags stored alongside the type. > > By setting PG_offline_skippable, a driver indicates that it can > restore the PageOffline state of these specific pages when re-onlining a > memory block: it knows that these pages are supposed to be PageOffline() > without the information in the vmemmap, so it can filter them out and > not expose them to the buddy -> they stay PageOffline(). > > While PG_offline_offlineable might be clearer, it is also super > confusing. Alternatives (PG_offline_sticky?) also don't quite feel right. > So let's use "skippable" for now. > > The flag is not supposed to be used for movable PageOffline pages as > used for balloon compaction; movable PageOffline() pages can simply be > migrated during the memory offlining stage. > > Let's convert the single user from our MEM_GOING_OFFLINE approach > to the new PG_offline_skippable approach: virtio-mem. Fortunately, > this simplifies the code quite a lot. > > What if someone decides to grab a reference on these pages although they > really shouldn't? After all, we'll now keep the refcount at 1 (until we > can properly stop using the refcount completely). > > Well, less worse things will happen than would currently: currently, > if someone would grab a reference to these pages, in MEM_GOING_OFFLINE > we would run into the > if (WARN_ON(!page_ref_dec_and_test(page))) > dump_page(page, "fake-offline page referenced"); > > And once that unexpected reference would get dropped, we would end up > freeing that page to the buddy: ouch. > > Now, we'll allow for offlining that memory, and when that unexpected > reference would get dropped, we would not end up freeing that page to > the buddy. Once we have frozen PageOffline() pages, it will all get a > lot cleaner. > > Note that we didn't see the existing WARN_ON so far, because nobody > should ever be referencing such pages. > > Signed-off-by: David Hildenbrand <da...@redhat.com> > --- > drivers/virtio/virtio_mem.c | 111 +----------------------------------- > include/linux/page-flags.h | 29 +++++++--- > mm/memory_hotplug.c | 12 ++-- > mm/page_alloc.c | 8 +-- > mm/page_isolation.c | 21 +++---- > 5 files changed, 42 insertions(+), 139 deletions(-) > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 56d0dbe621637..77b72843c4b53 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex); > static LIST_HEAD(virtio_mem_devices); > > static void virtio_mem_online_page_cb(struct page *page, unsigned int order); > -static void virtio_mem_fake_offline_going_offline(unsigned long pfn, > - unsigned long nr_pages); > -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn, > - unsigned long nr_pages); > static void virtio_mem_retry(struct virtio_mem *vm); > static int virtio_mem_create_resource(struct virtio_mem *vm); > static void virtio_mem_delete_resource(struct virtio_mem *vm); > @@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct > virtio_mem *vm, > virtio_mem_sbm_set_mb_state(vm, mb_id, new_state); > } > > -static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm, > - unsigned long mb_id) > -{ > - const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size); > - unsigned long pfn; > - int sb_id; > - > - for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) { > - if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1)) > - continue; > - pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) + > - sb_id * vm->sbm.sb_size); > - virtio_mem_fake_offline_going_offline(pfn, nr_pages); > - } > -} > - > -static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm, > - unsigned long mb_id) > -{ > - const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size); > - unsigned long pfn; > - int sb_id; > - > - for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) { > - if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1)) > - continue; > - pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) + > - sb_id * vm->sbm.sb_size); > - virtio_mem_fake_offline_cancel_offline(pfn, nr_pages); > - } > -} > - > -static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm, > - unsigned long bb_id, > - unsigned long pfn, > - unsigned long nr_pages) > -{ > - /* > - * When marked as "fake-offline", all online memory of this device block > - * is allocated by us. Otherwise, we don't have any memory allocated. > - */ > - if (virtio_mem_bbm_get_bb_state(vm, bb_id) != > - VIRTIO_MEM_BBM_BB_FAKE_OFFLINE) > - return; > - virtio_mem_fake_offline_going_offline(pfn, nr_pages); > -} > - > -static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm, > - unsigned long bb_id, > - unsigned long pfn, > - unsigned long nr_pages) > -{ > - if (virtio_mem_bbm_get_bb_state(vm, bb_id) != > - VIRTIO_MEM_BBM_BB_FAKE_OFFLINE) > - return; > - virtio_mem_fake_offline_cancel_offline(pfn, nr_pages); > -} > - > /* > * This callback will either be called synchronously from add_memory() or > * asynchronously (e.g., triggered via user space). We have to be careful > @@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct > notifier_block *nb, > break; > } > vm->hotplug_active = true; > - if (vm->in_sbm) > - virtio_mem_sbm_notify_going_offline(vm, id); > - else > - virtio_mem_bbm_notify_going_offline(vm, id, > - mhp->start_pfn, > - mhp->nr_pages); > break; > case MEM_GOING_ONLINE: > mutex_lock(&vm->hotplug_mutex); > @@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct > notifier_block *nb, > case MEM_CANCEL_OFFLINE: > if (!vm->hotplug_active) > break; > - if (vm->in_sbm) > - virtio_mem_sbm_notify_cancel_offline(vm, id); > - else > - virtio_mem_bbm_notify_cancel_offline(vm, id, > - mhp->start_pfn, > - mhp->nr_pages); > vm->hotplug_active = false; > mutex_unlock(&vm->hotplug_mutex); > break; > @@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long > pfn, > SetPageDirty(page); > else > __SetPageOffline(page); > + __SetPageOfflineSkippable(page); > VM_WARN_ON_ONCE(!PageOffline(page)); > } > page_offline_end(); > @@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long > pfn, > for (; nr_pages--; pfn++) { > struct page *page = pfn_to_page(pfn); > > + __ClearPageOfflineSkippable(page); > if (!onlined) > /* generic_online_page() will clear PageOffline(). */ > ClearPageDirty(page); > @@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem > *vm, unsigned long pfn, > return -EBUSY; > } > > -/* > - * Handle fake-offline pages when memory is going offline - such that the > - * pages can be skipped by mm-core when offlining. > - */ > -static void virtio_mem_fake_offline_going_offline(unsigned long pfn, > - unsigned long nr_pages) > -{ > - struct page *page; > - unsigned long i; > - > - /* Drop our reference to the pages so the memory can get offlined. */ > - for (i = 0; i < nr_pages; i++) { > - page = pfn_to_page(pfn + i); > - if (WARN_ON(!page_ref_dec_and_test(page))) > - dump_page(page, "fake-offline page referenced"); > - } > -} > - > -/* > - * Handle fake-offline pages when memory offlining is canceled - to undo > - * what we did in virtio_mem_fake_offline_going_offline(). > - */ > -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn, > - unsigned long nr_pages) > -{ > - unsigned long i; > - > - /* > - * Get the reference again that we dropped via page_ref_dec_and_test() > - * when going offline. > - */ > - for (i = 0; i < nr_pages; i++) > - page_ref_inc(pfn_to_page(pfn + i)); > -} > - > static void virtio_mem_online_page(struct virtio_mem *vm, > struct page *page, unsigned int order) > { > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 1c1d49554c71a..581510ae8e83a 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -178,6 +178,9 @@ enum pageflags { > PG_vmemmap_self_hosted = PG_owner_priv_1, > #endif > > + /* The driver allows for skipping these pages during memory offlining */ > + PG_offline_skippable = PG_owner_2, > + > /* > * Flags only valid for compound pages. Stored in first tail page's > * flags word. Cannot use the first 8 flags or any flag marked as > @@ -1029,14 +1032,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy) > * refcount of 1 and PageOffline(). generic_online_page() will > * take care of clearing PageOffline(). > * > - * If a driver wants to allow to offline unmovable PageOffline() pages > without > - * putting them back to the buddy, it can do so via the memory notifier by > - * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the > - * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline() > - * pages (now with a reference count of zero) are treated like free > (unmanaged) > - * pages, allowing the containing memory block to get offlined. A driver that > - * relies on this feature is aware that re-onlining the memory block will > - * require not giving them to the buddy via generic_online_page(). > + * If a driver wants to allow for offlining unmovable PageOffline() pages > + * when offlining a memory block without exposing these pages as "free" to > + * the buddy, it can mark them as PG_offline_skippable. > + * > + * By marking these PageOffline pages PG_offline_skippable, the driver > + * acknowledges that it > + * (a) knows which pages are PageOffline even without the memmap. > + * (b) implements the online_page_callback_t callback to exclude these pages > + * from getting exposed to the buddy, so they will stay PageOffline() > + * when onlining a memory block. > + * (c) synchronizes against concurrent memory onlining/offlining whenever > + * adjusting the PG_offline_skippable flag. > + * > + * Note that the refcount of these pages will for now be assumed to always > + * be 1: nobody except the owner should be referencing these pages. > + * Offlining code will complain if the refcount is not 1. In the future, > + * these pages will always be frozen and not have a refcount. > * > * Memory offlining code will not adjust the managed page count for any > * PageOffline() pages, treating them like they were never exposed to the > @@ -1048,6 +1060,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy) > * page_offline_freeze()/page_offline_thaw(). > */ > PAGE_TYPE_OPS(Offline, offline, offline) > +__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND) > > extern void page_offline_freeze(void); > extern void page_offline_thaw(void); > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index b1caedbade5b1..0cc5537f234bb 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, > unsigned long end, > goto found; > > /* > - * PageOffline() pages that are not marked __PageMovable() and > - * have a reference count > 0 (after MEM_GOING_OFFLINE) are > - * definitely unmovable. If their reference count would be 0, > - * they could at least be skipped when offlining memory. > + * PageOffline() pages that are neither "movable" nor > + * "skippable" prevent memory offlining. > */ > - if (PageOffline(page) && page_count(page)) > + if (PageOffline(page) && !PageOfflineSkippable(page)) > return -EBUSY; > > if (!PageHuge(page)) > @@ -1807,6 +1805,10 @@ static void do_migrate_range(unsigned long start_pfn, > unsigned long end_pfn) > struct page *page; > > page = pfn_to_page(pfn); > + > + if (PageOffline(page) && PageOfflineSkippable(page)) > + continue; > +
Some comment like "Skippable PageOffline() pages are not migratable but are skipped during memory offline" might help understand the change. Some thoughts after reading the related code: offline_pages() is a little convoluted, since it has two steps to make sure a range of memory can be offlined: 1. start_isolate_page_range() checks unmovable (maybe not migratable is more precise) pages using has_unmovable_pages(), but leaves unmovable PageOffline() page handling to the driver; 2. scan_movable_pages() and do_migrate_range() migrate pages and handle different types of PageOffline() pages. It might make the logic cleaner if start_isolate_page_range() can have a callback to allow driver to handle PageOffline() pages. Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder why offline_pages() takes care of them. Shouldn't virtio-mem driver handle them? I also realize that the two steps in offline_pages() are very similar to alloc_contig_range() and virtio-mem is using alloc_contig_range(). I wonder if the two steps in offline_pages() can be abstracted out and shared with virtio-mem. Yes, offline_pages() operates at memory section granularity, different from the granularity, pageblock size, of alloc_contig_range() used in virtio-mem, but both are trying to check unmovable pages and migrate movable pages. > folio = page_folio(page); > > if (!folio_try_get(folio)) > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a6fe1e9b95941..325b51c905076 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long > start_pfn, > continue; > } > /* > - * At this point all remaining PageOffline() pages have a > - * reference count of 0 and can simply be skipped. > + * At this point all remaining PageOffline() pages must be > + * "skippable" and have exactly one reference. > */ > if (PageOffline(page)) { > - BUG_ON(page_count(page)); > - BUG_ON(PageBuddy(page)); > + WARN_ON_ONCE(!PageOfflineSkippable(page)); > + WARN_ON_ONCE(page_count(page) != 1); > already_offline++; > pfn++; > continue; > diff --git a/mm/page_isolation.c b/mm/page_isolation.c > index b2fc5266e3d26..debd898b794ea 100644 > --- a/mm/page_isolation.c > +++ b/mm/page_isolation.c > @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long > start_pfn, unsigned long e > continue; > > /* > - * We treat all PageOffline() pages as movable when offlining > - * to give drivers a chance to decrement their reference count > - * in MEM_GOING_OFFLINE in order to indicate that these pages > - * can be offlined as there are no direct references anymore. > - * For actually unmovable PageOffline() where the driver does > - * not support this, we will fail later when trying to actually > - * move these pages that still have a reference count > 0. > - * (false negatives in this function only) > + * PageOffline() pages that are marked as "skippable" > + * are treated like movable pages for memory offlining. > */ > - if ((flags & MEMORY_OFFLINE) && PageOffline(page)) > + if ((flags & MEMORY_OFFLINE) && PageOffline(page) && > + PageOfflineSkippable(page)) > continue; With this change, we are no longer give non-virtio-mem driver a chance to decrease PageOffline(page) refcount? Or virtio-mem is the only driver doing this? > > if (__PageMovable(page) || PageLRU(page)) > @@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, > unsigned long end_pfn, > /* A HWPoisoned page cannot be also PageBuddy */ > pfn++; > else if ((flags & MEMORY_OFFLINE) && PageOffline(page) && > - !page_count(page)) > + PageOfflineSkippable(page)) The same question as above. > /* > - * The responsible driver agreed to skip PageOffline() > - * pages when offlining memory by dropping its > - * reference in MEM_GOING_OFFLINE. > + * If the page is a skippable PageOffline() page, > + * we can offline the memory block, as the driver will > + * re-discover them when re-onlining the memory. > */ > pfn++; > else > -- > 2.49.0 Otherwise, LGTM. Acked-by: Zi Yan <z...@nvidia.com> -- Best Regards, Yan, Zi