Re: [PATCH v10 1/5] kasan: support backing vmalloc space with real shadow memory
Hello, Daniel

> @@ -1294,14 +1299,19 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>     spin_lock(&free_vmap_area_lock);
>     llist_for_each_entry_safe(va, n_va, valist, purge_list) {
>         unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
> +       unsigned long orig_start = va->va_start;
> +       unsigned long orig_end = va->va_end;
>
>         /*
>          * Finally insert or merge lazily-freed area. It is
>          * detached and there is no need to "unlink" it from
>          * anything.
>          */
> -       merge_or_add_vmap_area(va,
> -           &free_vmap_area_root, &free_vmap_area_list);
> +       va = merge_or_add_vmap_area(va, &free_vmap_area_root,
> +           &free_vmap_area_list);
> +
> +       kasan_release_vmalloc(orig_start, orig_end,
> +           va->va_start, va->va_end);
>

I have some questions here. I have not analyzed the kasan_release_vmalloc()
logic in detail, sorry if I miss something.

__purge_vmap_area_lazy() deals with a big address space, so it does not free
only vmalloc addresses here; basically an address can be anything from 1 to
ULONG_MAX, whereas the vmalloc space spans from VMALLOC_START to VMALLOC_END:

1) Should it be checked that only a vmalloc address is freed, or do you
   handle it somewhere else?

   if (is_vmalloc_addr(va->va_start))
       kasan_release_vmalloc(...)

2) Have you run any benchmarking just to see how much overhead it adds?
   I am asking because it probably makes sense to add those figures to the
   changelog (commit message). For example you can run:

   sudo ./test_vmalloc.sh performance
   and
   sudo ./test_vmalloc.sh sequential_test_order=1

Thanks!

--
Vlad Rezki
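For reference, the filter suggested in point 1) would sit in the
__purge_vmap_area_lazy() loop roughly as below. This is only a sketch of the
question being asked, not part of the posted patch; note that
is_vmalloc_addr() takes a pointer, so a cast would be needed:

    /* Sketch of the suggested guard, not from the patch. */
    if (is_vmalloc_addr((void *)orig_start))
        kasan_release_vmalloc(orig_start, orig_end,
                              va->va_start, va->va_end);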
Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory
Hello, Daniel.

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a3c70e275f4e..9fb7a16f42ae 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -690,8 +690,19 @@ merge_or_add_vmap_area(struct vmap_area *va,
>     struct list_head *next;
>     struct rb_node **link;
>     struct rb_node *parent;
> +   unsigned long orig_start, orig_end;

Shouldn't that be wrapped around #ifdef CONFIG_KASAN_VMALLOC?

>     bool merged = false;
>
> +   /*
> +    * To manage KASAN vmalloc memory usage, we use this opportunity to
> +    * clean up the shadow memory allocated to back this allocation.
> +    * Because a vmalloc shadow page covers several pages, the start or end
> +    * of an allocation might not align with a shadow page. Use the merging
> +    * opportunities to try to extend the region we can release.
> +    */
> +   orig_start = va->va_start;
> +   orig_end = va->va_end;
> +

The same.

>     /*
>      * Find a place in the tree where VA potentially will be
>      * inserted, unless it is merged with its sibling/siblings.
> @@ -741,6 +752,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
>         if (sibling->va_end == va->va_start) {
>             sibling->va_end = va->va_end;
>
> +           kasan_release_vmalloc(orig_start, orig_end,
> +                                 sibling->va_start,
> +                                 sibling->va_end);
> +

The same.

>             /* Check and update the tree if needed. */
>             augment_tree_propagate_from(sibling);
>
> @@ -754,6 +769,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
>     }
>
> insert:
> +   kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
> +

The same + all further changes in this file.

>     if (!merged) {
>         link_va(va, root, parent, link, head);
>         augment_tree_propagate_from(va);
> @@ -2068,6 +2085,22 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>
>     setup_vmalloc_vm(area, va, flags, caller);
>
> +   /*
> +    * For KASAN, if we are in vmalloc space, we need to cover the shadow
> +    * area with real memory. If we come here through VM_ALLOC, this is
> +    * done by a higher level function that has access to the true size,
> +    * which might not be a full page.
> +    *
> +    * We assume module space comes via VM_ALLOC path.
> +    */
> +   if (is_vmalloc_addr(area->addr) && !(area->flags & VM_ALLOC)) {
> +       if (kasan_populate_vmalloc(area->size, area)) {
> +           unmap_vmap_area(va);
> +           kfree(area);
> +           return NULL;
> +       }
> +   }
> +
>     return area;
> }
>
> @@ -2245,6 +2278,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
>     debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>     debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>
> +   if (area->flags & VM_KASAN)
> +       kasan_poison_vmalloc(area->addr, area->size);
> +
>     vm_remove_mappings(area, deallocate_pages);
>
>     if (deallocate_pages) {
> @@ -2497,6 +2533,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>     if (!addr)
>         return NULL;
>
> +   if (kasan_populate_vmalloc(real_size, area))
> +       return NULL;
> +
>     /*
>      * In this function, newly allocated vm_struct has VM_UNINITIALIZED
>      * flag. It means that vm_struct is not fully initialized.
> @@ -3351,10 +3390,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>     spin_unlock(&vmap_area_lock);
>
>     /* insert all vm's */
> -   for (area = 0; area < nr_vms; area++)
> +   for (area = 0; area < nr_vms; area++) {
>         setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
>                          pcpu_get_vm_areas);
>
> +       /* assume success here */
> +       kasan_populate_vmalloc(sizes[area], vms[area]);
> +   }
> +
>     kfree(vas);
>     return vms;

--
Vlad Rezki
Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory
On Wed, Oct 02, 2019 at 11:23:06AM +1000, Daniel Axtens wrote:
> Hi,
>
> >>    /*
> >>     * Find a place in the tree where VA potentially will be
> >>     * inserted, unless it is merged with its sibling/siblings.
> >> @@ -741,6 +752,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
> >>        if (sibling->va_end == va->va_start) {
> >>            sibling->va_end = va->va_end;
> >>
> >> +          kasan_release_vmalloc(orig_start, orig_end,
> >> +                                sibling->va_start,
> >> +                                sibling->va_end);
> >> +
> >
> > The same.
>
> The call to kasan_release_vmalloc() is a static inline no-op if
> CONFIG_KASAN_VMALLOC is not defined, which I thought was the preferred
> way to do things rather than sprinkling the code with ifdefs?
>
I agree that is totally correct.

> The complier should be smart enough to eliminate all the
> orig_state/orig_end stuff at compile time because it can see that it's
> not used, so there's no cost in the binary.
>
It should. I was thinking more about whether those two variables could be
considered unused, resulting in a compile warning like "set but not used".
But that is theory, and in case of any warning the test robot will notify
us about it anyway. So I am totally fine with it if the compiler does not
complain. If so, please ignore my comments :)

--
Vlad Rezki
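For readers following along: the arrangement Daniel describes, which makes
#ifdefs in vmalloc.c unnecessary, typically looks like the sketch below in
the KASAN header. The prototypes follow the calls quoted in the patch and
are illustrative rather than a verbatim copy of include/linux/kasan.h:

#ifdef CONFIG_KASAN_VMALLOC
int kasan_populate_vmalloc(unsigned long requested_size, struct vm_struct *area);
void kasan_release_vmalloc(unsigned long start, unsigned long end,
                           unsigned long free_region_start,
                           unsigned long free_region_end);
#else
static inline int kasan_populate_vmalloc(unsigned long requested_size,
                                         struct vm_struct *area)
{
        return 0;
}
static inline void kasan_release_vmalloc(unsigned long start, unsigned long end,
                                         unsigned long free_region_start,
                                         unsigned long free_region_end) { }
#endif

With empty static inline stubs for CONFIG_KASAN_VMALLOC=n, callers need no
#ifdefs and the compiler can drop the orig_start/orig_end bookkeeping
entirely.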
Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a3c70e275f4e..9fb7a16f42ae 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -690,8 +690,19 @@ merge_or_add_vmap_area(struct vmap_area *va,
>     struct list_head *next;
>     struct rb_node **link;
>     struct rb_node *parent;
> +   unsigned long orig_start, orig_end;
>     bool merged = false;
>
> +   /*
> +    * To manage KASAN vmalloc memory usage, we use this opportunity to
> +    * clean up the shadow memory allocated to back this allocation.
> +    * Because a vmalloc shadow page covers several pages, the start or end
> +    * of an allocation might not align with a shadow page. Use the merging
> +    * opportunities to try to extend the region we can release.
> +    */
> +   orig_start = va->va_start;
> +   orig_end = va->va_end;
> +
>     /*
>      * Find a place in the tree where VA potentially will be
>      * inserted, unless it is merged with its sibling/siblings.
> @@ -741,6 +752,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
>         if (sibling->va_end == va->va_start) {
>             sibling->va_end = va->va_end;
>
> +           kasan_release_vmalloc(orig_start, orig_end,
> +                                 sibling->va_start,
> +                                 sibling->va_end);
> +
>             /* Check and update the tree if needed. */
>             augment_tree_propagate_from(sibling);
>
> @@ -754,6 +769,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
>     }
>
> insert:
> +   kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
> +
>     if (!merged) {
>         link_va(va, root, parent, link, head);
>         augment_tree_propagate_from(va);

Hello, Daniel.

Looking at it once more, I think the above part of the code is a bit wrong
and should be separated from the merge_or_add_vmap_area() logic. The reason
is to keep it simple and do only what it is supposed to do: merging or
adding. Also kasan_release_vmalloc() gets called twice there, which looks
like duplication.

Apart from that, merge_or_add_vmap_area() can be called via a recovery path
when a vmap/vmaps is/are not even set up. See the percpu allocator.

I guess your part could be moved directly to __purge_vmap_area_lazy(),
where all vmaps are lazily freed. To do so, we also need to modify
merge_or_add_vmap_area() to return the merged area:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e92ff5f7dd8b..fecde4312d68 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -683,7 +683,7 @@ insert_vmap_area_augment(struct vmap_area *va,
  * free area is inserted. If VA has been merged, it is
  * freed.
  */
-static __always_inline void
+static __always_inline struct vmap_area *
 merge_or_add_vmap_area(struct vmap_area *va,
     struct rb_root *root, struct list_head *head)
 {
@@ -750,7 +750,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
 
             /* Free vmap_area object. */
             kmem_cache_free(vmap_area_cachep, va);
-            return;
+
+            /* Point to the new merged area. */
+            va = sibling;
+            merged = true;
         }
     }
 
@@ -759,6 +762,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
         link_va(va, root, parent, link, head);
         augment_tree_propagate_from(va);
     }
+
+    return va;
 }
 
 static __always_inline bool
@@ -1172,7 +1177,7 @@ static void __free_vmap_area(struct vmap_area *va)
     /*
      * Merge VA with its neighbors, otherwise just add it.
      */
-    merge_or_add_vmap_area(va,
+    (void) merge_or_add_vmap_area(va,
         &free_vmap_area_root, &free_vmap_area_list);
 }
 
@@ -1279,15 +1284,20 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
     spin_lock(&vmap_area_lock);
     llist_for_each_entry_safe(va, n_va, valist, purge_list) {
         unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
+        unsigned long orig_start = va->va_start;
+        unsigned long orig_end = va->va_end;
 
         /*
          * Finally insert or merge lazily-freed area. It is
          * detached and there is no need to "unlink" it from
          * anything.
          */
-        merge_or_add_vmap_area(va,
+        va = merge_or_add_vmap_area(va,
             &free_vmap_area_root, &free_vmap_area_list);
 
+        kasan_release_vmalloc(orig_start,
+            orig_end, va->va_start, va->va_end);
+
         atomic_long_sub(nr, &vmap_lazy_nr);
 
         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)

--
Vlad Rezki
Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m
> Hi all,
>
> I'm having some difficulty tracking down a bug.
>
> Some configurations of the powerpc kernel since somewhere in the 5.10
> merge window fail to boot on some ppc64 systems. They hang while trying
> to bring up SMP. It seems to depend on the RCU_SCALE/PERF_TEST option.
> (It was renamed in the 5.10 merge window.)
>
> I can reproduce it as follows with qemu tcg:
>
> make -j64 pseries_le_defconfig
> scripts/config -m RCU_SCALE_TEST
> scripts/config -m RCU_PERF_TEST
> make -j 64 vmlinux CC="ccache gcc"
>
> qemu-system-ppc64 -cpu power9 -M pseries -m 1G -nographic -vga none -smp 4 -kernel vmlinux
>
> ...
> [    0.036284][    T0] Mount-cache hash table entries: 8192 (order: 0, 65536 bytes, linear)
> [    0.036481][    T0] Mountpoint-cache hash table entries: 8192 (order: 0, 65536 bytes, linear)
> [    0.148168][    T1] POWER9 performance monitor hardware support registered
> [    0.151118][    T1] rcu: Hierarchical SRCU implementation.
> [    0.186660][    T1] smp: Bringing up secondary CPUs ...
>
I am not sure if this is helpful, but I checked it on my x86_64 system
using a Linux kernel cross-compiled for powerpc:

urezki@pc638:~/data/coding/linux-rcu.git$ powerpc64-linux-gnu-gcc --version
powerpc64-linux-gnu-gcc (Debian 8.3.0-2) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
urezki@pc638:~/data/coding/linux-rcu.git$

with the ARCH=powerpc make -j64 pseries_defconfig configuration. I used the
5.10.0-rc1 kernel.

urezki@pc638:~/data/coding/linux-rcu.git$ qemu-system-ppc64 --version
QEMU emulator version 3.1.0 (Debian 1:3.1+dfsg-8+deb10u8)
Copyright (c) 2003-2018 Fabrice Bellard and the QEMU Project developers
urezki@pc638:~/data/coding/linux-rcu.git$

On my setup I can run the kernel with and without the modules which are in
question. So

qemu-system-ppc64 -cpu power9 -M pseries -m 1G -nographic -vga none -smp 4 -kernel ./vmlinux

works for both cases, with RCU_SCALE and without. Just in case, maybe this
information can be useful also.

--
Vlad Rezki
Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m
On Thu, Dec 03, 2020 at 01:03:32AM +1100, Michael Ellerman wrote: > Daniel Axtens writes: > > Hi all, > > > > I'm having some difficulty tracking down a bug. > > > > Some configurations of the powerpc kernel since somewhere in the 5.10 > > merge window fail to boot on some ppc64 systems. They hang while trying > > to bring up SMP. It seems to depend on the RCU_SCALE/PERF_TEST option. > > (It was renamed in the 5.10 merge window.) > > > > I can reproduce it as follows with qemu tcg: > > > > make -j64 pseries_le_defconfig > > scripts/config -m RCU_SCALE_TEST > > scripts/config -m RCU_PERF_TEST > > make -j 64 vmlinux CC="ccache gcc" > > > > qemu-system-ppc64 -cpu power9 -M pseries -m 1G -nographic -vga none -smp 4 > > -kernel vmlinux > > > > ... > > [0.036284][T0] Mount-cache hash table entries: 8192 (order: 0, > > 65536 bytes, linear) > > [0.036481][T0] Mountpoint-cache hash table entries: 8192 (order: 0, > > 65536 bytes, linear) > > [0.148168][T1] POWER9 performance monitor hardware support > > registered > > [0.151118][T1] rcu: Hierarchical SRCU implementation. > > [0.186660][T1] smp: Bringing up secondary CPUs ... > > > > One does not simply hang :) > > > I have no idea why RCU_SCALE/PERF_TEST would be causing this, but that > > seems to be what does it: if I don't set that, the kernel boots fine. > > It seems to be TASKS_RCU that is the key. > > I don't need RCU_SCALE_TEST enabled, I can trigger it just with the > following applied: > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index 0ebe15a84985..f3500c95d6a1 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -78,7 +78,7 @@ config TASKS_RCU_GENERIC > task-based RCU implementations. Not for manual selection. > > config TASKS_RCU > - def_bool PREEMPTION > + def_bool y > help > This option enables a task-based RCU implementation that uses > only voluntary context switch (not preemption!), idle, and > > > And bisect points to: > 36dadef23fcc ("kprobes: Init kprobes in early_initcall") > > Which moved init_kprobes() prior to SMP bringup. > > > For some reason when it gets stuck sysrq doesn't work, but I was able to > get it into gdb and manually call handle_sysrq('t') to get the output > below. 
> > The SMP bringup stalls because _cpu_up() is blocked trying to take > cpu_hotplug_lock for writing: > > [ 401.403132][T0] task:swapper/0 state:D stack:12512 pid:1 > ppid: 0 flags:0x0800 > [ 401.403502][T0] Call Trace: > [ 401.403907][T0] [c62c37d0] [c62c3830] > 0xc62c3830 (unreliable) > [ 401.404068][T0] [c62c39b0] [c0019d70] > __switch_to+0x2e0/0x4a0 > [ 401.404189][T0] [c62c3a10] [c0b87228] > __schedule+0x288/0x9b0 > [ 401.404257][T0] [c62c3ad0] [c0b879b8] > schedule+0x68/0x120 > [ 401.404324][T0] [c62c3b00] [c0184ad4] > percpu_down_write+0x164/0x170 > [ 401.404390][T0] [c62c3b50] [c0116b68] > _cpu_up+0x68/0x280 > [ 401.404475][T0] [c62c3bb0] [c0116e70] cpu_up+0xf0/0x140 > [ 401.404546][T0] [c62c3c30] [c011776c] > bringup_nonboot_cpus+0xac/0xf0 > [ 401.404643][T0] [c62c3c80] [c0eea1b8] > smp_init+0x40/0xcc > [ 401.404727][T0] [c62c3ce0] [c0ec43dc] > kernel_init_freeable+0x1e0/0x3a0 > [ 401.404799][T0] [c62c3db0] [c0011ec4] > kernel_init+0x24/0x150 > [ 401.404958][T0] [c62c3e20] [c000daf0] > ret_from_kernel_thread+0x5c/0x6c > > It can't get it because kprobe_optimizer() has taken it for read and is now > blocked waiting for synchronize_rcu_tasks(): > > [ 401.418808][T0] task:kworker/0:1 state:D stack:13392 pid: 12 > ppid: 2 flags:0x0800 > [ 401.418951][T0] Workqueue: events kprobe_optimizer > [ 401.419078][T0] Call Trace: > [ 401.419121][T0] [c62ef650] [c62ef710] > 0xc62ef710 (unreliable) > [ 401.419213][T0] [c62ef830] [c0019d70] > __switch_to+0x2e0/0x4a0 > [ 401.419281][T0] [c62ef890] [c0b87228] > __schedule+0x288/0x9b0 > [ 401.419347][T0] [c62ef950] [c0b879b8] > schedule+0x68/0x120 > [ 401.419415][T0] [c62ef980] [c0b8e664] > schedule_timeout+0x2a4/0x340 > [ 401.419484][T0] [c62efa80] [c0b894ec] > wait_for_completion+0x9c/0x170 > [ 401.419552][T0] [c62efae0] [c01ac85c] > __wait_rcu_gp+0x19c/0x210 > [ 401.419619][T0] [c62efb40] [c01ac90c] > synchronize_rcu_tasks_generic+0x3c/0x70 > [ 401.419690][T0] [c62efbe0] [c022a3dc] > kprobe_optimizer+0x1dc/0x470 > [ 401.419757][T0] [c62efc60] [c0136684] > process_one_work+0x2f4/0x530 > [ 401.419823][T0] [c62efd20] [c0138d28] >
Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m
On Thu, Dec 03, 2020 at 05:22:20PM +1100, Michael Ellerman wrote: > Uladzislau Rezki writes: > > On Thu, Dec 03, 2020 at 01:03:32AM +1100, Michael Ellerman wrote: > ... > >> > >> The SMP bringup stalls because _cpu_up() is blocked trying to take > >> cpu_hotplug_lock for writing: > >> > >> [ 401.403132][T0] task:swapper/0 state:D stack:12512 pid:1 > >> ppid: 0 flags:0x0800 > >> [ 401.403502][T0] Call Trace: > >> [ 401.403907][T0] [c62c37d0] [c62c3830] > >> 0xc62c3830 (unreliable) > >> [ 401.404068][T0] [c62c39b0] [c0019d70] > >> __switch_to+0x2e0/0x4a0 > >> [ 401.404189][T0] [c62c3a10] [c0b87228] > >> __schedule+0x288/0x9b0 > >> [ 401.404257][T0] [c62c3ad0] [c0b879b8] > >> schedule+0x68/0x120 > >> [ 401.404324][T0] [c62c3b00] [c0184ad4] > >> percpu_down_write+0x164/0x170 > >> [ 401.404390][T0] [c62c3b50] [c0116b68] > >> _cpu_up+0x68/0x280 > >> [ 401.404475][T0] [c62c3bb0] [c0116e70] > >> cpu_up+0xf0/0x140 > >> [ 401.404546][T0] [c62c3c30] [c011776c] > >> bringup_nonboot_cpus+0xac/0xf0 > >> [ 401.404643][T0] [c62c3c80] [c0eea1b8] > >> smp_init+0x40/0xcc > >> [ 401.404727][T0] [c62c3ce0] [c0ec43dc] > >> kernel_init_freeable+0x1e0/0x3a0 > >> [ 401.404799][T0] [c62c3db0] [c0011ec4] > >> kernel_init+0x24/0x150 > >> [ 401.404958][T0] [c62c3e20] [c000daf0] > >> ret_from_kernel_thread+0x5c/0x6c > >> > >> It can't get it because kprobe_optimizer() has taken it for read and is now > >> blocked waiting for synchronize_rcu_tasks(): > >> > >> [ 401.418808][T0] task:kworker/0:1 state:D stack:13392 pid: 12 > >> ppid: 2 flags:0x0800 > >> [ 401.418951][T0] Workqueue: events kprobe_optimizer > >> [ 401.419078][T0] Call Trace: > >> [ 401.419121][T0] [c62ef650] [c62ef710] > >> 0xc62ef710 (unreliable) > >> [ 401.419213][T0] [c62ef830] [c0019d70] > >> __switch_to+0x2e0/0x4a0 > >> [ 401.419281][T0] [c62ef890] [c0b87228] > >> __schedule+0x288/0x9b0 > >> [ 401.419347][T0] [c62ef950] [c0b879b8] > >> schedule+0x68/0x120 > >> [ 401.419415][T0] [c62ef980] [c0b8e664] > >> schedule_timeout+0x2a4/0x340 > >> [ 401.419484][T0] [c62efa80] [c0b894ec] > >> wait_for_completion+0x9c/0x170 > >> [ 401.419552][T0] [c62efae0] [c01ac85c] > >> __wait_rcu_gp+0x19c/0x210 > >> [ 401.419619][T0] [c62efb40] [c01ac90c] > >> synchronize_rcu_tasks_generic+0x3c/0x70 > >> [ 401.419690][T0] [c62efbe0] [c022a3dc] > >> kprobe_optimizer+0x1dc/0x470 > >> [ 401.419757][T0] [c62efc60] [c0136684] > >> process_one_work+0x2f4/0x530 > >> [ 401.419823][T0] [c62efd20] [c0138d28] > >> worker_thread+0x78/0x570 > >> [ 401.419891][T0] [c62efdb0] [c0142424] > >> kthread+0x194/0x1a0 > >> [ 401.419976][T0] [c62efe20] [c000daf0] > >> ret_from_kernel_thread+0x5c/0x6c > >> > >> But why is the synchronize_rcu_tasks() not completing? > >> > > I think that it is because RCU is not fully initialized by that time. > > Yeah that would explain it :) > > > The 36dadef23fcc ("kprobes: Init kprobes in early_initcall") patch > > switches to early_initcall() that has a higher priority sequence than > > core_initcall() that is used to complete an RCU setup in the > > rcu_set_runtime_mode(). > > I was looking at debug_lockdep_rcu_enabled(), which is: > > noinstr int notrace debug_lockdep_rcu_enabled(void) > { > return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks && > current->lockdep_recursion == 0; > } > > That is not firing any warnings for me because rcu_scheduler_active is: > > (gdb) p/x rcu_scheduler_active > $1 = 0x1 > > Which is: > > #define RCU_SCHEDULER_INIT1 > Agree with that. 
> But that's different to RCU_SCHEDULER_RUNNING, which is set in > rcu_set_runtime_m
Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m
On Thu, Dec 03, 2020 at 03:34:45PM +0100, Uladzislau Rezki wrote: > On Thu, Dec 03, 2020 at 05:22:20PM +1100, Michael Ellerman wrote: > > Uladzislau Rezki writes: > > > On Thu, Dec 03, 2020 at 01:03:32AM +1100, Michael Ellerman wrote: > > ... > > >> > > >> The SMP bringup stalls because _cpu_up() is blocked trying to take > > >> cpu_hotplug_lock for writing: > > >> > > >> [ 401.403132][T0] task:swapper/0 state:D stack:12512 pid: > > >> 1 ppid: 0 flags:0x0800 > > >> [ 401.403502][T0] Call Trace: > > >> [ 401.403907][T0] [c62c37d0] [c62c3830] > > >> 0xc62c3830 (unreliable) > > >> [ 401.404068][T0] [c62c39b0] [c0019d70] > > >> __switch_to+0x2e0/0x4a0 > > >> [ 401.404189][T0] [c62c3a10] [c0b87228] > > >> __schedule+0x288/0x9b0 > > >> [ 401.404257][T0] [c62c3ad0] [c0b879b8] > > >> schedule+0x68/0x120 > > >> [ 401.404324][T0] [c62c3b00] [c0184ad4] > > >> percpu_down_write+0x164/0x170 > > >> [ 401.404390][T0] [c62c3b50] [c0116b68] > > >> _cpu_up+0x68/0x280 > > >> [ 401.404475][T0] [c62c3bb0] [c0116e70] > > >> cpu_up+0xf0/0x140 > > >> [ 401.404546][T0] [c62c3c30] [c011776c] > > >> bringup_nonboot_cpus+0xac/0xf0 > > >> [ 401.404643][T0] [c62c3c80] [c0eea1b8] > > >> smp_init+0x40/0xcc > > >> [ 401.404727][T0] [c62c3ce0] [c0ec43dc] > > >> kernel_init_freeable+0x1e0/0x3a0 > > >> [ 401.404799][T0] [c62c3db0] [c0011ec4] > > >> kernel_init+0x24/0x150 > > >> [ 401.404958][T0] [c62c3e20] [c000daf0] > > >> ret_from_kernel_thread+0x5c/0x6c > > >> > > >> It can't get it because kprobe_optimizer() has taken it for read and is > > >> now > > >> blocked waiting for synchronize_rcu_tasks(): > > >> > > >> [ 401.418808][T0] task:kworker/0:1 state:D stack:13392 pid: > > >> 12 ppid: 2 flags:0x0800 > > >> [ 401.418951][T0] Workqueue: events kprobe_optimizer > > >> [ 401.419078][T0] Call Trace: > > >> [ 401.419121][T0] [c62ef650] [c62ef710] > > >> 0xc62ef710 (unreliable) > > >> [ 401.419213][T0] [c62ef830] [c0019d70] > > >> __switch_to+0x2e0/0x4a0 > > >> [ 401.419281][T0] [c62ef890] [c0b87228] > > >> __schedule+0x288/0x9b0 > > >> [ 401.419347][T0] [c62ef950] [c0b879b8] > > >> schedule+0x68/0x120 > > >> [ 401.419415][T0] [c62ef980] [c0b8e664] > > >> schedule_timeout+0x2a4/0x340 > > >> [ 401.419484][T0] [c62efa80] [c0b894ec] > > >> wait_for_completion+0x9c/0x170 > > >> [ 401.419552][T0] [c62efae0] [c01ac85c] > > >> __wait_rcu_gp+0x19c/0x210 > > >> [ 401.419619][T0] [c62efb40] [c01ac90c] > > >> synchronize_rcu_tasks_generic+0x3c/0x70 > > >> [ 401.419690][T0] [c62efbe0] [c022a3dc] > > >> kprobe_optimizer+0x1dc/0x470 > > >> [ 401.419757][T0] [c62efc60] [c0136684] > > >> process_one_work+0x2f4/0x530 > > >> [ 401.419823][T0] [c62efd20] [c0138d28] > > >> worker_thread+0x78/0x570 > > >> [ 401.419891][T0] [c62efdb0] [c0142424] > > >> kthread+0x194/0x1a0 > > >> [ 401.419976][T0] [c62efe20] [c000daf0] > > >> ret_from_kernel_thread+0x5c/0x6c > > >> > > >> But why is the synchronize_rcu_tasks() not completing? OK. Seems i have understood why the synchronize_rcu_tasks() is not doing progress waiting on complition. Actually the GP kthreads are not spawned by the time when early_initcall callbacks gets invoked. It means that callbacks will not be processed because GP kthreads do not exist, so wakeme_after_rcu() is not invoked, thus does not signal about that a grace period has elapsed. diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 35bdcfd84d42..c5422bba7fe7 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -241,7 +241,7 @@ static int __noreturn rcu_tasks_kthread
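To make the failure mode concrete: synchronize_rcu_tasks() boils down, via
__wait_rcu_gp(), to queueing a callback and sleeping on a completion,
roughly as in the simplified sketch below (not the actual kernel
implementation). If the rcu_tasks grace-period kthread has not been spawned
yet, nothing ever runs wakeme_after_rcu(), so the caller sleeps forever,
which is exactly the kprobe_optimizer() stack shown earlier in the thread:

static void example_synchronize_rcu_tasks(void)
{
        struct rcu_synchronize rs;

        init_rcu_head_on_stack(&rs.head);
        init_completion(&rs.completion);

        /* Queued, but only the rcu_tasks GP kthread ever processes it... */
        call_rcu_tasks(&rs.head, wakeme_after_rcu);

        /* ...so if that kthread does not exist yet, this never returns. */
        wait_for_completion(&rs.completion);

        destroy_rcu_head_on_stack(&rs.head);
}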
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote:
> > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > > > Since SLOB was removed, it is not necessary to use call_rcu
> > > > > when the callback only performs kmem_cache_free. Use
> > > > > kfree_rcu() directly.
> > > > >
> > > > > The changes were done using the following Coccinelle semantic patch.
> > > > > This semantic patch is designed to ignore cases where the callback
> > > > > function is used in another way.
> > > >
> > > > How does the discussion on:
> > > >   [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > > >   https://lore.kernel.org/all/20240612133357.2596-1-linus.luess...@c0d3.blue/
> > > > reflect on this series? IIUC we should hold off..
> > >
> > > We do need to hold off for the ones in kernel modules (such as 07/14)
> > > where the kmem_cache is destroyed during module unload.
> > >
> > > OK, I might as well go through them...
> > >
> > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> > >     Needs to wait, see wg_allowedips_slab_uninit().
> >
> > Also, notably, this patch needs additionally:
> >
> > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
> > index e4e1638fce1b..c95f6937c3f1 100644
> > --- a/drivers/net/wireguard/allowedips.c
> > +++ b/drivers/net/wireguard/allowedips.c
> > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void)
> >
> >  void wg_allowedips_slab_uninit(void)
> >  {
> > -    rcu_barrier();
> >      kmem_cache_destroy(node_cache);
> >  }
> >
> > Once kmem_cache_destroy has been fixed to be deferrable.
> >
> > I assume the other patches are similar -- an rcu_barrier() can be
> > removed. So some manual meddling of these might be in order.
>
> Assuming that the deferrable kmem_cache_destroy() is the option chosen,
> agreed.
>
void kmem_cache_destroy(struct kmem_cache *s)
{
    int err = -EBUSY;
    bool rcu_set;

    if (unlikely(!s) || !kasan_check_byte(s))
        return;

    cpus_read_lock();
    mutex_lock(&slab_mutex);

    rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU;

    s->refcount--;
    if (s->refcount)
        goto out_unlock;

    err = shutdown_cache(s);
    WARN(err, "%s %s: Slab cache still has objects when called from %pS",
         __func__, s->name, (void *)_RET_IP_);
    ...
    cpus_read_unlock();
    if (!err && !rcu_set)
        kmem_cache_release(s);
}

So we have the SLAB_TYPESAFE_BY_RCU flag that defers freeing of the slab
pages and the cache itself by a grace period. A similar flag can be added,
say SLAB_DESTROY_ONCE_FULLY_FREED; in that case a worker rearms itself if
there are still objects which should be freed.

Any thoughts here?

--
Uladzislau Rezki
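A minimal sketch of the rearming-worker idea, assuming the
SLAB_DESTROY_ONCE_FULLY_FREED flag proposed above; the deferred_destroy_work
field and the cache_fully_freed() helper are hypothetical, not existing
kernel API:

static void kmem_cache_deferred_destroy_workfn(struct work_struct *work)
{
        struct kmem_cache *s = container_of(to_delayed_work(work),
                                            struct kmem_cache,
                                            deferred_destroy_work);

        mutex_lock(&slab_mutex);
        if (cache_fully_freed(s)) {
                /* All objects returned, the cache can finally go away. */
                shutdown_cache(s);
                /* final kmem_cache_release() omitted for brevity */
                mutex_unlock(&slab_mutex);
                return;
        }
        mutex_unlock(&slab_mutex);

        /* Objects (e.g. kfree_rcu() ones) still in flight: try again later. */
        schedule_delayed_work(&s->deferred_destroy_work, HZ);
}

kmem_cache_destroy() would then, for caches carrying the new flag, queue
this work instead of warning about objects still in use.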
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic > > > > > > > patch. > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > function is used in another way. > > > > > > > > > > > > How does the discussion on: > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() > > > > > > with free-only callbacks" > > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luess...@c0d3.blue/ > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu > > > > > for simple kmem_cache_free callback > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c > > > > b/drivers/net/wireguard/allowedips.c > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > --- a/drivers/net/wireguard/allowedips.c > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > { > > > > - rcu_barrier(); > > > > kmem_cache_destroy(node_cache); > > > > } > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > removed. So some manual meddling of these might be in order. > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > agreed. > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > { > > int err = -EBUSY; > > bool rcu_set; > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > return; > > > > cpus_read_lock(); > > mutex_lock(&slab_mutex); > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > s->refcount--; > > if (s->refcount) > > goto out_unlock; > > > > err = shutdown_cache(s); > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > __func__, s->name, (void *)_RET_IP_); > > ... > > cpus_read_unlock(); > > if (!err && !rcu_set) > > kmem_cache_release(s); > > } > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > and a cache by a grace period. Similar flag can be added, like > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > if there are still objects which should be freed. > > > > Any thoughts here? > > Wouldn't we also need some additional code to later check for all objects > being freed to the slab, whether or not that code is initiated from > kmem_cache_destroy()? 
>
The same way SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy()
function: it checks that flag, and if it is set, an extra worker is
scheduled to perform a deferred destroy (instead of destroying right away)
after rcu_barrier() finishes.

--
Uladzislau Rezki
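For comparison, the existing SLAB_TYPESAFE_BY_RCU path mentioned above has
roughly the following shape in mm/slab_common.c (abridged and slightly
simplified, not a verbatim copy): kmem_cache_destroy() moves the cache onto
a list, and the work item waits for a grace period before the final release.

static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
{
        LIST_HEAD(to_destroy);
        struct kmem_cache *s, *s2;

        mutex_lock(&slab_mutex);
        list_splice_init(&slab_caches_to_rcu_destroy, &to_destroy);
        mutex_unlock(&slab_mutex);

        if (list_empty(&to_destroy))
                return;

        /* Wait for in-flight RCU callbacks that may still touch the caches. */
        rcu_barrier();

        list_for_each_entry_safe(s, s2, &to_destroy, list)
                kmem_cache_release(s);
}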
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic > > > > > > > > > patch. > > > > > > > > > This semantic patch is designed to ignore cases where the > > > > > > > > > callback > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over > > > > > > > > call_rcu() with free-only callbacks" > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luess...@c0d3.blue/ > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as > > > > > > > 07/14) > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by > > > > > > > kfree_rcu for simple kmem_cache_free callback > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c > > > > > > b/drivers/net/wireguard/allowedips.c > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > { > > > > > > - rcu_barrier(); > > > > > > kmem_cache_destroy(node_cache); > > > > > > } > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option > > > > > chosen, > > > > > agreed. > > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > { > > > > int err = -EBUSY; > > > > bool rcu_set; > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > return; > > > > > > > > cpus_read_lock(); > > > > mutex_lock(&slab_mutex); > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > s->refcount--; > > > > if (s->refcount) > > > > goto out_unlock; > > > > > > > > err = shutdown_cache(s); > > > > WARN(err, "%s %s: Slab cache still has objects when called from >
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Thu, Jun 13, 2024 at 11:13:52AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld > > > > > > > wrote: > > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney > > > > > > > > wrote: > > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski > > > > > > > > > wrote: > > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > > Since SLOB was removed, it is not necessary to use > > > > > > > > > > > call_rcu > > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle > > > > > > > > > > > semantic patch. > > > > > > > > > > > This semantic patch is designed to ignore cases where the > > > > > > > > > > > callback > > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over > > > > > > > > > > call_rcu() with free-only callbacks" > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luess...@c0d3.blue/ > > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such > > > > > > > > > as 07/14) > > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by > > > > > > > > > kfree_rcu for simple kmem_cache_free callback > > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c > > > > > > > > b/drivers/net/wireguard/allowedips.c > > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > > { > > > > > > > > - rcu_barrier(); > > > > > > > > kmem_cache_destroy(node_cache); > > > > > > > > } > > > > > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can > > > > > > > > be > > > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option > > > > > > > chosen, > >
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Fri, Jun 14, 2024 at 07:17:29AM -0700, Paul E. McKenney wrote: > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 11:13:52AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney > > > > > > > > wrote: > > > > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld > > > > > > > > > wrote: > > > > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney > > > > > > > > > > wrote: > > > > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski > > > > > > > > > > > wrote: > > > > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > > > > Since SLOB was removed, it is not necessary to use > > > > > > > > > > > > > call_rcu > > > > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle > > > > > > > > > > > > > semantic patch. > > > > > > > > > > > > > This semantic patch is designed to ignore cases where > > > > > > > > > > > > > the callback > > > > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over > > > > > > > > > > > > call_rcu() with free-only callbacks" > > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luess...@c0d3.blue/ > > > > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules > > > > > > > > > > > (such as 07/14) > > > > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by > > > > > > > > > > > kfree_rcu for simple kmem_cache_free callback > > > > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c > > > > > > > > > > b/drivers/net/wireguard/allowedips.c > > > > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > > > > { > > > > > > > > > > - rcu_barrier(); > >
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Fri, Jun 14, 2024 at 09:33:45PM +0200, Jason A. Donenfeld wrote:
> On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote:
> > +   /* Should a destroy process be deferred? */
> > +   if (s->flags & SLAB_DEFER_DESTROY) {
> > +       list_move_tail(&s->list, &slab_caches_defer_destroy);
> > +       schedule_delayed_work(&slab_caches_defer_destroy_work, HZ);
> > +       goto out_unlock;
> > +   }
>
> Wouldn't it be smoother to have the actual kmem_cache_free() function
> check to see if it's been marked for destruction and the refcount is
> zero, rather than polling every one second? I mentioned this approach
> in: https://lore.kernel.org/all/zmo9-ygraicj5...@zx2c4.com/ -
>
> I wonder if the right fix to this would be adding a `should_destroy`
> boolean to kmem_cache, which kmem_cache_destroy() sets to true. And
> then right after it checks `if (number_of_allocations == 0)
> actually_destroy()`, and likewise on each kmem_cache_free(), it
> could check `if (should_destroy && number_of_allocations == 0)
> actually_destroy()`.
>
I do not see polling as a bad way to go. But your proposal sounds
reasonable to me also. We can combine both "prototypes" into one and offer
that.

Can you post a prototype here?

Thanks!

--
Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Mon, Jun 17, 2024 at 04:56:17PM +0200, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 03:50:56PM +0200, Uladzislau Rezki wrote: > > On Fri, Jun 14, 2024 at 09:33:45PM +0200, Jason A. Donenfeld wrote: > > > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > > > + /* Should a destroy process be deferred? */ > > > > + if (s->flags & SLAB_DEFER_DESTROY) { > > > > + list_move_tail(&s->list, &slab_caches_defer_destroy); > > > > + schedule_delayed_work(&slab_caches_defer_destroy_work, > > > > HZ); > > > > + goto out_unlock; > > > > + } > > > > > > Wouldn't it be smoother to have the actual kmem_cache_free() function > > > check to see if it's been marked for destruction and the refcount is > > > zero, rather than polling every one second? I mentioned this approach > > > in: https://lore.kernel.org/all/zmo9-ygraicj5...@zx2c4.com/ - > > > > > > I wonder if the right fix to this would be adding a `should_destroy` > > > boolean to kmem_cache, which kmem_cache_destroy() sets to true. And > > > then right after it checks `if (number_of_allocations == 0) > > > actually_destroy()`, and likewise on each kmem_cache_free(), it > > > could check `if (should_destroy && number_of_allocations == 0) > > > actually_destroy()`. > > > > > I do not find pooling as bad way we can go with. But your proposal > > sounds reasonable to me also. We can combine both "prototypes" to > > one and offer. > > > > Can you post a prototype here? > > This is untested, but the simplest, shortest possible version would be: > > diff --git a/mm/slab.h b/mm/slab.h > index 5f8f47c5bee0..907c0ea56c01 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -275,6 +275,7 @@ struct kmem_cache { > unsigned int inuse; /* Offset to metadata */ > unsigned int align; /* Alignment */ > unsigned int red_left_pad; /* Left redzone padding size */ > + bool is_destroyed; /* Destruction happens when no objects > */ > const char *name; /* Name (only for display!) */ > struct list_head list; /* List of slab caches */ > #ifdef CONFIG_SYSFS > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 1560a1546bb1..f700bed066d9 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -494,8 +494,8 @@ void kmem_cache_destroy(struct kmem_cache *s) > goto out_unlock; > > err = shutdown_cache(s); > - WARN(err, "%s %s: Slab cache still has objects when called from %pS", > - __func__, s->name, (void *)_RET_IP_); > + if (err) > + s->is_destroyed = true; > Here if an "err" is less then "0" means there are still objects whereas "is_destroyed" is set to "true" which is not correlated with a comment: "Destruction happens when no objects" > out_unlock: > mutex_unlock(&slab_mutex); > cpus_read_unlock(); > diff --git a/mm/slub.c b/mm/slub.c > index 1373ac365a46..7db8fe90a323 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > return; > trace_kmem_cache_free(_RET_IP_, x, s); > slab_free(s, virt_to_slab(x), x, _RET_IP_); > + if (s->is_destroyed) > + kmem_cache_destroy(s); > } > EXPORT_SYMBOL(kmem_cache_free); > > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct > kmem_cache_node *n) > if (!slab->inuse) { > remove_partial(n, slab); > list_add(&slab->slab_list, &discard); > - } else { > - list_slab_objects(s, slab, > - "Objects remaining in %s on __kmem_cache_shutdown()"); > } > } > spin_unlock_irq(&n->list_lock); > Anyway it looks like it was not welcome to do it in the kmem_cache_free() function due to performance reason. -- Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Mon, Jun 17, 2024 at 06:33:23PM +0200, Jason A. Donenfeld wrote:
> On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote:
> > Here if an "err" is less then "0" means there are still objects
> > whereas "is_destroyed" is set to "true" which is not correlated
> > with a comment:
> >
> > "Destruction happens when no objects"
>
> The comment is just poorly written. But the logic of the code is right.
>
OK.

> >
> > > out_unlock:
> > >     mutex_unlock(&slab_mutex);
> > >     cpus_read_unlock();
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 1373ac365a46..7db8fe90a323 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
> > >     return;
> > >     trace_kmem_cache_free(_RET_IP_, x, s);
> > >     slab_free(s, virt_to_slab(x), x, _RET_IP_);
> > > +   if (s->is_destroyed)
> > > +       kmem_cache_destroy(s);
>
Here I do not follow you. How do you see that a cache has been fully freed?
Or is it just rough draft code?

Thanks!

--
Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Mon, Jun 17, 2024 at 06:57:45PM +0200, Jason A. Donenfeld wrote:
> On Mon, Jun 17, 2024 at 06:42:23PM +0200, Uladzislau Rezki wrote:
> > On Mon, Jun 17, 2024 at 06:33:23PM +0200, Jason A. Donenfeld wrote:
> > > On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote:
> > > > Here if an "err" is less then "0" means there are still objects
> > > > whereas "is_destroyed" is set to "true" which is not correlated
> > > > with a comment:
> > > >
> > > > "Destruction happens when no objects"
> > >
> > > The comment is just poorly written. But the logic of the code is right.
> > >
> > OK.
> >
> > > > >
> > > > > out_unlock:
> > > > >     mutex_unlock(&slab_mutex);
> > > > >     cpus_read_unlock();
> > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > index 1373ac365a46..7db8fe90a323 100644
> > > > > --- a/mm/slub.c
> > > > > +++ b/mm/slub.c
> > > > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
> > > > >     return;
> > > > >     trace_kmem_cache_free(_RET_IP_, x, s);
> > > > >     slab_free(s, virt_to_slab(x), x, _RET_IP_);
> > > > > +   if (s->is_destroyed)
> > > > > +       kmem_cache_destroy(s);
> >
> > Here i am not follow you. How do you see that a cache has been fully
> > freed? Or is it just super draft code?
>
> kmem_cache_destroy() does this in shutdown_cache().
>
Right. In this scenario you invoke kmem_cache_destroy() over and over until
the last object gets freed. That potentially slows down kmem_cache_free(),
which is not OK, at least to me.

--
Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
eate_cache(const char *name,
>
>     s->refcount = 1;
>     list_add(&s->list, &slab_caches);
> +   INIT_WORK(&s->async_destroy_work, kmem_cache_kfree_rcu_destroy_workfn);
>     return s;
>
> out_free_cache:
> @@ -449,12 +452,16 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
>     }
> }
>
> -static int shutdown_cache(struct kmem_cache *s)
> +static int shutdown_cache(struct kmem_cache *s, bool warn_inuse)
> {
>     /* free asan quarantined objects */
> +   /*
> +    * XXX: is it ok to call this multiple times? and what happens with a
> +    * kfree_rcu() in flight that finishes after or in parallel with this?
> +    */
>     kasan_cache_shutdown(s);
>
> -   if (__kmem_cache_shutdown(s) != 0)
> +   if (__kmem_cache_shutdown(s, warn_inuse) != 0)
>         return -EBUSY;
>
>     list_del(&s->list);
> @@ -477,6 +484,32 @@ void slab_kmem_cache_release(struct kmem_cache *s)
>     kmem_cache_free(kmem_cache, s);
> }
>
> +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work)
> +{
> +   struct kmem_cache *s;
> +   int err = -EBUSY;
> +   bool rcu_set;
> +
> +   s = container_of(work, struct kmem_cache, async_destroy_work);
> +
> +   // XXX use the real kmem_cache_free_barrier() or similar thing here

It implies that we need to introduce kfree_rcu_barrier(), a new API, which I
wanted to avoid initially. Since you do it asynchronously, can we just
repeat and wait until the cache is fully freed? I am asking because
inventing a new kfree_rcu_barrier() might not be so straightforward.

--
Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> + > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> + > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > > wanted to avoid initially. > > I wanted to avoid new API or flags for kfree_rcu() users and this would > be achieved. The barrier is used internally so I don't consider that an > API to avoid. How difficult is the implementation is another question, > depending on how the current batching works. Once (if) we have sheaves > proven to work and move kfree_rcu() fully into SLUB, the barrier might > also look different and hopefully easier. So maybe it's not worth to > invest too much into that barrier and just go for the potentially > longer, but easier to implement? > Right. I agree here. If the cache is not empty, OK, we just defer the work, even we can use a big 21 seconds delay, after that we just "warn" if it is still not empty and leave it as it is, i.e. emit a warning and we are done. Destroying the cache is not something that must happen right away. > > Since you do it asynchronous can we just repeat > > and wait until it a cache is furry freed? > > The problem is we want to detect the cases when it's not fully freed > because there was an actual read. So at some point we'd need to stop the > repeats because we know there can no longer be any kfree_rcu()'s in > flight since the kmem_cache_destroy() was called. > Agree. As noted above, we can go with 21 seconds(as an example) interval and just perform destroy(without repeating). -- Uladzislau Rezki
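The one-shot compromise described above (wait once, warn if objects remain,
do not rearm) could look roughly like this; the 21-second window is just the
number used in the discussion and the deferred_destroy_work field is
hypothetical:

static void kmem_cache_destroy_timeout_workfn(struct work_struct *work)
{
        struct kmem_cache *s = container_of(to_delayed_work(work),
                                            struct kmem_cache,
                                            deferred_destroy_work);
        int err;

        cpus_read_lock();
        mutex_lock(&slab_mutex);
        err = shutdown_cache(s);
        /* Still not empty after the grace window: warn and leave it as is. */
        WARN(err, "kmem_cache %s still has objects after deferred destroy window",
             s->name);
        mutex_unlock(&slab_mutex);
        cpus_read_unlock();

        if (!err)
                kmem_cache_release(s);
}

/* Queued once from kmem_cache_destroy(), e.g.: */
/* schedule_delayed_work(&s->deferred_destroy_work, 21 * HZ); */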
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Tue, Jun 18, 2024 at 09:48:49AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote:
> > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote:
> > > >> +
> > > >> +   s = container_of(work, struct kmem_cache, async_destroy_work);
> > > >> +
> > > >> +   // XXX use the real kmem_cache_free_barrier() or similar thing here
> > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i
> > > > wanted to avoid initially.
> > >
> > > I wanted to avoid new API or flags for kfree_rcu() users and this would
> > > be achieved. The barrier is used internally so I don't consider that an
> > > API to avoid. How difficult is the implementation is another question,
> > > depending on how the current batching works. Once (if) we have sheaves
> > > proven to work and move kfree_rcu() fully into SLUB, the barrier might
> > > also look different and hopefully easier. So maybe it's not worth to
> > > invest too much into that barrier and just go for the potentially
> > > longer, but easier to implement?
> > >
> > Right. I agree here. If the cache is not empty, OK, we just defer the
> > work, even we can use a big 21 seconds delay, after that we just "warn"
> > if it is still not empty and leave it as it is, i.e. emit a warning and
> > we are done.
> >
> > Destroying the cache is not something that must happen right away.
>
> OK, I have to ask...
>
> Suppose that the cache is created and destroyed by a module and
> init/cleanup time, respectively. Suppose that this module is rmmod'ed
> then very quickly insmod'ed.
>
> Do we need to fail the insmod if the kmem_cache has not yet been fully
> cleaned up? If not, do we have two versions of the same kmem_cache in
> /proc during the overlap time?
>
No fail :) If the same cache is created several times, its s->refcount gets
increased, so it does not create two entries in "slabinfo". But I agree
that your point is good! We need to be careful with removing and
simultaneously creating a cache.

At first glance, there is a refcounter and a global "slab_mutex" which is
used to protect the critical section. Destroying is almost fully protected
(as noted above, by the global mutex), with one exception:

static void kmem_cache_release(struct kmem_cache *s)
{
    if (slab_state >= FULL) {
        sysfs_slab_unlink(s);
        sysfs_slab_release(s);
    } else {
        slab_kmem_cache_release(s);
    }
}

this one can race, IMO.

--
Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Wed, Jun 19, 2024 at 11:56:44AM +0200, Vlastimil Babka wrote: > On 6/19/24 11:51 AM, Uladzislau Rezki wrote: > > On Tue, Jun 18, 2024 at 09:48:49AM -0700, Paul E. McKenney wrote: > >> On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> > > >> + > >> > > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> > > >> + > >> > > >> + // XXX use the real kmem_cache_free_barrier() or similar thing > >> > > >> here > >> > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, > >> > > > which i > >> > > > wanted to avoid initially. > >> > > > >> > > I wanted to avoid new API or flags for kfree_rcu() users and this would > >> > > be achieved. The barrier is used internally so I don't consider that an > >> > > API to avoid. How difficult is the implementation is another question, > >> > > depending on how the current batching works. Once (if) we have sheaves > >> > > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> > > also look different and hopefully easier. So maybe it's not worth to > >> > > invest too much into that barrier and just go for the potentially > >> > > longer, but easier to implement? > >> > > > >> > Right. I agree here. If the cache is not empty, OK, we just defer the > >> > work, even we can use a big 21 seconds delay, after that we just "warn" > >> > if it is still not empty and leave it as it is, i.e. emit a warning and > >> > we are done. > >> > > >> > Destroying the cache is not something that must happen right away. > >> > >> OK, I have to ask... > >> > >> Suppose that the cache is created and destroyed by a module and > >> init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> then very quickly insmod'ed. > >> > >> Do we need to fail the insmod if the kmem_cache has not yet been fully > >> cleaned up? If not, do we have two versions of the same kmem_cache in > >> /proc during the overlap time? > >> > > No fail :) If same cache is created several times, its s->refcount gets > > increased, so, it does not create two entries in the "slabinfo". But i > > agree that your point is good! We need to be carefully with removing and > > simultaneous creating. > > Note that this merging may be disabled or not happen due to various flags on > the cache being incompatible with it. And I want to actually make sure it > never happens for caches being already destroyed as that would lead to > use-after-free (the workfn doesn't recheck the refcount in case a merge > would happen during the grace period) > > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -150,9 +150,10 @@ int slab_unmergeable(struct kmem_cache *s) > #endif > > /* > -* We may have set a slab to be unmergeable during bootstrap. > +* We may have set a cache to be unmergeable during bootstrap. > +* 0 is for cache being destroyed asynchronously > */ > - if (s->refcount < 0) > + if (s->refcount <= 0) > return 1; > > return 0; > OK, i see such flags, SLAB_NO_MERGE. Then i was wrong, it can create two different slabs. Thanks! -- Uladzislau Rezki
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Wed, Jun 19, 2024 at 11:28:13AM +0200, Vlastimil Babka wrote: > On 6/18/24 7:53 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > >> On 6/18/24 6:48 PM, Paul E. McKenney wrote: > >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> >> > >> + > >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> >> > >> + > >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing > >> >> > >> here > >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new > >> >> > > API, which i > >> >> > > wanted to avoid initially. > >> >> > > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this > >> >> > would > >> >> > be achieved. The barrier is used internally so I don't consider that > >> >> > an > >> >> > API to avoid. How difficult is the implementation is another question, > >> >> > depending on how the current batching works. Once (if) we have sheaves > >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> >> > also look different and hopefully easier. So maybe it's not worth to > >> >> > invest too much into that barrier and just go for the potentially > >> >> > longer, but easier to implement? > >> >> > > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> >> we are done. > >> >> > >> >> Destroying the cache is not something that must happen right away. > >> > > >> > OK, I have to ask... > >> > > >> > Suppose that the cache is created and destroyed by a module and > >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> > then very quickly insmod'ed. > >> > > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully > >> > cleaned up? > >> > >> We don't have any such link between kmem_cache and module to detect that, > >> so > >> we would have to start tracking that. Probably not worth the trouble. > > > > Fair enough! > > > >> > If not, do we have two versions of the same kmem_cache in > >> > /proc during the overlap time? > >> > >> Hm could happen in /proc/slabinfo but without being harmful other than > >> perhaps confusing someone. We could filter out the caches being destroyed > >> trivially. > > > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > > breakage from flagging the name somehow, for example, trailing "/" > > character. > > Yeah I've been resisting such changes to the layout and this wouldn't be > worth it, apart from changing the name itself but not in a dangerous way > like with "/" :) > > >> Sysfs and debugfs might be more problematic as I suppose directory names > >> would clash. I'll have to check... might be even happening now when we do > >> detect leaked objects and just leave the cache around... thanks for the > >> question. > > > > "It is a service that I provide." ;-) > > > > But yes, we might be living with it already and there might already > > be ways people deal with it. > > So it seems if the sysfs/debugfs directories already exist, they will > silently not be created. Wonder if we have such cases today already because > caches with same name exist. I think we do with the zsmalloc using 32 caches > with same name that we discussed elsewhere just recently. 
> > Also indeed if the cache has leaked objects and won't be thus destroyed, > these directories indeed stay around, as well as the slabinfo entry, and can > prevent new ones from being created (slabinfo lines with same name are not > prevented). > > But it wouldn't be great to introduce this possibility to happen for the > temporarily delayed removal due to kfree_rcu() and a module re-insert, since > that's a legitimate case and not buggy st
Re: [PATCH 13/17] KVM: PPC: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Sun, Oct 13, 2024 at 10:17:00PM +0200, Julia Lawall wrote: > Since SLOB was removed and since > commit 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from > kmem_cache_destroy()"), > it is not necessary to use call_rcu when the callback only performs > kmem_cache_free. Use kfree_rcu() directly. > > The changes were made using Coccinelle. > > Signed-off-by: Julia Lawall > > --- > arch/powerpc/kvm/book3s_mmu_hpte.c |8 +--- > 1 file changed, 1 insertion(+), 7 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_mmu_hpte.c > b/arch/powerpc/kvm/book3s_mmu_hpte.c > index ce79ac33e8d3..d904e13e069b 100644 > --- a/arch/powerpc/kvm/book3s_mmu_hpte.c > +++ b/arch/powerpc/kvm/book3s_mmu_hpte.c > @@ -92,12 +92,6 @@ void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, > struct hpte_cache *pte) > spin_unlock(&vcpu3s->mmu_lock); > } > > -static void free_pte_rcu(struct rcu_head *head) > -{ > - struct hpte_cache *pte = container_of(head, struct hpte_cache, > rcu_head); > - kmem_cache_free(hpte_cache, pte); > -} > - > static void invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte) > { > struct kvmppc_vcpu_book3s *vcpu3s = to_book3s(vcpu); > @@ -126,7 +120,7 @@ static void invalidate_pte(struct kvm_vcpu *vcpu, struct > hpte_cache *pte) > > spin_unlock(&vcpu3s->mmu_lock); > > - call_rcu(&pte->rcu_head, free_pte_rcu); > + kfree_rcu(pte, rcu_head); > } > > static void kvmppc_mmu_pte_flush_all(struct kvm_vcpu *vcpu) > > Reviewed-by: Uladzislau Rezki (Sony) -- Uladzislau Rezki
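The conversion pattern the series relies on, shown on a made-up structure (struct foo, foo_cache and the helpers are illustrative, not from the patch): kfree_rcu() only needs an embedded struct rcu_head, and since kfree() can free slab-cache-backed objects, a call_rcu() callback whose only job is kmem_cache_free() can be dropped entirely. The kvfree_rcu_barrier() added to kmem_cache_destroy() by commit 6c6c47b063b5 is what makes this safe against cache destruction.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {				/* hypothetical object */
	int payload;
	struct rcu_head rcu_head;	/* required by kfree_rcu() */
};

static struct kmem_cache *foo_cache;	/* hypothetical cache */

/* Before: a dedicated callback whose only job is freeing the object. */
static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu_head);

	kmem_cache_free(foo_cache, f);
}

static void foo_release_old(struct foo *f)
{
	call_rcu(&f->rcu_head, foo_free_rcu);
}

/* After: kfree_rcu() frees the slab object once a grace period elapses. */
static void foo_release_new(struct foo *f)
{
	kfree_rcu(f, rcu_head);
}
```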
Re: [PATCH v6 1/8] mm: vmalloc: group declarations depending on CONFIG_MMU together
On Wed, Oct 16, 2024 at 03:24:17PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" > > There are a couple of declarations that depend on CONFIG_MMU in > include/linux/vmalloc.h spread all over the file. > > Group them all together to improve code readability. > > No functional changes. > > Signed-off-by: Mike Rapoport (Microsoft) > Reviewed-by: Christoph Hellwig > --- > include/linux/vmalloc.h | 60 + > 1 file changed, 24 insertions(+), 36 deletions(-) > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h > index ad2ce7a6ab7a..27408f21e501 100644 > --- a/include/linux/vmalloc.h > +++ b/include/linux/vmalloc.h > @@ -134,12 +134,6 @@ extern void vm_unmap_ram(const void *mem, unsigned int > count); > extern void *vm_map_ram(struct page **pages, unsigned int count, int node); > extern void vm_unmap_aliases(void); > > -#ifdef CONFIG_MMU > -extern unsigned long vmalloc_nr_pages(void); > -#else > -static inline unsigned long vmalloc_nr_pages(void) { return 0; } > -#endif > - > extern void *vmalloc_noprof(unsigned long size) __alloc_size(1); > #define vmalloc(...) alloc_hooks(vmalloc_noprof(__VA_ARGS__)) > > @@ -266,12 +260,29 @@ static inline bool is_vm_area_hugepages(const void > *addr) > #endif > } > > +/* for /proc/kcore */ > +long vread_iter(struct iov_iter *iter, const char *addr, size_t count); > + > +/* > + * Internals. Don't use.. > + */ > +__init void vm_area_add_early(struct vm_struct *vm); > +__init void vm_area_register_early(struct vm_struct *vm, size_t align); > + > +int register_vmap_purge_notifier(struct notifier_block *nb); > +int unregister_vmap_purge_notifier(struct notifier_block *nb); > + > #ifdef CONFIG_MMU > +#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) > + > +unsigned long vmalloc_nr_pages(void); > + > int vm_area_map_pages(struct vm_struct *area, unsigned long start, > unsigned long end, struct page **pages); > void vm_area_unmap_pages(struct vm_struct *area, unsigned long start, >unsigned long end); > void vunmap_range(unsigned long addr, unsigned long end); > + > static inline void set_vm_flush_reset_perms(void *addr) > { > struct vm_struct *vm = find_vm_area(addr); > @@ -279,24 +290,14 @@ static inline void set_vm_flush_reset_perms(void *addr) > if (vm) > vm->flags |= VM_FLUSH_RESET_PERMS; > } > +#else /* !CONFIG_MMU */ > +#define VMALLOC_TOTAL 0UL > > -#else > -static inline void set_vm_flush_reset_perms(void *addr) > -{ > -} > -#endif > - > -/* for /proc/kcore */ > -extern long vread_iter(struct iov_iter *iter, const char *addr, size_t > count); > - > -/* > - * Internals. Don't use.. 
> - */ > -extern __init void vm_area_add_early(struct vm_struct *vm); > -extern __init void vm_area_register_early(struct vm_struct *vm, size_t > align); > +static inline unsigned long vmalloc_nr_pages(void) { return 0; } > +static inline void set_vm_flush_reset_perms(void *addr) {} > +#endif /* CONFIG_MMU */ > > -#ifdef CONFIG_SMP > -# ifdef CONFIG_MMU > +#if defined(CONFIG_MMU) && defined(CONFIG_SMP) > struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, >const size_t *sizes, int nr_vms, >size_t align); > @@ -311,22 +312,9 @@ pcpu_get_vm_areas(const unsigned long *offsets, > return NULL; > } > > -static inline void > -pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) > -{ > -} > -# endif > -#endif > - > -#ifdef CONFIG_MMU > -#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) > -#else > -#define VMALLOC_TOTAL 0UL > +static inline void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) {} > #endif > > -int register_vmap_purge_notifier(struct notifier_block *nb); > -int unregister_vmap_purge_notifier(struct notifier_block *nb); > - > #if defined(CONFIG_MMU) && defined(CONFIG_PRINTK) > bool vmalloc_dump_obj(void *object); > #else > -- > 2.43.0 > Reviewed-by: Uladzislau Rezki (Sony) -- Uladzislau Rezki
Re: [PATCH v6 2/8] mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations
On Wed, Oct 16, 2024 at 03:24:18PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" > > vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly > specify node ID will use huge pages only if size_per_node is larger than > a huge page. > Still the actual allocated memory is not distributed between nodes and > there is no advantage in such approach. > On the contrary, BPF allocates SZ_2M * num_possible_nodes() for each > new bpf_prog_pack, while it could do with a single huge page per pack. > > Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with > NUMA_NO_NODE and use huge pages whenever the requested allocation size > is larger than a huge page. > > Signed-off-by: Mike Rapoport (Microsoft) > Reviewed-by: Christoph Hellwig > --- > mm/vmalloc.c | 9 ++--- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 634162271c00..86b2344d7461 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -3763,8 +3763,6 @@ void *__vmalloc_node_range_noprof(unsigned long size, > unsigned long align, > } > > if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) { > - unsigned long size_per_node; > - > /* >* Try huge pages. Only try for PAGE_KERNEL allocations, >* others like modules don't yet expect huge pages in > @@ -3772,13 +3770,10 @@ void *__vmalloc_node_range_noprof(unsigned long size, > unsigned long align, >* supporting them. >*/ > > - size_per_node = size; > - if (node == NUMA_NO_NODE) > - size_per_node /= num_online_nodes(); > - if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE) > + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE) > shift = PMD_SHIFT; > else > - shift = arch_vmap_pte_supported_shift(size_per_node); > + shift = arch_vmap_pte_supported_shift(size); > > align = max(real_align, 1UL << shift); > size = ALIGN(real_size, 1UL << shift); >
Looking at this place, I see that overwriting "size" is a bit hard to
follow. Below we have the following code:

...
again:
	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
		VM_UNINITIALIZED | vm_flags, start, end, node,
		gfp_mask, caller);
...

where we pass "real_size", whereas there is only one place in the
__vmalloc_node_range_noprof() function where "size" is used. It is at
the end of the function:

...
	size = PAGE_ALIGN(size);
	if (!(vm_flags & VM_DEFER_KMEMLEAK))
		kmemleak_vmalloc(area, size, gfp_mask);

	return area->addr;

As for this patch:

Reviewed-by: Uladzislau Rezki (Sony)

--
Uladzislau Rezki
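A worked example of what dropping size_per_node changes, assuming PMD_SIZE is 2 MiB and a two-node machine (numbers purely illustrative): for a 2 MiB VM_ALLOW_HUGE_VMAP request with NUMA_NO_NODE, the old code divided the size by the node count and fell back to small pages, which is why BPF inflated each bpf_prog_pack to SZ_2M * num_possible_nodes(); the new code keys the decision off the full request size.

```c
/*
 * 2 MiB request, NUMA_NO_NODE, two online nodes, PMD_SIZE == 2 MiB:
 *
 *   before: size_per_node = 2M / num_online_nodes() = 1M
 *           1M < PMD_SIZE, and on most architectures
 *           arch_vmap_pte_supported_shift(1M) == PAGE_SHIFT,
 *           so the area is mapped with base (4 KiB) PTEs.
 *
 *   after:  the full request size is used directly:
 */
if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
	shift = PMD_SHIFT;	/* 2M >= 2M -> one PMD-mapped huge page */
else
	shift = arch_vmap_pte_supported_shift(size);
```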
Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
    89.29%  89.25%  [kernel]  [k] native_queued_spin_lock_slowpath
            69.82% ret_from_fork_asm
             - ret_from_fork
                - 69.81% kthread
                   - 69.66% test_func
                      - 26.31% full_fit_alloc_test
                         - 19.11% __vmalloc_node_noprof
                            - __vmalloc_node_range_noprof
                               - 13.73% vmap_small_pages_range_noflush
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                               - 5.38% __get_vm_area_node
                                    alloc_vmap_area
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                         - 13.32% vfree.part.0
                            - 13.31% remove_vm_area
                               - 13.27% __vunmap_range_noflush
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                      - 25.57% fix_size_alloc_test
                         - 22.59% __vmalloc_node_noprof
                            - __vmalloc_node_range_noprof
                               - 17.34% vmap_small_pages_range_noflush
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                               - 5.25% __get_vm_area_node
                                    alloc_vmap_area
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                         - 11.59% vfree.part.0
                            - remove_vm_area
                               - 11.55% __vunmap_range_noflush
                                    _raw_spin_lock
                                    native_queued_spin_lock_slowpath
                      - 17.78% long_busy_list_alloc_test
                         - 13.90% __vmalloc_node_noprof
                            - __vmalloc_node_range_noprof
                               - 9.95% vmap_small_pages_range_noflush
                                    _raw_spin_lock

No, we can not take this patch.

--
Uladzislau Rezki
Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
On Thu, Jun 12, 2025 at 07:36:09PM +0200, Alexander Gordeev wrote: > As a follow-up to commit 691ee97e1a9d ("mm: fix lazy mmu docs and > usage") take a step forward and protect with a lock not only user, > but also kernel mappings before entering the lazy MMU mode. With > that the semantics of arch_enter|leave_lazy_mmu_mode() callbacks > is consolidated, which allows further simplifications. > > The effect of this consolidation is not fully preemptible (Real-Time) > kernels can not enter the context switch while the lazy MMU mode is > active - which is easier to comprehend. > > Signed-off-by: Alexander Gordeev > --- > include/linux/pgtable.h | 12 ++-- > mm/kasan/shadow.c | 5 - > mm/memory.c | 5 - > mm/vmalloc.c| 6 ++ > 4 files changed, 16 insertions(+), 12 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 0b6e1f781d86..33bf2b13c219 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -224,12 +224,12 @@ static inline int pmd_dirty(pmd_t pmd) > * a raw PTE pointer after it has been modified are not guaranteed to be > * up to date. > * > - * In the general case, no lock is guaranteed to be held between entry and > exit > - * of the lazy mode. So the implementation must assume preemption may be > enabled > - * and cpu migration is possible; it must take steps to be robust against > this. > - * (In practice, for user PTE updates, the appropriate page table lock(s) are > - * held, but for kernel PTE updates, no lock is held). Nesting is not > permitted > - * and the mode cannot be used in interrupt context. > + * For PREEMPT_RT kernels implementation must assume that preemption may > + * be enabled and cpu migration is possible between entry and exit of the > + * lazy MMU mode; it must take steps to be robust against this. There is > + * no such assumption for non-PREEMPT_RT kernels, since both kernel and > + * user page tables are protected with a spinlock while in lazy MMU mode. > + * Nesting is not permitted and the mode cannot be used in interrupt context. 
> */ > #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE > #define arch_enter_lazy_mmu_mode() do {} while (0) > diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c > index d2c70cd2afb1..45115bd770a9 100644 > --- a/mm/kasan/shadow.c > +++ b/mm/kasan/shadow.c > @@ -313,12 +313,10 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, > unsigned long addr, > __memset(page_to_virt(page), KASAN_VMALLOC_INVALID, PAGE_SIZE); > pte = pfn_pte(page_to_pfn(page), PAGE_KERNEL); > > - spin_lock(&init_mm.page_table_lock); > if (likely(pte_none(ptep_get(ptep { > set_pte_at(&init_mm, addr, ptep, pte); > data->pages[index] = NULL; > } > - spin_unlock(&init_mm.page_table_lock); > > return 0; > } > @@ -465,13 +463,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, > unsigned long addr, > > page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT); > > - spin_lock(&init_mm.page_table_lock); > - > if (likely(!pte_none(ptep_get(ptep { > pte_clear(&init_mm, addr, ptep); > free_page(page); > } > - spin_unlock(&init_mm.page_table_lock); > > return 0; > } > diff --git a/mm/memory.c b/mm/memory.c > index 71b3d3f98999..1ddc532b1f13 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3017,6 +3017,7 @@ static int apply_to_pte_range(struct mm_struct *mm, > pmd_t *pmd, > pte = pte_offset_kernel(pmd, addr); > if (!pte) > return err; > + spin_lock(&init_mm.page_table_lock); > } else { > if (create) > pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); > @@ -3042,7 +3043,9 @@ static int apply_to_pte_range(struct mm_struct *mm, > pmd_t *pmd, > > arch_leave_lazy_mmu_mode(); > > - if (mm != &init_mm) > + if (mm == &init_mm) > + spin_unlock(&init_mm.page_table_lock); > + else > pte_unmap_unlock(mapped_pte, ptl); > > *mask |= PGTBL_PTE_MODIFIED; > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index ab986dd09b6a..57b11000ae36 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -105,6 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, > unsigned long end, > if (!pte) > return -ENOMEM; > > + spin_lock(&init_mm.page_table_lock); > This is not good. We introduce another bottle-neck. -- Uladzislau Rezki
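To make the objection concrete, below is a trimmed-down, hypothetical rendering of the kernel PTE mapping loop with the proposed lock in place (init_mm.page_table_lock and the helpers are real kernel symbols, but the function name and parameter list are simplified and do not match the actual vmap_pte_range() signature). Every small-page vmalloc() now takes the single global lock for each PTE table it fills, so unrelated allocations on different CPUs serialize on one spinlock, which is the native_queued_spin_lock_slowpath contention visible in the perf profile earlier in the thread.

```c
static int map_kernel_ptes_sketch(pmd_t *pmd, unsigned long addr,
				  unsigned long end, pgprot_t prot,
				  struct page **pages, int *nr)
{
	pte_t *pte = pte_alloc_kernel(pmd, addr);

	if (!pte)
		return -ENOMEM;

	/* One lock shared by every vmalloc/vfree in the system. */
	spin_lock(&init_mm.page_table_lock);
	arch_enter_lazy_mmu_mode();
	do {
		set_pte_at(&init_mm, addr, pte, mk_pte(pages[(*nr)++], prot));
	} while (pte++, addr += PAGE_SIZE, addr != end);
	arch_leave_lazy_mmu_mode();
	spin_unlock(&init_mm.page_table_lock);

	return 0;
}
```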
Re: [PATCH 2/3] mm: update core kernel code to use vm_flags_t consistently
On Tue, Jul 29, 2025 at 06:25:39AM +0100, Lorenzo Stoakes wrote: > Andrew - FYI there's nothing to worry about here, the type remains > precisely the same, and I'll send a patch to fix this trivial issue so when > later this type changes vmalloc will be uaffected. > > On Tue, Jul 29, 2025 at 09:15:51AM +0900, Harry Yoo wrote: > > [Adding Uladzislau to Cc] > > Ulad - could we PLEASE get rid of 'vm_flags' in vmalloc? It's the precise > same name and (currently) type as vma->vm_flags and is already the source > of confusion. >
You mean all of the "vm_flags" variable names? "vm_struct" has "flags" as a
member. So you want:

urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags mm/execmem.c
29:		pgprot_t pgprot, unsigned long vm_flags)
39:	vm_flags |= VM_DEFER_KMEMLEAK;
41:	if (vm_flags & VM_ALLOW_HUGE_VMAP)
45:			pgprot, vm_flags, NUMA_NO_NODE,
51:			pgprot, vm_flags, NUMA_NO_NODE,
85:		pgprot_t pgprot, unsigned long vm_flags)
259:	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
266:	p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags);
376:	unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
385:	p = execmem_vmalloc(range, size, pgprot, vm_flags);
urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags mm/vmalloc.c
3853: * @vm_flags:	additional vm area flags (e.g. %VM_NO_GUARD)
3875:		pgprot_t prot, unsigned long vm_flags, int node,
3894:	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
3912:		VM_UNINITIALIZED | vm_flags, start, end, node,
3977:	if (!(vm_flags & VM_DEFER_KMEMLEAK))
4621:	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags mm/execmem.c
29:		pgprot_t pgprot, unsigned long vm_flags)
39:	vm_flags |= VM_DEFER_KMEMLEAK;
41:	if (vm_flags & VM_ALLOW_HUGE_VMAP)
45:			pgprot, vm_flags, NUMA_NO_NODE,
51:			pgprot, vm_flags, NUMA_NO_NODE,
85:		pgprot_t pgprot, unsigned long vm_flags)
259:	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
266:	p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags);
376:	unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
385:	p = execmem_vmalloc(range, size, pgprot, vm_flags);
urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags ./include/linux/vmalloc.h
172:		pgprot_t prot, unsigned long vm_flags, int node,
urezki@pc638:~/data/backup/coding/linux-not-broken.git$

to rename all those "vm_flags" to something else, for example "flags"?

Thanks!

--
Uladzislau Rezki
Re: [PATCH 2/3] mm: update core kernel code to use vm_flags_t consistently
Hello, Lorenzo! > So sorry Ulad, I meant to get back to you on this sooner! > > On Tue, Jul 29, 2025 at 08:39:01PM +0200, Uladzislau Rezki wrote: > > On Tue, Jul 29, 2025 at 06:25:39AM +0100, Lorenzo Stoakes wrote: > > > Andrew - FYI there's nothing to worry about here, the type remains > > > precisely the same, and I'll send a patch to fix this trivial issue so > > > when > > > later this type changes vmalloc will be uaffected. > > > > > > On Tue, Jul 29, 2025 at 09:15:51AM +0900, Harry Yoo wrote: > > > > [Adding Uladzislau to Cc] > > > > > > Ulad - could we PLEASE get rid of 'vm_flags' in vmalloc? It's the precise > > > same name and (currently) type as vma->vm_flags and is already the source > > > of confusion. > > > > > You mean all "vm_flags" variable names? "vm_struct" has flags as a > > member. So you want: > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags > > mm/execmem.c > > 29: pgprot_t pgprot, unsigned long vm_flags) > > 39: vm_flags |= VM_DEFER_KMEMLEAK; > > 41: if (vm_flags & VM_ALLOW_HUGE_VMAP) > > 45: pgprot, vm_flags, NUMA_NO_NODE, > > 51: pgprot, vm_flags, NUMA_NO_NODE, > > 85: pgprot_t pgprot, unsigned long vm_flags) > > 259:unsigned long vm_flags = VM_ALLOW_HUGE_VMAP; > > 266:p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags); > > 376:unsigned long vm_flags = VM_FLUSH_RESET_PERMS; > > 385:p = execmem_vmalloc(range, size, pgprot, vm_flags); > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags > > mm/vmalloc.c > > 3853: * @vm_flags:additional vm area flags (e.g. > > %VM_NO_GUARD) > > 3875: pgprot_t prot, unsigned long vm_flags, int node, > > 3894: if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) { > > 3912: VM_UNINITIALIZED | vm_flags, start, end, > > node, > > 3977: if (!(vm_flags & VM_DEFER_KMEMLEAK)) > > 4621: vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP); > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags > > mm/execmem.c > > 29: pgprot_t pgprot, unsigned long vm_flags) > > 39: vm_flags |= VM_DEFER_KMEMLEAK; > > 41: if (vm_flags & VM_ALLOW_HUGE_VMAP) > > 45: pgprot, vm_flags, NUMA_NO_NODE, > > 51: pgprot, vm_flags, NUMA_NO_NODE, > > 85: pgprot_t pgprot, unsigned long vm_flags) > > 259:unsigned long vm_flags = VM_ALLOW_HUGE_VMAP; > > 266:p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags); > > 376:unsigned long vm_flags = VM_FLUSH_RESET_PERMS; > > 385:p = execmem_vmalloc(range, size, pgprot, vm_flags); > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn vm_flags > > ./include/linux/vmalloc.h > > 172:pgprot_t prot, unsigned long vm_flags, int node, > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ > > > > to rename all those "vm_flags" to something, for example, like "flags"? > > Yeah, sorry I know it's a churny pain, but I think it's such a silly source > of confusion _in general_, not only this series where I made a mistake (of > course entirely my fault but certainly more understandable given the > naming), but in the past I've certainly sat there thinking 'hmmm wait' :) > > Really I think we should rename 'vm_struct' too, but if that causes _too > much_ churn fair enough. > > I think even though it's long-winded, 'vmalloc_flags' would be good, both > in fields and local params as it makes things very very clear. > > > Equally 'vm_struct' -> 'vmalloc_struct' would be a good change. > Uh.. This could be a pain :) I will have a look and see what we can do. Thanks! -- Uladzislau Rezki
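For concreteness, the suggested rename would look roughly like this on the prototype the earlier grep hit in include/linux/vmalloc.h (a sketch of the proposal, not a merged change; the new parameter name follows the suggestion above):

```c
/* Current: the parameter shares its name (and type) with vma->vm_flags. */
void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
		unsigned long start, unsigned long end, gfp_t gfp_mask,
		pgprot_t prot, unsigned long vm_flags, int node,
		const void *caller);

/* Suggested: an unmistakably vmalloc-specific name. */
void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
		unsigned long start, unsigned long end, gfp_t gfp_mask,
		pgprot_t prot, unsigned long vmalloc_flags, int node,
		const void *caller);
```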
Re: [PATCH 2/3] mm: update core kernel code to use vm_flags_t consistently
On Tue, Aug 05, 2025 at 12:37:57PM +0300, Mike Rapoport wrote: > On Mon, Aug 04, 2025 at 12:54:21PM +0200, Uladzislau Rezki wrote: > > Hello, Lorenzo! > > > > > So sorry Ulad, I meant to get back to you on this sooner! > > > > > > On Tue, Jul 29, 2025 at 08:39:01PM +0200, Uladzislau Rezki wrote: > > > > On Tue, Jul 29, 2025 at 06:25:39AM +0100, Lorenzo Stoakes wrote: > > > > > Andrew - FYI there's nothing to worry about here, the type remains > > > > > precisely the same, and I'll send a patch to fix this trivial issue > > > > > so when > > > > > later this type changes vmalloc will be uaffected. > > > > > > > > > > On Tue, Jul 29, 2025 at 09:15:51AM +0900, Harry Yoo wrote: > > > > > > [Adding Uladzislau to Cc] > > > > > > > > > > Ulad - could we PLEASE get rid of 'vm_flags' in vmalloc? It's the > > > > > precise > > > > > same name and (currently) type as vma->vm_flags and is already the > > > > > source > > > > > of confusion. > > > > > > > > > You mean all "vm_flags" variable names? "vm_struct" has flags as a > > > > member. So you want: > > > > > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn > > > > vm_flags mm/execmem.c > > > > 29: pgprot_t pgprot, unsigned long vm_flags) > > > > 39: vm_flags |= VM_DEFER_KMEMLEAK; > > > > 41: if (vm_flags & VM_ALLOW_HUGE_VMAP) > > > > 45: pgprot, vm_flags, NUMA_NO_NODE, > > > > 51: pgprot, vm_flags, NUMA_NO_NODE, > > > > 85: pgprot_t pgprot, unsigned long vm_flags) > > > > 259:unsigned long vm_flags = VM_ALLOW_HUGE_VMAP; > > > > 266:p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags); > > > > 376:unsigned long vm_flags = VM_FLUSH_RESET_PERMS; > > > > 385:p = execmem_vmalloc(range, size, pgprot, vm_flags); > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn > > > > vm_flags mm/vmalloc.c > > > > 3853: * @vm_flags:additional vm area flags (e.g. > > > > %VM_NO_GUARD) > > > > 3875: pgprot_t prot, unsigned long vm_flags, int node, > > > > 3894: if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) { > > > > 3912: VM_UNINITIALIZED | vm_flags, start, > > > > end, node, > > > > 3977: if (!(vm_flags & VM_DEFER_KMEMLEAK)) > > > > 4621: vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP); > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn > > > > vm_flags mm/execmem.c > > > > 29: pgprot_t pgprot, unsigned long vm_flags) > > > > 39: vm_flags |= VM_DEFER_KMEMLEAK; > > > > 41: if (vm_flags & VM_ALLOW_HUGE_VMAP) > > > > 45: pgprot, vm_flags, NUMA_NO_NODE, > > > > 51: pgprot, vm_flags, NUMA_NO_NODE, > > > > 85: pgprot_t pgprot, unsigned long vm_flags) > > > > 259:unsigned long vm_flags = VM_ALLOW_HUGE_VMAP; > > > > 266:p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags); > > > > 376:unsigned long vm_flags = VM_FLUSH_RESET_PERMS; > > > > 385:p = execmem_vmalloc(range, size, pgprot, vm_flags); > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ grep -rn > > > > vm_flags ./include/linux/vmalloc.h > > > > 172:pgprot_t prot, unsigned long vm_flags, int node, > > > > urezki@pc638:~/data/backup/coding/linux-not-broken.git$ > > > > > > > > to rename all those "vm_flags" to something, for example, like "flags"? 
> > > > > > Yeah, sorry I know it's a churny pain, but I think it's such a silly > > > source > > > of confusion _in general_, not only this series where I made a mistake (of > > > course entirely my fault but certainly more understandable given the > > > naming), but in the past I've certainly sat there thinking 'hmmm wait' :) > > > > > > Really I think we should rename 'vm_struct' too, but if that causes _to