Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED
On Tue, May 23, 2023 at 6:12 PM Vlastimil Babka wrote: > > As discussed at LSF/MM [1] [2] and with no objections raised there, > deprecate the SLAB allocator. Rename the user-visible option so that > users with CONFIG_SLAB=y get a new prompt with explanation during make > oldconfig, while make olddefconfig will just switch to SLUB. > > In all defconfigs with CONFIG_SLAB=y remove the line so those also > switch to SLUB. Regressions due to the switch should be reported to > linux-mm and slab maintainers. > > [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/ > [2] https://lwn.net/Articles/932201/ > > Signed-off-by: Vlastimil Babka > --- > diff --git a/mm/Kconfig b/mm/Kconfig > index 7672a22647b4..b537c4436d18 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -218,11 +218,18 @@ choice > help >This option allows to select a slab allocator. > > -config SLAB > - bool "SLAB" > +config SLAB_DEPRECATED > + bool "SLAB (DEPRECATED)" > depends on !PREEMPT_RT > select HAVE_HARDENED_USERCOPY_ALLOCATOR > help > + Deprecated and scheduled for removal in a few cycles. Replaced by > + SLUB. > + > + If you cannot migrate to SLUB, please contact linux...@kvack.org > + and the people listed in the SLAB ALLOCATOR section of MAINTAINERS > + file, explaining why. > + > The regular slab allocator that is established and known to work > well in all environments. It organizes cache hot objects in > per cpu and per node queues. > @@ -240,6 +247,11 @@ config SLUB > > endchoice > > +config SLAB > + bool > + default y > + depends on SLAB_DEPRECATED > + > config SLUB_TINY > bool "Configure SLUB for minimal memory footprint" > depends on SLUB && EXPERT > -- > 2.40.1 Thank you for the work! It looks good to me. let's see some users raise their voice. Acked-by: Hyeonggon Yoo <42.hye...@gmail.com>
Re: [PATCH v4 4/7] mm: replace vma->vm_flags direct modifications with modifier calls
On Thu, Jan 26, 2023 at 11:37:49AM -0800, Suren Baghdasaryan wrote: > Replace direct modifications to vma->vm_flags with calls to modifier > functions to be able to track flag changes and to keep vma locking > correctness. > > Signed-off-by: Suren Baghdasaryan > Acked-by: Michal Hocko > Acked-by: Mel Gorman > Acked-by: Mike Rapoport (IBM) > Acked-by: Sebastian Reichel > --- > arch/arm/kernel/process.c | 2 +- > arch/ia64/mm/init.c| 8 > arch/loongarch/include/asm/tlb.h | 2 +- > arch/powerpc/kvm/book3s_xive_native.c | 2 +- > arch/powerpc/mm/book3s64/subpage_prot.c| 2 +- > arch/powerpc/platforms/book3s/vas-api.c| 2 +- > arch/powerpc/platforms/cell/spufs/file.c | 14 +++--- > arch/s390/mm/gmap.c| 3 +-- > arch/x86/entry/vsyscall/vsyscall_64.c | 2 +- > arch/x86/kernel/cpu/sgx/driver.c | 2 +- > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > arch/x86/mm/pat/memtype.c | 6 +++--- > arch/x86/um/mem_32.c | 2 +- > drivers/acpi/pfr_telemetry.c | 2 +- > drivers/android/binder.c | 3 +-- > drivers/char/mspec.c | 2 +- > drivers/crypto/hisilicon/qm.c | 2 +- > drivers/dax/device.c | 2 +- > drivers/dma/idxd/cdev.c| 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c| 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_events.c| 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- > drivers/gpu/drm/drm_gem.c | 2 +- > drivers/gpu/drm/drm_gem_dma_helper.c | 3 +-- > drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +- > drivers/gpu/drm/drm_vm.c | 8 > drivers/gpu/drm/etnaviv/etnaviv_gem.c | 2 +- > drivers/gpu/drm/exynos/exynos_drm_gem.c| 4 ++-- > drivers/gpu/drm/gma500/framebuffer.c | 2 +- > drivers/gpu/drm/i810/i810_dma.c| 2 +- > drivers/gpu/drm/i915/gem/i915_gem_mman.c | 4 ++-- > drivers/gpu/drm/mediatek/mtk_drm_gem.c | 2 +- > drivers/gpu/drm/msm/msm_gem.c | 2 +- > drivers/gpu/drm/omapdrm/omap_gem.c | 3 +-- > drivers/gpu/drm/rockchip/rockchip_drm_gem.c| 3 +-- > drivers/gpu/drm/tegra/gem.c| 5 ++--- > drivers/gpu/drm/ttm/ttm_bo_vm.c| 3 +-- > drivers/gpu/drm/virtio/virtgpu_vram.c | 2 +- > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 2 +- > drivers/gpu/drm/xen/xen_drm_front_gem.c| 3 +-- > drivers/hsi/clients/cmt_speech.c | 2 +- > drivers/hwtracing/intel_th/msu.c | 2 +- > drivers/hwtracing/stm/core.c | 2 +- > drivers/infiniband/hw/hfi1/file_ops.c | 4 ++-- > drivers/infiniband/hw/mlx5/main.c | 4 ++-- > drivers/infiniband/hw/qib/qib_file_ops.c | 13 ++--- > drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 2 +- > drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c| 2 +- > .../media/common/videobuf2/videobuf2-dma-contig.c | 2 +- > drivers/media/common/videobuf2/videobuf2-vmalloc.c | 2 +- > drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +- > drivers/media/v4l2-core/videobuf-dma-sg.c | 4 ++-- > drivers/media/v4l2-core/videobuf-vmalloc.c | 2 +- > drivers/misc/cxl/context.c | 2 +- > drivers/misc/habanalabs/common/memory.c| 2 +- > drivers/misc/habanalabs/gaudi/gaudi.c | 4 ++-- > drivers/misc/habanalabs/gaudi2/gaudi2.c| 8 > drivers/misc/habanalabs/goya/goya.c| 4 ++-- > drivers/misc/ocxl/context.c| 4 ++-- > drivers/misc/ocxl/sysfs.c | 2 +- > drivers/misc/open-dice.c | 4 ++-- > drivers/misc/sgi-gru/grufile.c | 4 ++-- > drivers/misc/uacce/uacce.c | 2 +- > drivers/sbus/char/oradax.c | 2 +- > drivers/scsi/cxlflash/ocxl_hw.c| 2 +- > drivers/scsi/sg.c | 2 +- > drivers/staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +- > drivers/staging/media/deprecated/meye/meye.c | 4 ++-- > .../media/deprecated/stkwebcam/stk-webcam.c| 2 +- > drivers/target/target_core_user.c | 2 +- > 
drivers/uio/uio.c | 2 +- > dr
Re: [PATCH v4 4/7] mm: replace vma->vm_flags direct modifications with modifier calls
On Tue, Jan 31, 2023 at 10:54:22AM -0800, Suren Baghdasaryan wrote: > On Tue, Jan 31, 2023 at 12:32 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > > > On Thu, Jan 26, 2023 at 11:37:49AM -0800, Suren Baghdasaryan wrote: > > > Replace direct modifications to vma->vm_flags with calls to modifier > > > functions to be able to track flag changes and to keep vma locking > > > correctness. > > > > > > Signed-off-by: Suren Baghdasaryan > > > Acked-by: Michal Hocko > > > Acked-by: Mel Gorman > > > Acked-by: Mike Rapoport (IBM) > > > Acked-by: Sebastian Reichel > > > --- > > > arch/arm/kernel/process.c | 2 +- > > > 120 files changed, 188 insertions(+), 199 deletions(-) > > > > > > > Hello Suren, > > Hi Hyeonggon, > > > > > [...] > > > > Whoa, it's so long. > > Mostly looks fine but two things I'm not sure about: > > > > > diff --git a/drivers/misc/open-dice.c b/drivers/misc/open-dice.c > > > index 9dda47b3fd70..7be4e6c9f120 100644 > > > --- a/drivers/misc/open-dice.c > > > +++ b/drivers/misc/open-dice.c > > > @@ -95,12 +95,12 @@ static int open_dice_mmap(struct file *filp, struct > > > vm_area_struct *vma) > > > if (vma->vm_flags & VM_WRITE) > > > return -EPERM; > > > /* Ensure userspace cannot acquire VM_WRITE later. */ > > > - vma->vm_flags &= ~VM_MAYWRITE; > > > + vm_flags_clear(vma, VM_MAYSHARE); > > > } > > > > I think it should be: > > s/VM_MAYSHARE/VM_MAYWRITE/ > > Good eye! Yes, this is definitely a bug. Will post a next version with this > fix. > > > > > > diff --git a/mm/mlock.c b/mm/mlock.c > > > index 5c4fff93cd6b..ed49459e343e 100644 > > > --- a/mm/mlock.c > > > +++ b/mm/mlock.c > > > @@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct > > > vm_area_struct *vma, > > >*/ > > > if (newflags & VM_LOCKED) > > > newflags |= VM_IO; > > > - WRITE_ONCE(vma->vm_flags, newflags); > > > + vm_flags_reset(vma, newflags); > > > > > > lru_add_drain(); > > > walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); > > > @@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct > > > vm_area_struct *vma, > > > > > > if (newflags & VM_IO) { > > > newflags &= ~VM_IO; > > > - WRITE_ONCE(vma->vm_flags, newflags); > > > + vm_flags_reset(vma, newflags); > > > } > > > } > > > > wondering the if the comment above is still true? > > > > /* > > * There is a slight chance that concurrent page migration, > > * or page reclaim finding a page of this now-VM_LOCKED vma, > > * will call mlock_vma_folio() and raise page's mlock_count: > > * double counting, leaving the page unevictable indefinitely. > > * Communicate this danger to mlock_vma_folio() with VM_IO, > > * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas. > > * mmap_lock is held in write mode here, so this weird > > * combination should not be visible to other mmap_lock users; > > * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED. > > */ > > > > does ACCESS_PRIVATE() still guarentee that compiler cannot mysteriously > > optimize writes like WRITE_ONCE()? > > I don't see ACCESS_PRIVATE() providing the same guarantees as > WRITE_ONCE(), therefore I think this also needs to be changed. I'll > need to introduce something like vm_flags_reset_once() and use it > here. vm_flags_reset_once() would do WRITE_ONCE() and otherwise would > be identical to vm_flags_reset(). > > I'll post a new version with the fixes later today. > > Thanks for the review! > Suren. Thanks for quick reply! Andrew's fix and the new patch looks good to me. 
With these two things addressed:

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

Regards,
Hyeonggon
Re: [PATCH 1/1] mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates
On Tue, Jan 31, 2023 at 04:01:16PM -0800, Suren Baghdasaryan wrote: > Provide vm_flags_reset_once() and replace the vm_flags updates which used > WRITE_ONCE() to prevent compiler optimizations. > > Fixes: 0cce31a0aa0e ("mm: replace vma->vm_flags direct modifications with > modifier calls") > Reported-by: Hyeonggon Yoo <42.hye...@gmail.com> > Signed-off-by: Suren Baghdasaryan > --- > Notes: > - The patch applies cleanly over mm-unstable > - The SHA in Fixes: line is from mm-unstable, so is... unstable > > include/linux/mm.h | 7 +++ > mm/mlock.c | 4 ++-- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 5bf0ad48faaa..23ce04f6e91e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -648,6 +648,13 @@ static inline void vm_flags_reset(struct vm_area_struct > *vma, > vm_flags_init(vma, flags); > } > > +static inline void vm_flags_reset_once(struct vm_area_struct *vma, > +vm_flags_t flags) > +{ > + mmap_assert_write_locked(vma->vm_mm); > + WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags); > +} > + > static inline void vm_flags_set(struct vm_area_struct *vma, > vm_flags_t flags) > { > diff --git a/mm/mlock.c b/mm/mlock.c > index ed49459e343e..617469fce96d 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct > *vma, >*/ > if (newflags & VM_LOCKED) > newflags |= VM_IO; > - vm_flags_reset(vma, newflags); > + vm_flags_reset_once(vma, newflags); > > lru_add_drain(); > walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); > @@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct vm_area_struct > *vma, > > if (newflags & VM_IO) { > newflags &= ~VM_IO; > - vm_flags_reset(vma, newflags); > + vm_flags_reset_once(vma, newflags); > } > } Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > > -- > 2.39.1.456.gfc5497dd1b-goog >
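P.S. For the record, this is roughly the reader/writer pairing I had in mind
when raising the WRITE_ONCE() question; it is only an illustration
(vma_is_locked_sketch() is a made-up helper, not kernel code):

/*
 * Illustration only, not kernel code: the reason vm_flags_reset_once()
 * needs WRITE_ONCE() is that a lockless reader (e.g. an rmap walker) may
 * load vm_flags without holding mmap_lock.  Pairing WRITE_ONCE() with
 * READ_ONCE() guarantees such a reader observes either the old or the new
 * flags value, never a torn or compiler-refetched intermediate one.
 */
static bool vma_is_locked_sketch(struct vm_area_struct *vma)
{
	/* hypothetical lockless reader */
	return READ_ONCE(vma->vm_flags) & VM_LOCKED;
}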
Re: [PATCH v4 2/7] mm: introduce vma->vm_flags wrapper functions
On Thu, Jan 26, 2023 at 11:37:47AM -0800, Suren Baghdasaryan wrote: > vm_flags are among VMA attributes which affect decisions like VMA merging > and splitting. Therefore all vm_flags modifications are performed after > taking exclusive mmap_lock to prevent vm_flags updates racing with such > operations. Introduce modifier functions for vm_flags to be used whenever > flags are updated. This way we can better check and control correct > locking behavior during these updates. > > Signed-off-by: Suren Baghdasaryan > Reviewed-by: Davidlohr Bueso > Acked-by: Michal Hocko > Acked-by: Mel Gorman > --- > include/linux/mm.h | 40 > include/linux/mm_types.h | 10 +- > 2 files changed, 49 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 8d636e895ee9..abb31103d060 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -627,6 +627,46 @@ static inline void vma_init(struct vm_area_struct *vma, > struct mm_struct *mm) > INIT_LIST_HEAD(&vma->anon_vma_chain); > } > > +/* Use when VMA is not part of the VMA tree and needs no locking */ > +static inline void vm_flags_init(struct vm_area_struct *vma, > + vm_flags_t flags) > +{ > + ACCESS_PRIVATE(vma, __vm_flags) = flags; > +} > + > +/* Use when VMA is part of the VMA tree and modifications need coordination > */ > +static inline void vm_flags_reset(struct vm_area_struct *vma, > + vm_flags_t flags) > +{ > + mmap_assert_write_locked(vma->vm_mm); > + vm_flags_init(vma, flags); > +} > + > +static inline void vm_flags_set(struct vm_area_struct *vma, > + vm_flags_t flags) > +{ > + mmap_assert_write_locked(vma->vm_mm); > + ACCESS_PRIVATE(vma, __vm_flags) |= flags; > +} > + > +static inline void vm_flags_clear(struct vm_area_struct *vma, > + vm_flags_t flags) > +{ > + mmap_assert_write_locked(vma->vm_mm); > + ACCESS_PRIVATE(vma, __vm_flags) &= ~flags; > +} > + > +/* > + * Use only when the order of set/clear operations is unimportant, otherwise > + * use vm_flags_{set|clear} explicitly. > + */ > +static inline void vm_flags_mod(struct vm_area_struct *vma, > + vm_flags_t set, vm_flags_t clear) > +{ > + mmap_assert_write_locked(vma->vm_mm); > + vm_flags_init(vma, (vma->vm_flags | set) & ~clear); > +} > + > static inline void vma_set_anonymous(struct vm_area_struct *vma) > { > vma->vm_ops = NULL; > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 2d6d790d9bed..da983aedb741 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -491,7 +491,15 @@ struct vm_area_struct { >* See vmf_insert_mixed_prot() for discussion. >*/ > pgprot_t vm_page_prot; > - unsigned long vm_flags; /* Flags, see mm.h. */ > + > + /* > + * Flags, see mm.h. > + * To modify use vm_flags_{init|reset|set|clear|mod} functions. > + */ > + union { > + const vm_flags_t vm_flags; > + vm_flags_t __private __vm_flags; > + }; > > /* >* For areas with an address space and backing store, Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > -- > 2.39.1 > >
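P.S. A minimal usage sketch of the new helpers, just to spell out what the
const/union trick buys us (my own example, not part of the patch; it assumes
the definitions above from <linux/mm.h>):

/*
 * Minimal usage sketch (mine, not from the patch): with the union above,
 * reads of vma->vm_flags work exactly as before, while a direct write no
 * longer compiles because the named member is const, so all writers are
 * funnelled through the helpers and their mmap_assert_write_locked() check.
 */
static void adjust_flags_example(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_LOCKED)			/* plain reads unchanged */
		return;

	/* vma->vm_flags |= VM_DONTCOPY;  <- would now fail to build */
	vm_flags_set(vma, VM_DONTCOPY);			/* mmap_lock must be held for write */
	vm_flags_clear(vma, VM_MAYWRITE);
	vm_flags_mod(vma, VM_DONTDUMP, VM_MAYSHARE);	/* set and clear together */
}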
Re: [PATCH v4 5/7] mm: replace vma->vm_flags indirect modification in ksm_madvise
On Thu, Jan 26, 2023 at 11:37:50AM -0800, Suren Baghdasaryan wrote: > Replace indirect modifications to vma->vm_flags with calls to modifier > functions to be able to track flag changes and to keep vma locking > correctness. > > Signed-off-by: Suren Baghdasaryan > Acked-by: Michal Hocko > Acked-by: Mel Gorman > Acked-by: Mike Rapoport (IBM) > --- > arch/powerpc/kvm/book3s_hv_uvmem.c | 6 +- > arch/s390/mm/gmap.c| 6 +- > 2 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c > b/arch/powerpc/kvm/book3s_hv_uvmem.c > index 1d67baa5557a..709ebd578394 100644 > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c > @@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm, > { > unsigned long gfn = memslot->base_gfn; > unsigned long end, start = gfn_to_hva(kvm, gfn); > + unsigned long vm_flags; > int ret = 0; > struct vm_area_struct *vma; > int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE; > @@ -409,12 +410,15 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm, > ret = H_STATE; > break; > } > + /* Copy vm_flags to avoid partial modifications in ksm_madvise > */ > + vm_flags = vma->vm_flags; > ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, > - merge_flag, &vma->vm_flags); > + merge_flag, &vm_flags); > if (ret) { > ret = H_STATE; > break; > } > + vm_flags_reset(vma, vm_flags); > start = vma->vm_end; > } while (end > vma->vm_end); > > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c > index ab836597419d..5a716bdcba05 100644 > --- a/arch/s390/mm/gmap.c > +++ b/arch/s390/mm/gmap.c > @@ -2587,14 +2587,18 @@ int gmap_mark_unmergeable(void) > { > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma; > + unsigned long vm_flags; > int ret; > VMA_ITERATOR(vmi, mm, 0); > > for_each_vma(vmi, vma) { > + /* Copy vm_flags to avoid partial modifications in ksm_madvise > */ > + vm_flags = vma->vm_flags; > ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, > - MADV_UNMERGEABLE, &vma->vm_flags); > + MADV_UNMERGEABLE, &vm_flags); > if (ret) > return ret; > + vm_flags_reset(vma, vm_flags); > } > mm->def_flags &= ~VM_MERGEABLE; > return 0; > -- Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > 2.39.1 > >
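P.S. For anyone converting similar call sites, the shape of the fix
generalizes like this (my own sketch; legacy_madvise() is a made-up stand-in
for any helper that still takes an `unsigned long *' flags argument):

/*
 * Sketch only; legacy_madvise() is a made-up stand-in for any helper that,
 * like ksm_madvise(), still takes an `unsigned long *' flags argument.
 * Such a helper can no longer write vma->vm_flags through the pointer, so
 * the caller snapshots the flags, lets the helper edit the local copy, and
 * publishes the result with vm_flags_reset() under mmap_lock held for write.
 */
static int apply_legacy_flags(struct vm_area_struct *vma)
{
	unsigned long vm_flags = vma->vm_flags;	/* local copy avoids partial updates */
	int ret;

	ret = legacy_madvise(vma, vma->vm_start, vma->vm_end,
			     MADV_UNMERGEABLE, &vm_flags);
	if (ret)
		return ret;

	vm_flags_reset(vma, vm_flags);
	return 0;
}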
Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan wrote: > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan wrote: > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox wrote: > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > to be stable but also the tree structure and other VMAs inside that > > > > tree. > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > situation happens only on the first page fault and should not affect > > > > overall performance. > > > > > > I think I asked this before, but don't remember getting an aswer. > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > set it up at mmap time? > > > > Yeah, I remember that conversation Matthew and I could not find the > > definitive answer at the time. I'll look into that again or maybe > > someone can answer it here. > > After looking into it again I'm still under the impression that > vma->anon_vma is populated lazily (during the first page fault rather > than at mmap time) to avoid doing extra work for areas which are never > faulted. Though I might be missing some important detail here. I think this is because the kernel cannot merge VMAs that have different anon_vmas? Enabling lazy population of anon_vma could potentially increase the chances of merging VMAs. > > In the end rather than changing that logic I decided to skip > > vma->anon_vma==NULL cases because I measured them being less than > > 0.01% of all page faults, so ROI from changing that would be quite > > low. But I agree that the logic is weird and maybe we can improve > > that. I will have to review that again when I'm working on eliminating > > all these special cases we skip, like swap/userfaults/etc.
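P.S. To spell out the merging point above: the anon_vma part of the VMA
mergeability check essentially behaves like the sketch below (my paraphrase,
not the exact mm/mmap.c code), so a VMA whose anon_vma is still NULL can
merge with either neighbour, while two VMAs that already have different
anon_vmas cannot:

/*
 * Paraphrased sketch, not the exact kernel implementation: the essence of
 * the anon_vma part of the VMA mergeability check.  A NULL anon_vma acts
 * as a wildcard, which is why allocating anon_vma lazily (at first fault
 * rather than at mmap() time) keeps more neighbouring VMAs mergeable.
 */
static bool anon_vmas_mergeable_sketch(struct anon_vma *a, struct anon_vma *b)
{
	return !a || !b || a == b;
}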
Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Fri, Feb 17, 2023 at 08:13:01AM -0800, Suren Baghdasaryan wrote: > On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > > > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan > > wrote: > > > > > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan > > > wrote: > > > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox > > > > wrote: > > > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > > When vma->anon_vma is not set, page fault handler will set it by > > > > > > either > > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to > > > > > > find > > > > > > a compatible adjacent VMA and that requires not only the faulting > > > > > > VMA > > > > > > to be stable but also the tree structure and other VMAs inside that > > > > > > tree. > > > > > > Therefore locking just the faulting VMA is not enough for this > > > > > > search. > > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > > situation happens only on the first page fault and should not affect > > > > > > overall performance. > > > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > > set it up at mmap time? > > > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > > definitive answer at the time. I'll look into that again or maybe > > > > someone can answer it here. > > > > > > After looking into it again I'm still under the impression that > > > vma->anon_vma is populated lazily (during the first page fault rather > > > than at mmap time) to avoid doing extra work for areas which are never > > > faulted. Though I might be missing some important detail here. > > > > I think this is because the kernel cannot merge VMAs that have > > different anon_vmas? > > > > Enabling lazy population of anon_vma could potentially increase the > > chances of merging VMAs. > > Hmm. Do you have a clear explanation why merging chances increase this > way? A couple of possibilities I can think of would be: > 1. If after mmap'ing a VMA and before faulting the first page into it > we often change something that affects anon_vma_compatible() decision, > like vm_policy; > 2. When mmap'ing VMAs we do not map them consecutively but the final > arrangement is actually contiguous. > > Don't think either of those cases would be very representative of a > usual case but maybe I'm wrong or there is another reason? Ok. I agree it does not represent common cases. Hmm then I wonder how it went from the initial approach of "allocate anon_vma objects only via fork()" [1] to "populate anon_vma at page faults". [2] [3] Maybe Hugh, Andrea or Andrew have opinions? [1] anon_vma RFC2, lore.kernel.org https://lore.kernel.org/lkml/20040311065254.GT30940@dualathlon.random [2] The status of object-based reverse mapping, LWN.net https://lwn.net/Articles/85908 [3] rmap 39 add anon_vma rmap https://gitlab.com/hyeyoo/linux-historical/-/commit/8aa3448cabdfca146aa3fd36e852d0209fb2276a > > > > > > > In the end rather than changing that logic I decided to skip > > > > vma->anon_vma==NULL cases because I measured them being less than > > > > 0.01% of all page faults, so ROI from changing that would be quite > > > > low. But I agree that the logic is weird and maybe we can improve > > > > that. 
I will have to review that again when I'm working on eliminating > > > > all these special cases we skip, like swap/userfaults/etc. > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an > > email to kernel-team+unsubscr...@android.com. > >
Re: [PATCH v3 16/35] mm/mmap: write-lock VMAs before merging, splitting or expanding them
On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote: > Decisions about whether VMAs can be merged, split or expanded must be > made while VMAs are protected from the changes which can affect that > decision. For example, merge_vma uses vma->anon_vma in its decision Did you mean vma_merge()? > whether the VMA can be merged. Meanwhile, page fault handler changes > vma->anon_vma during COW operation. > Write-lock all VMAs which might be affected by a merge or split operation > before making decision how such operations should be performed. > It doesn't make sense (to me) to update vma->anon_vma during COW fault. AFAIK children's vma->anon_vma is allocated during fork() and page->anon_vma is updated on COW to reduce rmap walking because it's now unshared, no? As patch 26 just falls back to mmap_lock if no anon_vma is set, I think we can assume nothing updates vma->anon_vma (and its interval tree) if we are holding mmap_lock in write mode. Or am I missing something? -- Regards, Hyeonggon
Re: [PATCH v3 16/35] mm/mmap: write-lock VMAs before merging, splitting or expanding them
On Thu, Feb 23, 2023 at 02:51:27PM +, Hyeonggon Yoo wrote: > On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote: > > Decisions about whether VMAs can be merged, split or expanded must be > > made while VMAs are protected from the changes which can affect that > > decision. For example, merge_vma uses vma->anon_vma in its decision > > Did you mean vma_merge()? > > > whether the VMA can be merged. Meanwhile, page fault handler changes > > vma->anon_vma during COW operation. > > Write-lock all VMAs which might be affected by a merge or split operation > > before making decision how such operations should be performed. > > > > It doesn't make sense (to me) to update vma->anon_vma during COW fault. > > AFAIK children's vma->anon_vma is allocated during fork() and > page->anon_vma is updated on COW to reduce rmap walking because it's now > unshared, no? > > As patch 26 just falls back to mmap_lock if no anon_vma is set, > I think we can assume nothing updates vma->anon_vma (and its interval > tree) if we are holding mmap_lock in write mode. *Not interval tree, as it can be modified by other processes* > > Or am I missing something? > > -- > Regards, > Hyeonggon >
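P.S. For clarity, the fault-time population both of us are referring to
happens via anon_vma_prepare(); a simplified sketch of my understanding,
not a quote of the actual fault handlers:

/*
 * Simplified sketch of the lazy population being discussed (not a quote of
 * the real fault handlers): whichever fault path first needs an anonymous
 * page for the VMA calls anon_vma_prepare(), which allocates or reuses an
 * anon_vma only when vma->anon_vma is still NULL.
 */
static vm_fault_t first_anon_fault_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	if (unlikely(anon_vma_prepare(vma)))	/* no-op if anon_vma is already set */
		return VM_FAULT_OOM;

	/* ... allocate the new anonymous page and map it ... */
	return 0;
}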
Re: [PATCH v4 17/33] mm/mremap: write-lock VMA while remapping it to a new address range
On Mon, Feb 27, 2023 at 09:36:16AM -0800, Suren Baghdasaryan wrote: > Write-lock VMA as locked before copying it and when copy_vma produces > a new VMA. > > Signed-off-by: Suren Baghdasaryan > Reviewed-by: Laurent Dufour > --- > mm/mmap.c | 1 + > mm/mremap.c | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index e73fbb84ce12..1f42b9a52b9b 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3189,6 +3189,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct > **vmap, > get_file(new_vma->vm_file); > if (new_vma->vm_ops && new_vma->vm_ops->open) > new_vma->vm_ops->open(new_vma); > + vma_start_write(new_vma); Oh, it's to prevent handling page faults during move_page_tables(). > if (vma_link(mm, new_vma)) > goto out_vma_link; > *need_rmap_locks = false; > diff --git a/mm/mremap.c b/mm/mremap.c > index 1ddf7beb62e9..327c38eb132e 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, > return -ENOMEM; > } > > + vma_start_write(vma); > new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT); > new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, > &need_rmap_locks); > -- > 2.39.2.722.g9855ee24e9-goog Looks good to me. Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > >
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> Write-locking VMAs before isolating them ensures that page fault
> handlers don't operate on isolated VMAs.
>
> Signed-off-by: Suren Baghdasaryan
> ---
>  mm/mmap.c  | 1 +
>  mm/nommu.c | 5 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1f42b9a52b9b..f7ed357056c4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  static inline int munmap_sidetree(struct vm_area_struct *vma,
>  				   struct ma_state *mas_detach)
>  {
> +	vma_start_write(vma);
>  	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);

I may be missing something, but I have a few questions:

1) Why does a writer need to both write-lock a VMA and mark the VMA
   detached when unmapping it? Isn't it enough to just write-lock the VMA?

2) As VMAs that are going to be removed are already locked in
   vma_prepare(), I think this hunk could be dropped?

>  	if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
>  		return -ENOMEM;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a37..2ab162d773e2 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
>  			current->pid);
>  		return -ENOMEM;
>  	}
> +	vma_start_write(vma);
>  	cleanup_vma_from_mm(vma);

3) I think this hunk could be dropped, as Per-VMA lock depends on MMU anyway.

Thanks,
Hyeonggon

>
>  	/* remove from the MM's tree and list */
> @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
>  	 */
>  	mmap_write_lock(mm);
>  	for_each_vma(vmi, vma) {
> +		/*
> +		 * No need to lock VMA because this is the only mm user and no
> +		 * page fault handled can race with it.
> +		 */
>  		cleanup_vma_from_mm(vma);
>  		delete_vma(mm, vma);
>  		cond_resched();
> --
> 2.39.2.722.g9855ee24e9-goog
>
>
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote:
> On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > Write-locking VMAs before isolating them ensures that page fault
> > handlers don't operate on isolated VMAs.
> >
> > Signed-off-by: Suren Baghdasaryan
> > ---
> >  mm/mmap.c  | 1 +
> >  mm/nommu.c | 5 +
> >  2 files changed, 6 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 1f42b9a52b9b..f7ed357056c4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  static inline int munmap_sidetree(struct vm_area_struct *vma,
> >  				   struct ma_state *mas_detach)
> >  {
> > +	vma_start_write(vma);
> >  	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
>
> I may be missing something, but I have a few questions:
>
> 1) Why does a writer need to both write-lock a VMA and mark the VMA
>    detached when unmapping it? Isn't it enough to just write-lock the VMA?
>
> 2) As VMAs that are going to be removed are already locked in
>    vma_prepare(), I think this hunk could be dropped?

After sending this I realized that I did not consider the simple munmap
case :) But I still think 1) and 3) are valid questions.

> >  	if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> >  		return -ENOMEM;
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 57ba243c6a37..2ab162d773e2 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> >  			current->pid);
> >  		return -ENOMEM;
> >  	}
> > +	vma_start_write(vma);
> >  	cleanup_vma_from_mm(vma);
>
> 3) I think this hunk could be dropped, as Per-VMA lock depends on MMU anyway.
>
> Thanks,
> Hyeonggon
>
> >
> >  	/* remove from the MM's tree and list */
> > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> >  	 */
> >  	mmap_write_lock(mm);
> >  	for_each_vma(vmi, vma) {
> > +		/*
> > +		 * No need to lock VMA because this is the only mm user and no
> > +		 * page fault handled can race with it.
> > +		 */
> >  		cleanup_vma_from_mm(vma);
> >  		delete_vma(mm, vma);
> >  		cond_resched();
> > --
> > 2.39.2.722.g9855ee24e9-goog
> >
> >
>
Re: [PATCH v4 24/33] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Mon, Feb 27, 2023 at 09:36:23AM -0800, Suren Baghdasaryan wrote: > When vma->anon_vma is not set, page fault handler will set it by either > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > a compatible adjacent VMA and that requires not only the faulting VMA > to be stable but also the tree structure and other VMAs inside that tree. > Therefore locking just the faulting VMA is not enough for this search. > Fall back to taking mmap_lock when vma->anon_vma is not set. This > situation happens only on the first page fault and should not affect > overall performance. > > Signed-off-by: Suren Baghdasaryan > --- > mm/memory.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index bda4c1a991f0..8855846a361b 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5243,6 +5243,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct > mm_struct *mm, > if (!vma_is_anonymous(vma)) > goto inval; > > + /* find_mergeable_anon_vma uses adjacent vmas which are not locked */ > + if (!vma->anon_vma) > + goto inval; > + > if (!vma_start_read(vma)) > goto inval; Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > -- > 2.39.2.722.g9855ee24e9-goog > >
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Wed, Mar 01, 2023 at 10:42:48AM -0800, Suren Baghdasaryan wrote:
> On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan wrote:
> >
> > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> > >
> > > On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote:
> > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > > Write-locking VMAs before isolating them ensures that page fault
> > > > > handlers don't operate on isolated VMAs.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan
> > > > > ---
> > > > >  mm/mmap.c  | 1 +
> > > > >  mm/nommu.c | 5 +
> > > > >  2 files changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > >  static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > >  				   struct ma_state *mas_detach)
> > > > >  {
> > > > > +	vma_start_write(vma);
> > > > >  	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > > >
> > > > I may be missing something, but I have a few questions:
> > > >
> > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA
> > > >    detached when unmapping it? Isn't it enough to just write-lock the VMA?
> >
> > We need to mark the VMA detached to avoid handling page fault in a
> > detached VMA. The possible scenario is:
> >
> > lock_vma_under_rcu
> >   vma = mas_walk(&mas)
> >                                        munmap_sidetree
> >
> >                                          vma_start_write(vma)
> >
> >                                          mas_store_gfp() // remove VMA from the tree
> >
> >                                          vma_end_write_all()
> >   vma_start_read(vma)
> >   // we locked the VMA but it is not part of the tree anymore.
> >
> > So, marking the VMA locked before vma_end_write_all() and checking
>
> Sorry, I should have said "marking the VMA *detached* before
> vma_end_write_all() and checking vma->detached after vma_start_read()
> helps us avoid handling faults in the detached VMA."
>
> > vma->detached after vma_start_read() helps us avoid handling faults in
> > the detached VMA.

Thank you for the explanation, that makes sense!

By the way, if there are no 32-bit users of Per-VMA lock (are there?), the
"detached" bool could be a VMA flag (i.e. by making it depend on 64BIT and
selecting ARCH_USES_HIGH_VMA_FLAGS).

Thanks,
Hyeonggon
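P.S. Putting the scenario above into code form helped me; here is a condensed
sketch of the fault-side sequence as I understand it (the retry logic,
anon_vma check and vm_start/vm_end recheck are omitted, so this may not match
the final lock_vma_under_rcu() exactly):

/*
 * Condensed sketch of the fault-side path being discussed; simplified and
 * not a verbatim copy of the series.
 */
static struct vm_area_struct *lock_vma_under_rcu_sketch(struct mm_struct *mm,
							 unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma || !vma_start_read(vma))
		goto inval;

	/*
	 * The VMA may have been write-locked, removed from the tree and
	 * unlocked between mas_walk() and vma_start_read(); the detached
	 * marker set in munmap_sidetree() is what lets us notice that here.
	 */
	if (vma->detached) {
		vma_end_read(vma);
		goto inval;
	}

	rcu_read_unlock();
	return vma;
inval:
	rcu_read_unlock();
	return NULL;
}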
Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote: > diff --git a/include/linux/mm.h b/include/linux/mm.h > index d40bf8a5e19e..294dd44b2198 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct > *vma) >* mm->mm_lock_seq can't be concurrently modified. >*/ > mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq); > - if (vma->vm_lock_seq == mm_lock_seq) > + if (vma->vm_lock->lock_seq == mm_lock_seq) > return; > > - down_write(&vma->vm_lock->lock); > - vma->vm_lock_seq = mm_lock_seq; > - up_write(&vma->vm_lock->lock); > + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1)) > + wait_event(vma->vm_mm->vma_writer_wait, > +atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0); > + vma->vm_lock->lock_seq = mm_lock_seq; > + /* Write barrier to ensure lock_seq change is visible before count */ > + smp_wmb(); > + atomic_set(&vma->vm_lock->count, 0); > } > > /* > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct > *vma) > static inline bool vma_read_trylock(struct vm_area_struct *vma) > { > /* Check before locking. A race might cause false locked result. */ > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) > return false; > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0)) > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count))) > return false; > > + /* If atomic_t overflows, restore and fail to lock. */ > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) { > + if (atomic_dec_and_test(&vma->vm_lock->count)) > + wake_up(&vma->vm_mm->vma_writer_wait); > + return false; > + } > + > /* >* Overflow might produce false locked result. >* False unlocked result is impossible because we modify and check >* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq >* modification invalidates all existing locks. >*/ > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) { > - up_read(&vma->vm_lock->lock); > + if (unlikely(vma->vm_lock->lock_seq == > READ_ONCE(vma->vm_mm->mm_lock_seq))) { > + if (atomic_dec_and_test(&vma->vm_lock->count)) > + wake_up(&vma->vm_mm->vma_writer_wait); > return false; > } With this change, readers can cause writers to starve. What about checking waitqueue_active() before or after increasing vma->vm_lock->count? -- Thanks, Hyeonggon
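To be concrete, something along these lines is what I mean (purely
illustrative and untested; the memory-ordering requirements around
waitqueue_active() would need a closer look):

/*
 * Illustrative sketch of the suggestion, not a tested patch: refuse the
 * read-trylock when a writer is already queued, so a steady stream of
 * readers cannot keep vm_lock->count elevated forever.
 */
static inline bool vma_read_trylock_sketch(struct vm_area_struct *vma)
{
	/* Give way to a writer that is already waiting on vma_writer_wait. */
	if (waitqueue_active(&vma->vm_mm->vma_writer_wait))
		return false;

	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
		return false;

	/* ... the rest as in the patch quoted above ... */
	return true;
}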