Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-23 Thread Hyeonggon Yoo
On Tue, May 23, 2023 at 6:12 PM Vlastimil Babka  wrote:
>
> As discussed at LSF/MM [1] [2] and with no objections raised there,
> deprecate the SLAB allocator. Rename the user-visible option so that
> users with CONFIG_SLAB=y get a new prompt with explanation during make
> oldconfig, while make olddefconfig will just switch to SLUB.
>
> In all defconfigs with CONFIG_SLAB=y remove the line so those also
> switch to SLUB. Regressions due to the switch should be reported to
> linux-mm and slab maintainers.
>
> [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
> [2] https://lwn.net/Articles/932201/
>
> Signed-off-by: Vlastimil Babka 
> ---



> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..b537c4436d18 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -218,11 +218,18 @@ choice
> help
>This option allows to select a slab allocator.
>
> -config SLAB
> -   bool "SLAB"
> +config SLAB_DEPRECATED
> +   bool "SLAB (DEPRECATED)"
> depends on !PREEMPT_RT
> select HAVE_HARDENED_USERCOPY_ALLOCATOR
> help
> + Deprecated and scheduled for removal in a few cycles. Replaced by
> + SLUB.
> +
> + If you cannot migrate to SLUB, please contact linux...@kvack.org
> + and the people listed in the SLAB ALLOCATOR section of MAINTAINERS
> + file, explaining why.
> +
>   The regular slab allocator that is established and known to work
>   well in all environments. It organizes cache hot objects in
>   per cpu and per node queues.
> @@ -240,6 +247,11 @@ config SLUB
>
>  endchoice
>
> +config SLAB
> +   bool
> +   default y
> +   depends on SLAB_DEPRECATED
> +
>  config SLUB_TINY
> bool "Configure SLUB for minimal memory footprint"
> depends on SLUB && EXPERT
> --
> 2.40.1

Thank you for the work!
It looks good to me.
Let's see if any users raise their voices.
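
One detail I like: because the patch keeps a hidden CONFIG_SLAB symbol that is
default y and depends on SLAB_DEPRECATED, the rest of the tree can keep
referencing CONFIG_SLAB unchanged while only the user-visible prompt is renamed.
Roughly (illustrative lines, not hunks from this patch):

	# Makefiles keep picking the right object:
	obj-$(CONFIG_SLAB) += slab.o

	/* and C code keeps testing the same symbol: */
	#ifdef CONFIG_SLAB
	...
	#endif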

Acked-by: Hyeonggon Yoo <42.hye...@gmail.com>


Re: [PATCH v4 4/7] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-31 Thread Hyeonggon Yoo
On Thu, Jan 26, 2023 at 11:37:49AM -0800, Suren Baghdasaryan wrote:
> Replace direct modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness.
> 
> Signed-off-by: Suren Baghdasaryan 
> Acked-by: Michal Hocko 
> Acked-by: Mel Gorman 
> Acked-by: Mike Rapoport (IBM) 
> Acked-by: Sebastian Reichel 
> ---
>  arch/arm/kernel/process.c  |  2 +-
>  arch/ia64/mm/init.c|  8 
>  arch/loongarch/include/asm/tlb.h   |  2 +-
>  arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
>  arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
>  arch/powerpc/platforms/book3s/vas-api.c|  2 +-
>  arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
>  arch/s390/mm/gmap.c|  3 +--
>  arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
>  arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
>  arch/x86/kernel/cpu/sgx/virt.c |  2 +-
>  arch/x86/mm/pat/memtype.c  |  6 +++---
>  arch/x86/um/mem_32.c   |  2 +-
>  drivers/acpi/pfr_telemetry.c   |  2 +-
>  drivers/android/binder.c   |  3 +--
>  drivers/char/mspec.c   |  2 +-
>  drivers/crypto/hisilicon/qm.c  |  2 +-
>  drivers/dax/device.c   |  2 +-
>  drivers/dma/idxd/cdev.c|  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
>  drivers/gpu/drm/drm_gem.c  |  2 +-
>  drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
>  drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
>  drivers/gpu/drm/drm_vm.c   |  8 
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
>  drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
>  drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
>  drivers/gpu/drm/i810/i810_dma.c|  2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
>  drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
>  drivers/gpu/drm/msm/msm_gem.c  |  2 +-
>  drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
>  drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
>  drivers/gpu/drm/tegra/gem.c|  5 ++---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
>  drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
>  drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
>  drivers/hsi/clients/cmt_speech.c   |  2 +-
>  drivers/hwtracing/intel_th/msu.c   |  2 +-
>  drivers/hwtracing/stm/core.c   |  2 +-
>  drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
>  drivers/infiniband/hw/mlx5/main.c  |  4 ++--
>  drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
>  .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
>  drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
>  drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
>  drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
>  drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
>  drivers/misc/cxl/context.c |  2 +-
>  drivers/misc/habanalabs/common/memory.c|  2 +-
>  drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
>  drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
>  drivers/misc/habanalabs/goya/goya.c|  4 ++--
>  drivers/misc/ocxl/context.c|  4 ++--
>  drivers/misc/ocxl/sysfs.c  |  2 +-
>  drivers/misc/open-dice.c   |  4 ++--
>  drivers/misc/sgi-gru/grufile.c |  4 ++--
>  drivers/misc/uacce/uacce.c |  2 +-
>  drivers/sbus/char/oradax.c |  2 +-
>  drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
>  drivers/scsi/sg.c  |  2 +-
>  drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
>  drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
>  .../media/deprecated/stkwebcam/stk-webcam.c|  2 +-
>  drivers/target/target_core_user.c  |  2 +-
>  drivers/uio/uio.c  |  2 +-
>  dr

Re: [PATCH v4 4/7] mm: replace vma->vm_flags direct modifications with modifier calls

2023-02-01 Thread Hyeonggon Yoo
On Tue, Jan 31, 2023 at 10:54:22AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 31, 2023 at 12:32 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> >
> > On Thu, Jan 26, 2023 at 11:37:49AM -0800, Suren Baghdasaryan wrote:
> > > Replace direct modifications to vma->vm_flags with calls to modifier
> > > functions to be able to track flag changes and to keep vma locking
> > > correctness.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > Acked-by: Michal Hocko 
> > > Acked-by: Mel Gorman 
> > > Acked-by: Mike Rapoport (IBM) 
> > > Acked-by: Sebastian Reichel 
> > > ---
> > >  arch/arm/kernel/process.c  |  2 +-
> > >  120 files changed, 188 insertions(+), 199 deletions(-)
> > >
> >
> > Hello Suren,
> 
> Hi Hyeonggon,
> 
> >
> > [...]
> >
> > Whoa, it's so long.
> > Mostly looks fine, but there are two things I'm not sure about:
> >
> > > diff --git a/drivers/misc/open-dice.c b/drivers/misc/open-dice.c
> > > index 9dda47b3fd70..7be4e6c9f120 100644
> > > --- a/drivers/misc/open-dice.c
> > > +++ b/drivers/misc/open-dice.c
> > > @@ -95,12 +95,12 @@ static int open_dice_mmap(struct file *filp, struct 
> > > vm_area_struct *vma)
> > >   if (vma->vm_flags & VM_WRITE)
> > >   return -EPERM;
> > >   /* Ensure userspace cannot acquire VM_WRITE later. */
> > > - vma->vm_flags &= ~VM_MAYWRITE;
> > > + vm_flags_clear(vma, VM_MAYSHARE);
> > >   }
> >
> > I think it should be:
> > s/VM_MAYSHARE/VM_MAYWRITE/
> 
> Good eye! Yes, this is definitely a bug. Will post a next version with this 
> fix.
> 
> >
> > > diff --git a/mm/mlock.c b/mm/mlock.c
> > > index 5c4fff93cd6b..ed49459e343e 100644
> > > --- a/mm/mlock.c
> > > +++ b/mm/mlock.c
> > > @@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct 
> > > vm_area_struct *vma,
> > >*/
> > >   if (newflags & VM_LOCKED)
> > >   newflags |= VM_IO;
> > > - WRITE_ONCE(vma->vm_flags, newflags);
> > > + vm_flags_reset(vma, newflags);
> > >
> > >   lru_add_drain();
> > >   walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
> > > @@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct 
> > > vm_area_struct *vma,
> > >
> > >   if (newflags & VM_IO) {
> > >   newflags &= ~VM_IO;
> > > - WRITE_ONCE(vma->vm_flags, newflags);
> > > + vm_flags_reset(vma, newflags);
> > >   }
> > >  }
> >
> > wondering if the comment above is still true?
> >
> > /*
> >  * There is a slight chance that concurrent page migration,
> >  * or page reclaim finding a page of this now-VM_LOCKED vma,
> >  * will call mlock_vma_folio() and raise page's mlock_count:
> >  * double counting, leaving the page unevictable indefinitely.
> >  * Communicate this danger to mlock_vma_folio() with VM_IO,
> >  * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
> >  * mmap_lock is held in write mode here, so this weird
> >  * combination should not be visible to other mmap_lock users;
> >  * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED.
> >  */
> >
> > does ACCESS_PRIVATE() still guarantee that the compiler cannot mysteriously
> > optimize writes, the way WRITE_ONCE() does?
> 
> I don't see ACCESS_PRIVATE() providing the same guarantees as
> WRITE_ONCE(), therefore I think this also needs to be changed. I'll
> need to introduce something like vm_flags_reset_once() and use it
> here. vm_flags_reset_once() would do WRITE_ONCE() and otherwise would
> be identical to vm_flags_reset().
>
> I'll post a new version with the fixes later today.
>
> Thanks for the review!
> Suren.

Thanks for the quick reply!

Andrew's fix and the new patch look good to me.
With these two things addressed:

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>


Regards,
Hyeonggon


Re: [PATCH 1/1] mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates

2023-02-07 Thread Hyeonggon Yoo
On Tue, Jan 31, 2023 at 04:01:16PM -0800, Suren Baghdasaryan wrote:
> Provide vm_flags_reset_once() and replace the vm_flags updates which used
> WRITE_ONCE() to prevent compiler optimizations.
> 
> Fixes: 0cce31a0aa0e ("mm: replace vma->vm_flags direct modifications with 
> modifier calls")
> Reported-by: Hyeonggon Yoo <42.hye...@gmail.com>
> Signed-off-by: Suren Baghdasaryan 
> ---
> Notes:
> - The patch applies cleanly over mm-unstable
> - The SHA in Fixes: line is from mm-unstable, so is... unstable
> 
>  include/linux/mm.h | 7 +++
>  mm/mlock.c | 4 ++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5bf0ad48faaa..23ce04f6e91e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -648,6 +648,13 @@ static inline void vm_flags_reset(struct vm_area_struct 
> *vma,
>   vm_flags_init(vma, flags);
>  }
>  
> +static inline void vm_flags_reset_once(struct vm_area_struct *vma,
> +vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
> +}
> +
>  static inline void vm_flags_set(struct vm_area_struct *vma,
>   vm_flags_t flags)
>  {
> diff --git a/mm/mlock.c b/mm/mlock.c
> index ed49459e343e..617469fce96d 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct 
> *vma,
>*/
>   if (newflags & VM_LOCKED)
>   newflags |= VM_IO;
> - vm_flags_reset(vma, newflags);
> + vm_flags_reset_once(vma, newflags);
>  
>   lru_add_drain();
>   walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
> @@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct vm_area_struct 
> *vma,
>  
>   if (newflags & VM_IO) {
>   newflags &= ~VM_IO;
> - vm_flags_reset(vma, newflags);
> + vm_flags_reset_once(vma, newflags);
>   }
>  }

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

>  
> -- 
> 2.39.1.456.gfc5497dd1b-goog
> 


Re: [PATCH v4 2/7] mm: introduce vma->vm_flags wrapper functions

2023-02-07 Thread Hyeonggon Yoo
On Thu, Jan 26, 2023 at 11:37:47AM -0800, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 
> Reviewed-by: Davidlohr Bueso 
> Acked-by: Michal Hocko 
> Acked-by: Mel Gorman 
> ---
>  include/linux/mm.h   | 40 
>  include/linux/mm_types.h | 10 +-
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8d636e895ee9..abb31103d060 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,46 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void vm_flags_init(struct vm_area_struct *vma,
> +  vm_flags_t flags)
> +{
> + ACCESS_PRIVATE(vma, __vm_flags) = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination 
> */
> +static inline void vm_flags_reset(struct vm_area_struct *vma,
> +   vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vm_flags_init(vma, flags);
> +}
> +
> +static inline void vm_flags_set(struct vm_area_struct *vma,
> + vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + ACCESS_PRIVATE(vma, __vm_flags) |= flags;
> +}
> +
> +static inline void vm_flags_clear(struct vm_area_struct *vma,
> +   vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
> +}
> +
> +/*
> + * Use only when the order of set/clear operations is unimportant, otherwise
> + * use vm_flags_{set|clear} explicitly.
> + */
> +static inline void vm_flags_mod(struct vm_area_struct *vma,
> + vm_flags_t set, vm_flags_t clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vm_flags_init(vma, (vma->vm_flags | set) & ~clear);
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..da983aedb741 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,15 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * To modify use vm_flags_{init|reset|set|clear|mod} functions.
> +  */
> + union {
> + const vm_flags_t vm_flags;
> + vm_flags_t __private __vm_flags;
> + };
>  
>   /*
>* For areas with an address space and backing store,
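
The conversions in the rest of the series then become mechanical; an
illustrative before/after (not an actual hunk from the series):

	/* before: open-coded modification, nothing asserts the locking rules */
	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_flags &= ~VM_MAYWRITE;

	/* after: modifier calls that also do mmap_assert_write_locked() */
	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
	vm_flags_clear(vma, VM_MAYWRITE);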

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

> -- 
> 2.39.1
> 
> 


Re: [PATCH v4 5/7] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-02-07 Thread Hyeonggon Yoo
On Thu, Jan 26, 2023 at 11:37:50AM -0800, Suren Baghdasaryan wrote:
> Replace indirect modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness.
> 
> Signed-off-by: Suren Baghdasaryan 
> Acked-by: Michal Hocko 
> Acked-by: Mel Gorman 
> Acked-by: Mike Rapoport (IBM) 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 +-
>  arch/s390/mm/gmap.c| 6 +-
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 1d67baa5557a..709ebd578394 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
>  {
>   unsigned long gfn = memslot->base_gfn;
>   unsigned long end, start = gfn_to_hva(kvm, gfn);
> + unsigned long vm_flags;
>   int ret = 0;
>   struct vm_area_struct *vma;
>   int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> @@ -409,12 +410,15 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
>   ret = H_STATE;
>   break;
>   }
> + /* Copy vm_flags to avoid partial modifications in ksm_madvise 
> */
> + vm_flags = vma->vm_flags;
>   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> -   merge_flag, &vma->vm_flags);
> +   merge_flag, &vm_flags);
>   if (ret) {
>   ret = H_STATE;
>   break;
>   }
> + vm_flags_reset(vma, vm_flags);
>   start = vma->vm_end;
>   } while (end > vma->vm_end);
>  
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index ab836597419d..5a716bdcba05 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2587,14 +2587,18 @@ int gmap_mark_unmergeable(void)
>  {
>   struct mm_struct *mm = current->mm;
>   struct vm_area_struct *vma;
> + unsigned long vm_flags;
>   int ret;
>   VMA_ITERATOR(vmi, mm, 0);
>  
>   for_each_vma(vmi, vma) {
> + /* Copy vm_flags to avoid partial modifications in ksm_madvise 
> */
> + vm_flags = vma->vm_flags;
>   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> -   MADV_UNMERGEABLE, &vma->vm_flags);
> +       MADV_UNMERGEABLE, &vm_flags);
>   if (ret)
>   return ret;
> + vm_flags_reset(vma, vm_flags);
>   }
>   mm->def_flags &= ~VM_MERGEABLE;
>   return 0;
> -- 

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

> 2.39.1
> 
> 


Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set

2023-02-17 Thread Hyeonggon Yoo
On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan  wrote:
>
> On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan  wrote:
> >
> > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox  wrote:
> > >
> > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > to be stable but also the tree structure and other VMAs inside that 
> > > > tree.
> > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > situation happens only on the first page fault and should not affect
> > > > overall performance.
> > >
> > > I think I asked this before, but don't remember getting an answer.
> > > Why do we defer setting anon_vma to the first fault?  Why don't we
> > > set it up at mmap time?
> >
> > Yeah, I remember that conversation Matthew and I could not find the
> > definitive answer at the time. I'll look into that again or maybe
> > someone can answer it here.
>
> After looking into it again I'm still under the impression that
> vma->anon_vma is populated lazily (during the first page fault rather
> than at mmap time) to avoid doing extra work for areas which are never
> faulted. Though I might be missing some important detail here.

I think this is because the kernel cannot merge VMAs that have
different anon_vmas?

Enabling lazy population of anon_vma could potentially increase the
chances of merging VMAs.

> > In the end rather than changing that logic I decided to skip
> > vma->anon_vma==NULL cases because I measured them being less than
> > 0.01% of all page faults, so ROI from changing that would be quite
> > low. But I agree that the logic is weird and maybe we can improve
> > that. I will have to review that again when I'm working on eliminating
> > all these special cases we skip, like swap/userfaults/etc.


Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set

2023-02-17 Thread Hyeonggon Yoo
On Fri, Feb 17, 2023 at 08:13:01AM -0800, Suren Baghdasaryan wrote:
> On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> >
> > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > > > When vma->anon_vma is not set, page fault handler will set it by 
> > > > > > either
> > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to 
> > > > > > find
> > > > > > a compatible adjacent VMA and that requires not only the faulting 
> > > > > > VMA
> > > > > > to be stable but also the tree structure and other VMAs inside that 
> > > > > > tree.
> > > > > > Therefore locking just the faulting VMA is not enough for this 
> > > > > > search.
> > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > > > situation happens only on the first page fault and should not affect
> > > > > > overall performance.
> > > > >
> > > > > I think I asked this before, but don't remember getting an answer.
> > > > > Why do we defer setting anon_vma to the first fault?  Why don't we
> > > > > set it up at mmap time?
> > > >
> > > > Yeah, I remember that conversation Matthew and I could not find the
> > > > definitive answer at the time. I'll look into that again or maybe
> > > > someone can answer it here.
> > >
> > > After looking into it again I'm still under the impression that
> > > vma->anon_vma is populated lazily (during the first page fault rather
> > > than at mmap time) to avoid doing extra work for areas which are never
> > > faulted. Though I might be missing some important detail here.
> >
> > I think this is because the kernel cannot merge VMAs that have
> > different anon_vmas?
> >
> > Enabling lazy population of anon_vma could potentially increase the
> > chances of merging VMAs.
> 
> Hmm. Do you have a clear explanation why merging chances increase this
> way? A couple of possibilities I can think of would be:
> 1. If after mmap'ing a VMA and before faulting the first page into it
> we often change something that affects anon_vma_compatible() decision,
> like vm_policy;
> 2. When mmap'ing VMAs we do not map them consecutively but the final
> arrangement is actually contiguous.
> 
> Don't think either of those cases would be very representative of a
> usual case but maybe I'm wrong or there is another reason?

Ok. I agree it does not represent common cases.

Hmm, then I wonder how it went from the initial approach of "allocate anon_vma
objects only via fork()" [1] to "populate anon_vma at page faults" [2] [3].

Maybe Hugh, Andrea or Andrew have opinions?

[1] anon_vma RFC2, lore.kernel.org
https://lore.kernel.org/lkml/20040311065254.GT30940@dualathlon.random

[2] The status of object-based reverse mapping, LWN.net
https://lwn.net/Articles/85908

[3] rmap 39 add anon_vma rmap
https://gitlab.com/hyeyoo/linux-historical/-/commit/8aa3448cabdfca146aa3fd36e852d0209fb2276a

> 
> >
> > > > In the end rather than changing that logic I decided to skip
> > > > vma->anon_vma==NULL cases because I measured them being less than
> > > > 0.01% of all page faults, so ROI from changing that would be quite
> > > > low. But I agree that the logic is weird and maybe we can improve
> > > > that. I will have to review that again when I'm working on eliminating
> > > > all these special cases we skip, like swap/userfaults/etc.
> >


Re: [PATCH v3 16/35] mm/mmap: write-lock VMAs before merging, splitting or expanding them

2023-02-23 Thread Hyeonggon Yoo
On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote:
> Decisions about whether VMAs can be merged, split or expanded must be
> made while VMAs are protected from the changes which can affect that
> decision. For example, merge_vma uses vma->anon_vma in its decision

Did you mean vma_merge()?

> whether the VMA can be merged. Meanwhile, page fault handler changes
> vma->anon_vma during COW operation.
> Write-lock all VMAs which might be affected by a merge or split operation
> before making decision how such operations should be performed.
>

It doesn't make sense (to me) to update vma->anon_vma during COW fault.

AFAIK children's vma->anon_vma is allocated during fork() and
page->anon_vma is updated on COW to reduce rmap walking because it's now
unshared, no?

As patch 26 just falls back to mmap_lock if no anon_vma is set,
I think we can assume nothing updates vma->anon_vma (and its interval
tree) if we are holding mmap_lock in write mode.

Or am I missing something?

--
Regards,
Hyeonggon


Re: [PATCH v3 16/35] mm/mmap: write-lock VMAs before merging, splitting or expanding them

2023-02-23 Thread Hyeonggon Yoo
On Thu, Feb 23, 2023 at 02:51:27PM +, Hyeonggon Yoo wrote:
> On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote:
> > Decisions about whether VMAs can be merged, split or expanded must be
> > made while VMAs are protected from the changes which can affect that
> > decision. For example, merge_vma uses vma->anon_vma in its decision
> 
> Did you mean vma_merge()?
> 
> > whether the VMA can be merged. Meanwhile, page fault handler changes
> > vma->anon_vma during COW operation.
> > Write-lock all VMAs which might be affected by a merge or split operation
> > before making decision how such operations should be performed.
> >
> 
> It doesn't make sense (to me) to update vma->anon_vma during COW fault.
> 
> AFAIK children's vma->anon_vma is allocated during fork() and
> page->anon_vma is updated on COW to reduce rmap walking because it's now
> unshared, no?
> 
> As patch 26 just falls back to mmap_lock if no anon_vma is set,
> I think we can assume nothing updates vma->anon_vma (and its interval
> tree) if we are holding mmap_lock in write mode.

*Not interval tree, as it can be modified by other processes*

> 
> Or am I missing something?
> 
> --
> Regards,
> Hyeonggon
> 


Re: [PATCH v4 17/33] mm/mremap: write-lock VMA while remapping it to a new address range

2023-02-28 Thread Hyeonggon Yoo
On Mon, Feb 27, 2023 at 09:36:16AM -0800, Suren Baghdasaryan wrote:
> Write-lock VMA as locked before copying it and when copy_vma produces
> a new VMA.
> 
> Signed-off-by: Suren Baghdasaryan 
> Reviewed-by: Laurent Dufour 
> ---
>  mm/mmap.c   | 1 +
>  mm/mremap.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e73fbb84ce12..1f42b9a52b9b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3189,6 +3189,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct 
> **vmap,
>   get_file(new_vma->vm_file);
>   if (new_vma->vm_ops && new_vma->vm_ops->open)
>   new_vma->vm_ops->open(new_vma);
> + vma_start_write(new_vma);

Oh, it's to prevent handling page faults during move_page_tables().


>   if (vma_link(mm, new_vma))
>   goto out_vma_link;
>   *need_rmap_locks = false;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1ddf7beb62e9..327c38eb132e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>   return -ENOMEM;
>   }
>  
> + vma_start_write(vma);
>   new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
>   new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
>  &need_rmap_locks);
> -- 
> 2.39.2.722.g9855ee24e9-goog

Looks good to me.

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

> 
> 


Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

2023-02-28 Thread Hyeonggon Yoo
On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> Write-locking VMAs before isolating them ensures that page fault
> handlers don't operate on isolated VMAs.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  mm/mmap.c  | 1 +
>  mm/nommu.c | 5 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1f42b9a52b9b..f7ed357056c4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct 
> vm_area_struct *vma,
>  static inline int munmap_sidetree(struct vm_area_struct *vma,
>  struct ma_state *mas_detach)
>  {
> + vma_start_write(vma);
>   mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);

I may be missing something, but I have a few questions:

  1) Why does a writer need to both write-lock a VMA and mark the VMA
     detached when unmapping it? Isn't it enough to just write-lock the VMA?

  2) As VMAs that are going to be removed are already locked in
     vma_prepare(), I think this hunk could be dropped?

>   if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
>   return -ENOMEM;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a37..2ab162d773e2 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
>  current->pid);
>   return -ENOMEM;
>   }
> + vma_start_write(vma);
>   cleanup_vma_from_mm(vma);

  3) I think this hunk could be dropped, as the per-VMA lock depends on MMU
     anyway.

Thanks,
Hyeonggon

>  
>   /* remove from the MM's tree and list */
> @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
>*/
>   mmap_write_lock(mm);
>   for_each_vma(vmi, vma) {
> + /*
> +  * No need to lock VMA because this is the only mm user and no
> +  * page fault handled can race with it.
> +  */
>   cleanup_vma_from_mm(vma);
>   delete_vma(mm, vma);
>   cond_resched();
> -- 
> 2.39.2.722.g9855ee24e9-goog
> 
> 


Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

2023-02-28 Thread Hyeonggon Yoo
On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote:
> On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > Write-locking VMAs before isolating them ensures that page fault
> > handlers don't operate on isolated VMAs.
> > 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  mm/mmap.c  | 1 +
> >  mm/nommu.c | 5 +
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 1f42b9a52b9b..f7ed357056c4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct 
> > vm_area_struct *vma,
> >  static inline int munmap_sidetree(struct vm_area_struct *vma,
> >struct ma_state *mas_detach)
> >  {
> > +   vma_start_write(vma);
> > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> 
> I may be missing something, but have few questions:
> 
>   1) Why does a writer need to both write-lock a VMA and mark the VMA 
> detached
>  when unmapping it, isn't it enough to just only write-lock a VMA?
> 
>   2) as VMAs that are going to be removed are already locked in 
> vma_prepare(),
>  so I think this hunk could be dropped?

After sending this I just realized that I did not consider the simple munmap case :)
But I still think 1) and 3) are valid questions.

> 
> > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > return -ENOMEM;
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 57ba243c6a37..2ab162d773e2 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct 
> > *vma)
> >current->pid);
> > return -ENOMEM;
> > }
> > +   vma_start_write(vma);
> > cleanup_vma_from_mm(vma);
> 
>   3) I think this hunk could be dropped as Per-VMA lock depends on MMU 
> anyway.
> 
> Thanks,
> Hyeonggon
> 
> >  
> > /* remove from the MM's tree and list */
> > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> >  */
> > mmap_write_lock(mm);
> > for_each_vma(vmi, vma) {
> > +   /*
> > +* No need to lock VMA because this is the only mm user and no
> > +* page fault handled can race with it.
> > +*/
> > cleanup_vma_from_mm(vma);
> > delete_vma(mm, vma);
> > cond_resched();
> > -- 
> > 2.39.2.722.g9855ee24e9-goog
> > 
> > 
> 


Re: [PATCH v4 24/33] mm: fall back to mmap_lock if vma->anon_vma is not yet set

2023-03-01 Thread Hyeonggon Yoo
On Mon, Feb 27, 2023 at 09:36:23AM -0800, Suren Baghdasaryan wrote:
> When vma->anon_vma is not set, page fault handler will set it by either
> reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> a compatible adjacent VMA and that requires not only the faulting VMA
> to be stable but also the tree structure and other VMAs inside that tree.
> Therefore locking just the faulting VMA is not enough for this search.
> Fall back to taking mmap_lock when vma->anon_vma is not set. This
> situation happens only on the first page fault and should not affect
> overall performance.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  mm/memory.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index bda4c1a991f0..8855846a361b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5243,6 +5243,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> mm_struct *mm,
>   if (!vma_is_anonymous(vma))
>   goto inval;
>  
> + /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> + if (!vma->anon_vma)
> + goto inval;
> +
>   if (!vma_start_read(vma))
>   goto inval;

Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

> -- 
> 2.39.2.722.g9855ee24e9-goog
> 
> 


Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

2023-03-01 Thread Hyeonggon Yoo
On Wed, Mar 01, 2023 at 10:42:48AM -0800, Suren Baghdasaryan wrote:
> On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan  wrote:
> >
> > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> > >
> > > On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote:
> > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > > Write-locking VMAs before isolating them ensures that page fault
> > > > > handlers don't operate on isolated VMAs.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan 
> > > > > ---
> > > > >  mm/mmap.c  | 1 +
> > > > >  mm/nommu.c | 5 +
> > > > >  2 files changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct 
> > > > > vm_area_struct *vma,
> > > > >  static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > >struct ma_state *mas_detach)
> > > > >  {
> > > > > +   vma_start_write(vma);
> > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > > >
> > > > I may be missing something, but have few questions:
> > > >
> > > >   1) Why does a writer need to both write-lock a VMA and mark the 
> > > > VMA detached
> > > >  when unmapping it, isn't it enough to just only write-lock a 
> > > > VMA?
> >
> > We need to mark the VMA detached to avoid handling page fault in a
> > detached VMA. The possible scenario is:
> >
> > lock_vma_under_rcu
> >   vma = mas_walk(&mas)
> >                              munmap_sidetree
> >                                vma_start_write(vma)
> >                                mas_store_gfp() // remove VMA from the tree
> >                                vma_end_write_all()
> >   vma_start_read(vma)
> >   // we locked the VMA but it is not part of the tree anymore.
> >
> > So, marking the VMA locked before vma_end_write_all() and checking
> 
> Sorry, I should have said "marking the VMA *detached* before
> vma_end_write_all() and checking vma->detached after vma_start_read()
> helps us avoid handling faults in the detached VMA."
> 
> > vma->detached after vma_start_read() helps us avoid handling faults in
> > the detached VMA.

Thank you for the explanation, that makes sense!
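
(Restating the reader side in code form for my own notes -- a rough sketch of
how I understand it, not the exact code in the series:)

	vma = mas_walk(&mas);
	...
	if (!vma_start_read(vma))
		goto inval;
	/*
	 * The VMA might have been write-locked, removed from the tree and
	 * unlocked after mas_walk() above, so re-check that it is still
	 * attached before handling the fault under the per-VMA lock.
	 */
	if (vma->detached) {
		vma_end_read(vma);
		goto inval;
	}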

By the way, if there are no 32-bit users of the per-VMA lock (are there?),
the "detached" bool could instead be a VMA flag (i.e. made to depend on 64BIT
and select ARCH_USES_HIGH_VMA_FLAGS).

Thanks,
Hyeonggon



Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-16 Thread Hyeonggon Yoo
On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d40bf8a5e19e..294dd44b2198 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct 
> *vma)
>* mm->mm_lock_seq can't be concurrently modified.
>*/
>   mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> - if (vma->vm_lock_seq == mm_lock_seq)
> + if (vma->vm_lock->lock_seq == mm_lock_seq)
>   return;
>  
> - down_write(&vma->vm_lock->lock);
> - vma->vm_lock_seq = mm_lock_seq;
> - up_write(&vma->vm_lock->lock);
> + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> + wait_event(vma->vm_mm->vma_writer_wait,
> +atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> + vma->vm_lock->lock_seq = mm_lock_seq;
> + /* Write barrier to ensure lock_seq change is visible before count */
> + smp_wmb();
> + atomic_set(&vma->vm_lock->count, 0);
>  }
>  
>  /*
> @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct 
> *vma)
>  static inline bool vma_read_trylock(struct vm_area_struct *vma)
>  {
>   /* Check before locking. A race might cause false locked result. */
> - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
>   return false;
>  
> - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
>   return false;
>  
> + /* If atomic_t overflows, restore and fail to lock. */
> + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> + if (atomic_dec_and_test(&vma->vm_lock->count))
> + wake_up(&vma->vm_mm->vma_writer_wait);
> + return false;
> + }
> +
>   /*
>* Overflow might produce false locked result.
>* False unlocked result is impossible because we modify and check
>* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>* modification invalidates all existing locks.
>*/
> - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> - up_read(&vma->vm_lock->lock);
> + if (unlikely(vma->vm_lock->lock_seq == 
> READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> + if (atomic_dec_and_test(&vma->vm_lock->count))
> + wake_up(&vma->vm_mm->vma_writer_wait);
>   return false;
>   }

With this change readers can cause writers to starve.
What about checking waitqueue_active() before or after increasing
vma->vm_lock->count?
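
Something like the sketch below is what I had in mind -- untested, and it still
has the usual waitqueue_active() ordering caveats:

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	/* Check before locking. A race might cause false locked result. */
	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	/*
	 * Back off if a writer is already waiting, so readers cannot
	 * starve it by keeping vm_lock->count elevated forever.
	 */
	if (unlikely(waitqueue_active(&vma->vm_mm->vma_writer_wait)))
		return false;

	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
		return false;
	...
}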

--
Thanks,
Hyeonggon