Re: [Intel-gfx] [PATCH] x86: Mark up large pm4/5 constants with UL

2018-04-29 Thread Kirill A. Shutemov
On Sun, Apr 29, 2018 at 12:48:32PM +0100, Chris Wilson wrote:
> To silence sparse while maintaining compatibility with the assembly, use
> _UL which conditionally only appends the UL suffix for C code.

http://lkml.kernel.org/r/nycvar.yfh.7.76.1804121437350.28...@cbobk.fhfr.pm
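(For reference, _UL()/_ULL() come from include/uapi/linux/const.h; paraphrased, the mechanism is:)

#ifdef __ASSEMBLY__
#define _AC(X, Y)       X               /* assembler: bare constant, no suffix */
#else
#define __AC(X, Y)      (X##Y)
#define _AC(X, Y)       __AC(X, Y)      /* C: paste the suffix on */
#endif

#define _UL(x)          (_AC(x, UL))
#define _ULL(x)         (_AC(x, ULL))

/* illustrative use: one definition that works from both .S and .c files */
#define EXAMPLE_MASK    (_UL(1) << 39)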

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH] shmem: fix pageflags after swapping DMA32 object

2016-11-07 Thread Kirill A. Shutemov
On Sun, Nov 06, 2016 at 08:08:29PM -0800, Hugh Dickins wrote:
> If shmem_alloc_page() does not set PageLocked and PageSwapBacked, then
> shmem_replace_page() needs to do so for itself.  Without this, it puts
> newpage on the wrong lru, re-unlocks the unlocked newpage, and system
> descends into "Bad page" reports and freeze; or if CONFIG_DEBUG_VM=y,
> it hits an earlier VM_BUG_ON_PAGE(!PageLocked), depending on config.
> 
> But shmem_replace_page() is not a common path: it's only called when
> swapin (or swapoff) finds the page was already read into an unsuitable
> zone: usually all zones are suitable, but gem objects for a few drm
> devices (gma500, omapdrm, crestline, broadwater) require zone DMA32
> if there's more than 4GB of ram.
> 
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: sta...@vger.kernel.org # v4.8
> Signed-off-by: Hugh Dickins 

Sorry for that.

Acked-by: Kirill A. Shutemov 
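(For context, the fix Hugh describes amounts to marking the freshly allocated
replacement page the same way shmem_alloc_page() would have; a minimal sketch,
using the non-atomic __Set helpers since newpage is not yet visible to anyone:)

        /* in shmem_replace_page(), right after allocating newpage: */
        __SetPageLocked(newpage);       /* keep it locked, like oldpage */
        __SetPageSwapBacked(newpage);   /* so it lands on the correct LRU */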

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH v4 1/2] shmem: Support for registration of driver/file owner specific ops

2016-04-25 Thread Kirill A. Shutemov
On Mon, Apr 04, 2016 at 02:18:10PM +0100, Chris Wilson wrote:
> From: Akash Goel 
> 
> This provides support for the drivers or shmem file owners to register
> a set of callbacks, which can be invoked from the address space
> operations methods implemented by shmem.  This allows the file owners to
> hook into the shmem address space operations to do some extra/custom
> operations in addition to the default ones.
> 
> The private_data field of address_space struct is used to store the
> pointer to driver specific ops.  Currently only one ops field is defined,
> which is migratepage, but can be extended on an as-needed basis.
> 
> The need for driver specific operations arises since some of the
> operations (like migratepage) may not be handled completely within shmem,
> so as to be effective, and would need some driver specific handling also.
> Specifically, i915.ko would like to participate in migratepage().
> i915.ko uses shmemfs to provide swappable backing storage for its user
> objects, but when those objects are in use by the GPU it must pin the
> entire object until the GPU is idle.  As a result, large chunks of memory
> can be arbitrarily withdrawn from page migration, resulting in premature
> out-of-memory due to fragmentation.  However, if i915.ko can receive the
> migratepage() request, it can then flush the object from the GPU, remove
> its pin and thus enable the migration.
> 
> Since gfx allocations are one of the major consumers of system memory, it's
> imperative to have such a mechanism to effectively deal with
> fragmentation.  And therefore the need for such a provision for initiating
> driver specific actions during address space operations.
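(For orientation, the registration interface described above is roughly the
sketch below; the struct and field names are illustrative, not the exact API
from the patch.)

struct shmem_dev_info {
        void *private_data;
        int (*migratepage)(struct address_space *mapping,
                           struct page *newpage, struct page *page,
                           enum migrate_mode mode, void *private_data);
};

static inline void shmem_set_dev_info(struct address_space *mapping,
                                      struct shmem_dev_info *info)
{
        /* consulted later from shmem's address_space_operations */
        mapping->private_data = info;
}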

Hm. Sorry, my ignorance, but shouldn't this kind of flushing be done in
response to mmu_notifier's ->invalidate_page?

I'm not aware of how i915 works and what its expectations are wrt shmem.
Do you have some userspace VMA which is mirrored on the GPU side?
If yes, migration would cause unmapping of these pages and trigger the
mmu_notifier's hook.

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH v4 1/2] shmem: Support for registration of driver/file owner specific ops

2016-04-26 Thread Kirill A. Shutemov
On Tue, Apr 26, 2016 at 02:53:41PM +0200, Daniel Vetter wrote:
> On Mon, Apr 25, 2016 at 02:42:50AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 04, 2016 at 02:18:10PM +0100, Chris Wilson wrote:
> > > From: Akash Goel 
> > > 
> > > This provides support for the drivers or shmem file owners to register
> > > a set of callbacks, which can be invoked from the address space
> > > operations methods implemented by shmem.  This allows the file owners to
> > > hook into the shmem address space operations to do some extra/custom
> > > operations in addition to the default ones.
> > > 
> > > The private_data field of address_space struct is used to store the
> > > pointer to driver specific ops.  Currently only one ops field is defined,
> > > which is migratepage, but can be extended on an as-needed basis.
> > > 
> > > The need for driver specific operations arises since some of the
> > > operations (like migratepage) may not be handled completely within shmem,
> > > so as to be effective, and would need some driver specific handling also.
> > > Specifically, i915.ko would like to participate in migratepage().
> > > i915.ko uses shmemfs to provide swappable backing storage for its user
> > > objects, but when those objects are in use by the GPU it must pin the
> > > entire object until the GPU is idle.  As a result, large chunks of memory
> > > can be arbitrarily withdrawn from page migration, resulting in premature
> > > out-of-memory due to fragmentation.  However, if i915.ko can receive the
> > > migratepage() request, it can then flush the object from the GPU, remove
> > > its pin and thus enable the migration.
> > > 
> > > Since gfx allocations are one of the major consumers of system memory, it's
> > > imperative to have such a mechanism to effectively deal with
> > > fragmentation.  And therefore the need for such a provision for initiating
> > > driver specific actions during address space operations.
> > 
> > Hm. Sorry, my ignorance, but shouldn't this kind of flushing be done in
> > response to mmu_notifier's ->invalidate_page?
> > 
> > I'm not aware of how i915 works and what its expectations are wrt shmem.
> > Do you have some userspace VMA which is mirrored on the GPU side?
> > If yes, migration would cause unmapping of these pages and trigger the
> > mmu_notifier's hook.
> 
> We do that for userptr pages (i.e. stuff we steal from userspace address
> spaces). But we also have native gfx buffer objects based on shmem files,
> and thus far we need to allocate them as !GFP_MOVEABLE. And we allocate a
> _lot_ of those. And those files aren't mapped into any cpu address space
> (ofc they're mapped on the gpu side, but that's driver private), from the
> core mm they are pure pagecache. And afaiui for that we need to wire up
> the migratepage hooks through shmem to i915_gem.c

I see.

I don't particularly like the way the patch hooks into migrate, but I don't
have a good idea how to implement this better.

This approach allows hooking into any shmem file, which could be abused by
drivers later.

I wonder if it would be better for i915 to have its own in-kernel mount
with a variant of tmpfs which provides different mapping->a_ops? Or is that
overkill? I don't know.

Hugh?
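(A minimal sketch of that idea, along the lines of what i915's gemfs later
did; error handling trimmed, names illustrative:)

static struct vfsmount *i915_gemfs;

static int i915_gemfs_init(void)
{
        struct file_system_type *type = get_fs_type("tmpfs");

        if (!type)
                return -ENODEV;

        i915_gemfs = kern_mount(type);  /* private mount, separate from shm_mnt */
        return PTR_ERR_OR_ZERO(i915_gemfs);
}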

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH 06/17] mm/shmem: expose driver overridable huge option

2017-05-16 Thread Kirill A. Shutemov
On Tue, May 16, 2017 at 09:29:37AM +0100, Matthew Auld wrote:
> In i915 we are aiming to support huge GTT pages for the GPU, and to
> complement this we also want to enable THP for our shmem backed objects.
> Even though THP is supported in shmemfs it can only be enabled through
> the huge= mount option, but for users of the kernel mounted shm_mnt like
> i915, we are a little stuck. There is the sysfs knob shmem_enabled to
> either forcefully enable/disable the feature, but that seems to only be
> useful for testing purposes. What we propose is to expose a driver
> overridable huge option as part of shmem_inode_info to control the use
> of THP for a given mapping.

I don't like this. It's kinda hacky.

Is there a reason why i915 cannot mount a new tmpfs for its own use?

Another option would be to change the default to SHMEM_HUGE_ADVISE and wire
up an fadvise handler to control per-file allocation policy.
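(A minimal sketch of the first option: a driver-private tmpfs mount with the
THP policy fixed at mount time via the huge= option string. Assumes the
pre-fs_context mount API and omits error handling.)

        struct file_system_type *type = get_fs_type("tmpfs");
        struct vfsmount *mnt;

        mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name,
                             (void *)"huge=within_size");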

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH 01/21] mm/shmem: introduce shmem_file_setup_with_mnt

2017-07-28 Thread Kirill A. Shutemov
On Tue, Jul 25, 2017 at 08:21:13PM +0100, Matthew Auld wrote:
> We are planning to use our own tmpfs mnt in i915 in place of the
> shm_mnt, such that we can control the mount options, in particular
> huge=, which we require to support huge-gtt-pages. So rather than roll
> our own version of __shmem_file_setup, it would be preferred if we could
> just give shmem our mnt, and let it do the rest.
> 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Chris Wilson 
> Cc: Dave Hansen 
> Cc: Kirill A. Shutemov 
> Cc: Hugh Dickins 
> Cc: linux...@kvack.org

Looks okay to me.

Acked-by: Kirill A. Shutemov 
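(Usage sketch; the signature below is the one that landed in mainline, while
i915->mm.gemfs and size stand in for the driver's private mount and object
size:)

        /* struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
         *                                        const char *name, loff_t size,
         *                                        unsigned long flags); */
        struct file *filp;

        filp = shmem_file_setup_with_mnt(i915->mm.gemfs, "i915", size,
                                         VM_NORESERVE);
        if (IS_ERR(filp))
                return PTR_ERR(filp);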


-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH 1/2] mm: Report attempts to overwrite PTE from remap_pfn_range()

2014-06-16 Thread Kirill A. Shutemov
Chris Wilson wrote:
> When using remap_pfn_range() from a fault handler, we are exposed to
> races between concurrent faults. Rather than hitting a BUG, report the
> error back to the caller, like vm_insert_pfn().
> 
> Signed-off-by: Chris Wilson 
> Cc: Andrew Morton 
> Cc: "Kirill A. Shutemov" 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Mel Gorman 
> Cc: Cyrill Gorcunov 
> Cc: Johannes Weiner 
> Cc: linux...@kvack.org
> ---
>  mm/memory.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..6603a9e6a731 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2306,19 +2306,23 @@ static int remap_pte_range(struct mm_struct *mm, 
> pmd_t *pmd,
>  {
>   pte_t *pte;
>   spinlock_t *ptl;
> + int ret = 0;
>  
>   pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>   if (!pte)
>   return -ENOMEM;
>   arch_enter_lazy_mmu_mode();
>   do {
> - BUG_ON(!pte_none(*pte));
> + if (!pte_none(*pte)) {
> + ret = -EBUSY;
> + break;

I think you need to at least remove the entries you've already set up if the
check fails on an iteration other than the first.

And nothing propagates your -EBUSY back to remap_pfn_range(): the caller will
see -ENOMEM, which is not what you want, I believe.
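(Sketch of the second point: remap_pmd_range() would need to forward the error
instead of collapsing it to -ENOMEM, e.g.:)

        err = remap_pte_range(mm, pmd, addr, next,
                              pfn + (addr >> PAGE_SHIFT), prot);
        if (err)
                return err;     /* lets -EBUSY reach remap_pfn_range()'s caller */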

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()

2014-06-19 Thread Kirill A. Shutemov
Chris Wilson wrote:
> When using remap_pfn_range() from a fault handler, we are exposed to
> races between concurrent faults. Rather than hitting a BUG, report the
> error back to the caller, like vm_insert_pfn().
> 
> v2: Fix the pte address for unmapping along the error path.
> v3: Report the error back and cleanup partial remaps.
> 
> Signed-off-by: Chris Wilson 
> Cc: Andrew Morton 
> Cc: "Kirill A. Shutemov" 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Mel Gorman 
> Cc: Cyrill Gorcunov 
> Cc: Johannes Weiner 
> Cc: linux...@kvack.org
> ---
> 
> Whilst this has the semantics I want to allow two concurrent, but
> serialised, pagefaults that try to prefault the same object to succeed,
> it looks fragile and fraught with subtlety.
> -Chris
> 
> ---
>  mm/memory.c | 54 ++
>  1 file changed, 38 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index d67fd9f..be51fcc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1657,32 +1657,41 @@ EXPORT_SYMBOL(vm_insert_mixed);
>   * in null mappings (currently treated as "copy-on-access")
>   */
>  static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +unsigned long addr, unsigned long end,
> +unsigned long pfn, pgprot_t prot,
> +bool first)
>  {

With this long parameter list, wouldn't it be cleaner to pass down a pointer
to a structure instead? That could simplify the code, I believe.

>   pte_t *pte;
>   spinlock_t *ptl;
> + int err = 0;
>  
>   pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>   if (!pte)
>   return -ENOMEM;
>   arch_enter_lazy_mmu_mode();
>   do {
> - BUG_ON(!pte_none(*pte));
> + if (!pte_none(*pte)) {
> + err = first ? -EBUSY : -EINVAL;
> + pte++;
> + break;
> + }
> + first = false;
>   set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
>   pfn++;
>   } while (pte++, addr += PAGE_SIZE, addr != end);
>   arch_leave_lazy_mmu_mode();
>   pte_unmap_unlock(pte - 1, ptl);
> - return 0;
> + return err;
>  }
>  
>  static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +   unsigned long addr, unsigned long end,
> +   unsigned long pfn, pgprot_t prot,
> +   bool first)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> + int err;
>  
>   pfn -= addr >> PAGE_SHIFT;
>   pmd = pmd_alloc(mm, pud, addr);
> @@ -1691,19 +1700,23 @@ static inline int remap_pmd_range(struct mm_struct 
> *mm, pud_t *pud,
>   VM_BUG_ON(pmd_trans_huge(*pmd));
>   do {
>   next = pmd_addr_end(addr, end);
> - if (remap_pte_range(mm, pmd, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> + err = remap_pte_range(mm, pmd, addr, next,
> +   pfn + (addr >> PAGE_SHIFT), prot, first);
> + if (err)
> + return err;
> +
> + first = false;
>   } while (pmd++, addr = next, addr != end);
>   return 0;
>  }
>  
>  static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +   unsigned long addr, unsigned long end,
> +   unsigned long pfn, pgprot_t prot, bool first)
>  {
>   pud_t *pud;
>   unsigned long next;
> + int err;
>  
>   pfn -= addr >> PAGE_SHIFT;
>   pud = pud_alloc(mm, pgd, addr);
> @@ -1711,9 +1724,12 @@ static inline int remap_pud_range(struct mm_struct 
> *mm, pgd_t *pgd,
>   return -ENOMEM;
>   do {
>   next = pud_addr_end(addr, end);
> - if (remap_pmd_range(mm, pud, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> + err = remap_pmd_range(mm, pud, addr, next,
> +   pfn + (addr >> PAGE_SHIFT), prot, first);
> 

Re: [Intel-gfx] [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()

2014-06-19 Thread Kirill A. Shutemov
Chris Wilson wrote:
> On Thu, Jun 19, 2014 at 02:50:18PM +0300, Kirill A. Shutemov wrote:
> > > + if (err) {
> > >   untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> > > + if (err != -EBUSY)
> > > + zap_page_range_single(vma, addr, size, NULL);
> > 
Hm. If I read it correctly, you zap the whole range, not only what you've
set up. Looks wrong.
> 
> Yes. I didn't fancy threading the last touched pte back, but that should
> be easier if moving to a struct.
>  
And after the zap, you probably want to return -EBUSY to the caller of
remap_pfn_range(), not -EINVAL.
> 
> No, it has to be EINVAL for my purpose. If we return EBUSY, the caller
> will just report VM_NOPAGE back to the fault handler, and the fault will
> be retriggered - but the overlapping object will still be present.

IIUC, what you're saying makes sense only if the range starts from the PTE
you got the fault on. I fail to see why this assumption should be part of the
remap_pfn_range() interface.

One possible option is to create a variant of remap_pfn_range() which returns
how many PTEs it was able to set up before hitting the !pte_none().
The caller would then decide what to do with a partially filled range.
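(A sketch of that variant; the name and exact signature here are hypothetical:)

/* Returns the number of PTEs populated before hitting a !pte_none() entry,
 * or a negative error. The caller decides what a partial fill means. */
long remap_pfn_range_partial(struct vm_area_struct *vma, unsigned long addr,
                             unsigned long pfn, unsigned long size,
                             pgprot_t prot);

        /* fault-handler side (illustrative) */
        long nr = remap_pfn_range_partial(vma, addr, pfn, size, prot);
        if (nr < 0)
                return VM_FAULT_SIGBUS;
        /* nr < size >> PAGE_SHIFT means another fault already populated the
         * rest of the range, which is fine for this use case */
        return VM_FAULT_NOPAGE;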

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH] mm: Report attempts to overwrite PTE from remap_pfn_range()

2014-06-19 Thread Kirill A. Shutemov
Chris Wilson wrote:
> On Thu, Jun 19, 2014 at 03:57:46PM +0300, Kirill A. Shutemov wrote:
> > One possible option is to create a variant of remap_pfn_range() which will
> > return how many PTEs it was able to setup, before hitting the !pte_none().
> > Caller will decide what to do with partially filled range.
> 
> Looked at just returning the address remap_pfn_range() got up to, which is
> easy enough, but I think given that remap_pfn_range() will clean up
> correctly after a failed remap, any EBUSY from partway through would be
> a pathological driver error. 

I would prefer to keep the remap_pfn_range() interface intact, with the
BUG_ON() on an unexpected !pte_none(), and introduce a new function with more
flexible behaviour (sharing the underlying infrastructure).
This way we avoid changing every remap_pfn_range() caller.

-- 
 Kirill A. Shutemov


Re: [Intel-gfx] [PATCH 1/4] mm: Refactor remap_pfn_range()

2014-06-30 Thread Kirill A. Shutemov
Chris Wilson wrote:
> In preparation for exporting very similar functionality through another
> interface, gut the current remap_pfn_range(). The motivating factor here
> is to reuse the PGD/PUD/PMD/PTE walker, but allow back propagation of
> errors rather than BUG_ON.
> 
> Signed-off-by: Chris Wilson 
> Cc: Andrew Morton 
> Cc: "Kirill A. Shutemov" 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Mel Gorman 
> Cc: Cyrill Gorcunov 
> Cc: Johannes Weiner 
> Cc: linux...@kvack.org
> ---
>  mm/memory.c | 102 
> +---
>  1 file changed, 57 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..d2c7fe88a289 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2295,71 +2295,81 @@ int vm_insert_mixed(struct vm_area_struct *vma, 
> unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_mixed);
>  
> +struct remap_pfn {
> + struct mm_struct *mm;
> + unsigned long addr;
> + unsigned long pfn;
> + pgprot_t prot;
> +};
> +
>  /*
>   * maps a range of physical memory into the requested pages. the old
>   * mappings are removed. any references to nonexistent pages results
>   * in null mappings (currently treated as "copy-on-access")
>   */
> -static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +static inline int remap_pfn(struct remap_pfn *r, pte_t *pte)
> +{
> + if (!pte_none(*pte))
> + return -EBUSY;
> +
> + set_pte_at(r->mm, r->addr, pte,
> +pte_mkspecial(pfn_pte(r->pfn, r->prot)));
> + r->pfn++;
> + r->addr += PAGE_SIZE;
> + return 0;
> +}
> +
> +static int remap_pte_range(struct remap_pfn *r, pmd_t *pmd, unsigned long 
> end)
>  {
>   pte_t *pte;
>   spinlock_t *ptl;
> + int err;
>  
> - pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + pte = pte_alloc_map_lock(r->mm, pmd, r->addr, &ptl);
>   if (!pte)
>   return -ENOMEM;
> +
>   arch_enter_lazy_mmu_mode();
>   do {
> - BUG_ON(!pte_none(*pte));
> - set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> - pfn++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + err = remap_pfn(r, pte++);
> + } while (err == 0 && r->addr < end);
>   arch_leave_lazy_mmu_mode();
> +
>   pte_unmap_unlock(pte - 1, ptl);
> - return 0;
> + return err;
>  }
>  
> -static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +static inline int remap_pmd_range(struct remap_pfn *r, pud_t *pud, unsigned 
> long end)
>  {
>   pmd_t *pmd;
> - unsigned long next;
> + int err;
>  
> - pfn -= addr >> PAGE_SHIFT;
> - pmd = pmd_alloc(mm, pud, addr);
> + pmd = pmd_alloc(r->mm, pud, r->addr);
>   if (!pmd)
>   return -ENOMEM;
>   VM_BUG_ON(pmd_trans_huge(*pmd));
> +
>   do {
> - next = pmd_addr_end(addr, end);
> - if (remap_pte_range(mm, pmd, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pmd++, addr = next, addr != end);
> - return 0;
> + err = remap_pte_range(r, pmd++, pmd_addr_end(r->addr, end));
> + } while (err == 0 && r->addr < end);
> +
> + return err;
>  }
>  
> -static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +static inline int remap_pud_range(struct remap_pfn *r, pgd_t *pgd, unsigned 
> long end)
>  {
>   pud_t *pud;
> - unsigned long next;
> + int err;
>  
> - pfn -= addr >> PAGE_SHIFT;
> - pud = pud_alloc(mm, pgd, addr);
> + pud = pud_alloc(r->mm, pgd, r->addr);
>   if (!pud)
>   return -ENOMEM;
> +
>   do {
> - next = pud_addr_end(addr, end);
> - if (remap_pmd_range(mm, pud, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pud++, addr = next, addr != end);
> - return 0;
> + err = remap_pmd_range(r, pud++, pud_addr_end(r->addr, end));
>

Re: [Intel-gfx] [PATCH 3/4] mm: Export remap_io_mapping()

2014-06-30 Thread Kirill A. Shutemov
Chris Wilson wrote:
> This is similar to remap_pfn_range(), and uses the recently refactored
> code to do the page table walking. The key difference is that it back-
> propagates its error, as this is required for use from within a pagefault
> handler. The other difference is that it combines the page protection
> from io-mapping, which is known from when the io-mapping is created,
> with the per-vma page protection flags. This avoids having to walk the
> entire system description to rediscover the special page protection
> established for the io-mapping.
> 
> Signed-off-by: Chris Wilson 

Looks okay to me:

Acked-by: Kirill A. Shutemov 
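(For orientation, a hedged sketch of the fault-handler use this is designed
for; the signature matches the helper as it later landed in i915, and
bar_pfn / bar_iomap are illustrative:)

/* int remap_io_mapping(struct vm_area_struct *vma, unsigned long addr,
 *                      unsigned long pfn, unsigned long size,
 *                      struct io_mapping *iomap); */

static vm_fault_t gpu_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        int err;

        /* PTE protection comes from the io_mapping itself, combined with
         * vma->vm_page_prot inside the helper */
        err = remap_io_mapping(vma, vma->vm_start, bar_pfn,
                               vma->vm_end - vma->vm_start, &bar_iomap);
        if (err == -EBUSY)
                return VM_FAULT_NOPAGE; /* a concurrent fault already mapped it */

        return err ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
}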

-- 
 Kirill A. Shutemov


Re: [PATCHv3 06/11] mm/vmscan: Use PG_dropbehind instead of PG_reclaim

2025-02-03 Thread Kirill A. Shutemov
On Sat, Feb 01, 2025 at 04:01:43PM +0800, Kairui Song wrote:
> On Thu, Jan 30, 2025 at 6:02 PM Kirill A. Shutemov
>  wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > pageout().
> >
> > It is safe to leave PG_dropbehind on the folio if, for some reason
> > (bug?), the folio is not in a writeback state after ->writepage().
> > In these cases, the kernel had to clear PG_reclaim as it shared a page
> > flag bit with PG_readahead.
> >
> > Signed-off-by: Kirill A. Shutemov 
> > Acked-by: David Hildenbrand 
> > ---
> >  mm/vmscan.c | 9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc1826020159..c97adb0fdaa4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -692,19 +692,16 @@ static pageout_t pageout(struct folio *folio, struct 
> > address_space *mapping,
> > if (shmem_mapping(mapping) && folio_test_large(folio))
> > wbc.list = folio_list;
> >
> > -   folio_set_reclaim(folio);
> > +   folio_set_dropbehind(folio);
> > +
> > res = mapping->a_ops->writepage(&folio->page, &wbc);
> > if (res < 0)
> > handle_write_error(mapping, folio, res);
> > if (res == AOP_WRITEPAGE_ACTIVATE) {
> > -   folio_clear_reclaim(folio);
> > +   folio_clear_dropbehind(folio);
> > return PAGE_ACTIVATE;
> > }
> >
> > -   if (!folio_test_writeback(folio)) {
> > -   /* synchronous write or broken a_ops? */
> > -   folio_clear_reclaim(folio);
> > -   }
> > trace_mm_vmscan_write_folio(folio);
> > node_stat_add_folio(folio, NR_VMSCAN_WRITE);
> > return PAGE_SUCCESS;
> > --
> > 2.47.2
> >
> 
> Hi, I'm seeing following panic with SWAP after this commit:
> 
> [   29.672319] Oops: general protection fault, probably for
> non-canonical address 0x88909a3be3:  [#1] PREEMPT SMP NOPTI
> [   29.675503] CPU: 82 UID: 0 PID: 5145 Comm: tar Kdump: loaded Not
> tainted 6.13.0.ptch-g1fe9ea48ec98 #917
> [   29.677508] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> [   29.678886] RIP: 0010:__lock_acquire+0x20/0x15d0

Ouch.

I failed to trigger it on my setup. Could you share your reproducer?

> I'm testing with PROVE_LOCKING on. It seems folio_unmap_invalidate is
> called for a swapcache folio and it doesn't work well; the following patch
> on top of mm-unstable seems to fix it:

Right. I don't understand swapping well enough. I missed this.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4fe551037bf7..98493443d120 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1605,8 +1605,9 @@ static void folio_end_reclaim_write(struct folio *folio)
>  * invalidation in that case.
>  */
> if (in_task() && folio_trylock(folio)) {
> -   if (folio->mapping)
> -   folio_unmap_invalidate(folio->mapping, folio, 0);
> +   struct address_space *mapping = folio_mapping(folio);
> +   if (mapping)
> +   folio_unmap_invalidate(mapping, folio, 0);
> folio_unlock(folio);
> }
>  }

Once you do this, folio_unmap_invalidate() will never succeed for
swapcache, as the folio->mapping != mapping check will always be true and it
will fail with -EBUSY.

I guess we need to do something similar to what __remove_mapping() does
for swapcache folios.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCH 0/8] mm: Remove PG_reclaim

2025-01-13 Thread Kirill A. Shutemov
On Mon, Jan 13, 2025 at 01:45:48PM +, Matthew Wilcox wrote:
> On Mon, Jan 13, 2025 at 11:34:45AM +0200, Kirill A. Shutemov wrote:
> > Use PG_dropbehind instead of PG_reclaim and remove PG_reclaim.
> 
> I was hoping we'd end up with the name PG_reclaim instead of the name
> PG_dropbehind.  PG_reclaim is a better name for this functionality.

I got burned by re-using a name with the MAX_ORDER redefinition.
I guess it is less risky here as it is less used, but still...

Anyway, it can be done with a patch on top of the patchset. We must get
rid of the current PG_reclaim first.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCH 4/8] mm/swap: Use PG_dropbehind instead of PG_reclaim

2025-01-14 Thread Kirill A. Shutemov
On Mon, Jan 13, 2025 at 08:17:20AM -0800, Yosry Ahmed wrote:
> On Mon, Jan 13, 2025 at 1:35 AM Kirill A. Shutemov
>  wrote:
> >
> > The recently introduced PG_dropbehind allows for freeing folios
> > immediately after writeback. Unlike PG_reclaim, it does not need vmscan
> > to be involved to get the folio freed.
> >
> > Instead of using folio_set_reclaim(), use folio_set_dropbehind() in
> > lru_deactivate_file().
> >
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  mm/swap.c | 8 +---
> >  1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index fc8281ef4241..4eb33b4804a8 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -562,14 +562,8 @@ static void lru_deactivate_file(struct lruvec *lruvec, 
> > struct folio *folio)
> > folio_clear_referenced(folio);
> >
> > if (folio_test_writeback(folio) || folio_test_dirty(folio)) {
> > -   /*
> > -* Setting the reclaim flag could race with
> > -* folio_end_writeback() and confuse readahead.  But the
> > -* race window is _really_ small and  it's not a critical
> > -* problem.
> > -*/
> > lruvec_add_folio(lruvec, folio);
> > -   folio_set_reclaim(folio);
> > +   folio_set_dropbehind(folio);
> > } else {
> > /*
> >  * The folio's writeback ended while it was in the batch.
> 
> Now there's a difference in behavior here depending on whether or not
> the folio is under writeback (or will be written back soon). If it is,
> we set PG_dropbehind to get it freed right after, but if writeback has
> already ended we put it on the tail of the LRU to be freed later.
> 
> It's a bit counterintuitive to me that folios with pending writeback
> get freed faster than folios that completed their writeback already.
> Am I missing something?

Yeah, it is strange.

I think we can drop the writeback/dirty check. Set PG_dropbehind and put
the page on the tail of the LRU unconditionally. The check was only required
to avoid confusion with PG_readahead.

The comment above the function is not valid anymore.

But a folio that is still dirty or under writeback will be freed faster, as
we get rid of it just after writeback is done, while a clean page can
dangle on the LRU for a while.

I don't think we have any convenient place to free a clean dropbehind page
other than shrink_folio_list(). Or do we?

Looking at shrink_folio_list(), I think we need to bypass page demotion
for PG_dropbehind pages.
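(Sketch of that, reusing the helpers already used elsewhere in the series:)

static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio)
{
        if (folio_test_unevictable(folio) || folio_mapped(folio))
                return;

        lruvec_del_folio(lruvec, folio);
        folio_clear_active(folio);
        folio_clear_referenced(folio);

        /* no writeback/dirty special case: mark dropbehind and move the folio
         * to the tail of the inactive list unconditionally */
        folio_set_dropbehind(folio);
        lruvec_add_folio_tail(lruvec, folio);
}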

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCH 8/8] mm: Remove PG_reclaim

2025-01-14 Thread Kirill A. Shutemov
On Mon, Jan 13, 2025 at 03:28:43PM +, Matthew Wilcox wrote:
> On Mon, Jan 13, 2025 at 11:34:53AM +0200, Kirill A. Shutemov wrote:
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index caadbe393aa2..beba72da5e33 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -686,6 +686,8 @@ void folio_migrate_flags(struct folio *newfolio, struct 
> > folio *folio)
> > folio_set_young(newfolio);
> > if (folio_test_idle(folio))
> > folio_set_idle(newfolio);
> > +   if (folio_test_readahead(folio))
> > +   folio_set_readahead(newfolio);
> >  
> > folio_migrate_refs(newfolio, folio);
> > /*
> 
> Not a problem with this patch ... but aren't we missing a
> test_dropbehind / set_dropbehind pair in this function?  Or are we
> prohibited from migrating a folio with the dropbehind flag set
> somewhere?

Hm. Good catch.

We might want to drop clean dropbehind pages instead of migrating them.

But I am not sure about the dirty ones. With slow backing storage it might be
better for the system to migrate them instead of keeping them in the old
place for a potentially long time.

Any opinions?
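(The missing pair Matthew points at would look like the two lines below,
mirroring the readahead pair above; whether to copy the flag or instead drop
clean dropbehind folios is exactly the open question:)

        if (folio_test_dropbehind(folio))
                folio_set_dropbehind(newfolio);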

> > +++ b/mm/swap.c
> > @@ -221,22 +221,6 @@ static void lru_move_tail(struct lruvec *lruvec, 
> > struct folio *folio)
> > __count_vm_events(PGROTATED, folio_nr_pages(folio));
> >  }
> >  
> > -/*
> > - * Writeback is about to end against a folio which has been marked for
> > - * immediate reclaim.  If it still appears to be reclaimable, move it
> > - * to the tail of the inactive list.
> > - *
> > - * folio_rotate_reclaimable() must disable IRQs, to prevent nasty races.
> > - */
> > -void folio_rotate_reclaimable(struct folio *folio)
> > -{
> > -   if (folio_test_locked(folio) || folio_test_dirty(folio) ||
> > -   folio_test_unevictable(folio))
> > -   return;
> > -
> > -   folio_batch_add_and_move(folio, lru_move_tail, true);
> > -}
> 
> I think this is the last caller of lru_move_tail(), which means we can
> get rid of fbatches->lru_move_tail and the local_lock that protects it.
> Or did I miss something?

I see lru_move_tail() being used by lru_add_drain_cpu().

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCHv2 05/11] mm/truncate: Use folio_set_dropbehind() instead of deactivate_file_folio()

2025-01-17 Thread Kirill A. Shutemov
On Wed, Jan 15, 2025 at 02:46:44PM -0700, Yu Zhao wrote:
> On Wed, Jan 15, 2025 at 2:35 PM Matthew Wilcox  wrote:
> >
> > On Wed, Jan 15, 2025 at 11:31:29AM +0200, Kirill A. Shutemov wrote:
> > > -static void lru_deactivate_file(struct lruvec *lruvec, struct folio 
> > > *folio)
> > > -{
> > > - bool active = folio_test_active(folio) || lru_gen_enabled();
> > > - long nr_pages = folio_nr_pages(folio);
> > > -
> > > - if (folio_test_unevictable(folio))
> > > - return;
> > > -
> > > - /* Some processes are using the folio */
> > > - if (folio_mapped(folio))
> > > - return;
> > > -
> > > - lruvec_del_folio(lruvec, folio);
> > > - folio_clear_active(folio);
> > > - folio_clear_referenced(folio);
> > > -
> > > - if (folio_test_writeback(folio) || folio_test_dirty(folio)) {
> > > - /*
> > > -  * Setting the reclaim flag could race with
> > > -  * folio_end_writeback() and confuse readahead.  But the
> > > -  * race window is _really_ small and  it's not a critical
> > > -  * problem.
> > > -  */
> > > - lruvec_add_folio(lruvec, folio);
> > > - folio_set_reclaim(folio);
> > > - } else {
> > > - /*
> > > -  * The folio's writeback ended while it was in the batch.
> > > -  * We move that folio to the tail of the inactive list.
> > > -  */
> > > - lruvec_add_folio_tail(lruvec, folio);
> > > - __count_vm_events(PGROTATED, nr_pages);
> > > - }
> > > -
> > > - if (active) {
> > > - __count_vm_events(PGDEACTIVATE, nr_pages);
> > > - __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,
> > > -  nr_pages);
> > > - }
> > > -}
> >
> > > +++ b/mm/truncate.c
> > > @@ -486,7 +486,7 @@ unsigned long mapping_try_invalidate(struct 
> > > address_space *mapping,
> > >* of interest and try to speed up its reclaim.
> > >*/
> > >   if (!ret) {
> > > - deactivate_file_folio(folio);
> > > + folio_set_dropbehind(folio);
> >
> > brr.
> >
> > This is a fairly substantial change in semantics, and maybe it's fine.
> >
> > At a high level, we're trying to remove pages from an inode that aren't
> > in use.  But we might find that some of them are in use (eg they're
> > mapped or under writeback).  If they are mapped, we don't currently
> > try to accelerate their reclaim, but now we're going to mark them
> > as dropbehind.  I think that's wrong.
> >
> > If they're dirty or under writeback, then yes, mark them as dropbehind, but
> > I think we need to be a little more surgical here.  Maybe preserve the
> > unevictable check too.
> 
> Right -- deactivate_file_folio() does make sure the folio is not
> unevictable or mapped. So with something like below, would the
> change in semantics be close enough?
> 
>   if (!folio_test_unevictable(folio) && !folio_mapped(folio))
> folio_set_dropbehind(folio);

Okay, makes sense to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCHv2 11/11] mm: Rename PG_dropbehind to PG_reclaim

2025-01-17 Thread Kirill A. Shutemov
On Wed, Jan 15, 2025 at 10:18:16PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 11:31:35AM +0200, Kirill A. Shutemov wrote:
> > Now as PG_reclaim is gone, its name can be reclaimed for better
> > use :)
> > 
> > Rename PG_dropbehind to PG_reclaim and rename all helpers around it.
> 
> Why?  reclaim is completely generic and reclaim can mean many
> different things.  dropbehind is much more specific.

Dropbehind is a somewhat obscure name. You need a fair bit of context to
understand what it does.

But I don't care that much. We can keep it as PG_dropbehind.

Does anybody else have an opinion on this?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov