Re: [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page

2024-09-21 Thread Dan Williams
Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
> check in __gup_device_huge() is redundant. Remove it
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Jason Gunthorpe 
> Acked-by: David Hildenbrand 
> ---
>  mm/gup.c | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index d19884e..5d2fc9a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2954,11 +2954,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, 
> unsigned long addr,
>   break;
>   }
>  
> - if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
> - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
> - break;
> - }
> -

Looks good.

Reviewed-by: Dan Williams 



Re: [PATCH 04/12] mm: Allow compound zone device pages

2024-09-21 Thread Dan Williams
Alistair Popple wrote:
> Zone device pages are used to represent various type of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
> 
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
> 
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
> 
> A tail page is distinguished by having bit zero being set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
> 
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and can be
> moved into the folio and we can use the standard scheme to find
> compound_head from a tail page.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Jason Gunthorpe 
> 
> ---
> 
> Changes since v1:
> 
>  - Move pgmap to the folio as suggested by Matthew Wilcox
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  3 ++-
>  drivers/pci/p2pdma.c   |  6 +++---
>  include/linux/memremap.h   |  6 +++---
>  include/linux/migrate.h|  4 ++--
>  include/linux/mm_types.h   |  9 +++--
>  include/linux/mmzone.h |  8 +++-
>  lib/test_hmm.c |  3 ++-
>  mm/hmm.c   |  2 +-
>  mm/memory.c|  4 +++-
>  mm/memremap.c  | 14 +++---
>  mm/migrate_device.c|  7 +--
>  mm/mm_init.c   |  2 +-
>  12 files changed, 43 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 6fb65b0..58d308c 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -88,7 +88,8 @@ struct nouveau_dmem {
>  
>  static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
>  {
> - return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
> + return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk,

page_dev_pagemap() feels like a mouthful. I would be ok with
page_pgmap() since that is the most common identifier for struct
dev_pagemap instances.

> + pagemap);
>  }
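For the rename, a strawman of what I have in mind, i.e. mostly just
s/page_dev_pagemap/page_pgmap/ on the helper this patch adds to
mmzone.h; the sanity check is an extra suggestion and the whole thing
is untested:

static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
	/* only ZONE_DEVICE pages carry a pgmap */
	VM_WARN_ON_ONCE(!is_zone_device_page(page));
	return page_folio(page)->pgmap;
}
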
>  
>  static struct nouveau_drm *page_to_drm(struct page *page)
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 210b9f4..a58f2c1 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -199,7 +199,7 @@ static const struct attribute_group p2pmem_group = {
>  
>  static void p2pdma_page_free(struct page *page)
>  {
> - struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
>   /* safe to dereference while a reference is held to the percpu ref */
>   struct pci_p2pdma *p2pdma =
>   rcu_dereference_protected(pgmap->provider->p2pdma, 1);
> @@ -1022,8 +1022,8 @@ enum pci_p2pdma_map_type
>  pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device 
> *dev,
>  struct scatterlist *sg)
>  {
> - if (state->pgmap != sg_page(sg)->pgmap) {
> - state->pgmap = sg_page(sg)->pgmap;
> + if (state->pgmap != page_dev_pagemap(sg_page(sg))) {
> + state->pgmap = page_dev_pagemap(sg_page(sg));
>   state->map = pci_p2pdma_map_type(state->pgmap, dev);
>   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
>   }
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3f7143a..14273e6 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct 
> page *page)
>  {
>   return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
>   is_zone_device_page(page) &&
> - page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
>  }
>  
>  static inline bool folio_is_device_private(const struct folio *folio)
> @@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page 
> *page)
>  {
>   return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
>   is_zone_device_page(page) &&
> - page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>  
>  static inline bool is_device_coherent_page(const struct page

Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one

2024-09-21 Thread Dan Williams
Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
> 
> Signed-off-by: Alistair Popple 
> ---
>  drivers/pci/p2pdma.c |  6 ++
>  mm/memremap.c| 17 +
>  mm/mm_init.c | 22 ++
>  3 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..210b9f4 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct 
> kobject *kobj,
>   }
>  
>   /*
> +  * Initialise the refcount for the freshly allocated page. As we have
> +  * just allocated the page no one else should be using it.
> +  */
> + set_page_count(virt_to_page(kaddr), 1);

Perhaps add a VM_WARN_ONCE() to back up that assumption?

I also notice that there are multiple virt_to_page() lookups in this
routine, so maybe time for a local @page variable.
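
Something along these lines, perhaps (completely untested):

	struct page *page = virt_to_page(kaddr);

	/*
	 * The page was just allocated from the pool, so nothing else
	 * should hold a reference to it yet; warn if that assumption
	 * ever turns out to be wrong.
	 */
	VM_WARN_ONCE(page_ref_count(page),
		     "freshly allocated p2pdma page already referenced\n");
	set_page_count(page, 1);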

> +
> + /*
>* vm_insert_page() can sleep, so a reference is taken to mapping
>* such that rcu_read_unlock() can be done before inserting the
>* pages
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 40d4547..07bbe0e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
>   folio->mapping = NULL;
>   folio->page.pgmap->ops->page_free(folio_page(folio, 0));
>  
> - if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
> - folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
> + switch (folio->page.pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
> + put_dev_pagemap(folio->page.pgmap);
> + break;
> +
> + case MEMORY_DEVICE_FS_DAX:
> + case MEMORY_DEVICE_GENERIC:
>   /*
>* Reset the refcount to 1 to prepare for handing out the page
>* again.
>*/
>   folio_set_count(folio, 1);
> - else
> - put_dev_pagemap(folio->page.pgmap);
> + break;
> +
> + case MEMORY_DEVICE_PCI_P2PDMA:
> + break;

A follow on cleanup is that either all implementations should be
put_dev_pagemap(), or none of them. Put the onus on the implementation
to track how many pages it has handed out in the implementation
allocator.

> + }
>  }
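
For example, if every implementation took its own pgmap reference in
its page allocator, the switch above could collapse to something like
this (untested, direction only):

	switch (folio->page.pgmap->type) {
	case MEMORY_DEVICE_FS_DAX:
	case MEMORY_DEVICE_GENERIC:
		/*
		 * Reset the refcount to 1 to prepare for handing out
		 * the page again.
		 */
		folio_set_count(folio, 1);
		break;
	default:
		break;
	}
	put_dev_pagemap(folio->page.pgmap);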
>  
>  void zone_device_page_init(struct page *page)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 4ba5607..0489820 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1015,12 +1015,26 @@ static void __ref __init_zone_device_page(struct page 
> *page, unsigned long pfn,
>   }
>  
>   /*
> -  * ZONE_DEVICE pages are released directly to the driver page allocator
> -  * which will set the page count to 1 when allocating the page.
> +  * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
> +  * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
> +  * allocator which will set the page count to 1 when allocating the
> +  * page.
> +  *
> +  * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> +  * their refcount reset to one whenever they are freed (ie. after
> +  * their refcount drops to 0).

I'll send some follow-on patches to clean up device-dax.

For this one:

Reviewed-by: Dan Williams 



Re: [PATCH 03/12] fs/dax: Refactor wait for dax idle page

2024-09-21 Thread Dan Williams
Alistair Popple wrote:
> A FS DAX page is considered idle when its refcount drops to one. This
> is currently open-coded in all file systems supporting FS DAX. Move
> the idle detection to a common function to make future changes easier.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Jan Kara 
> Reviewed-by: Christoph Hellwig 

Looks good to me:

Reviewed-by: Dan Williams 



Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages

2024-09-21 Thread Dan Williams
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
> and take references as it would for a normally mapped page.
> 
> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> simply inserts a special devmap PUD entry into the page table without
> holding a reference to the page for the mapping.

This is missing some description, or a comment in the code, about how
the two differ. More questions below:
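
On the description point, even a kernel-doc blurb on dax_insert_pfn_pud()
spelling out the contrast with vmf_insert_pfn_pud() would go a long way;
a strawman (wording only):

/**
 * dax_insert_pfn_pud - map a PUD-sized folio like a normal page
 * @vmf: fault handle for this mapping attempt
 * @pfn: pfn of the head page of the folio to map
 * @write: whether to create a writable mapping
 *
 * Unlike vmf_insert_pfn_pud(), which installs a special devmap PUD and
 * does not hold a page reference for the mapping, this takes a reference
 * on the folio and accounts the mapping via the rmap, so the folio is
 * refcounted like any other mapped page.
 */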

> Signed-off-by: Alistair Popple 
> ---
>  include/linux/huge_mm.h |  4 ++-
>  include/linux/rmap.h| 15 +++-
>  mm/huge_memory.c| 93 --
>  mm/rmap.c   | 49 ++-
>  4 files changed, 149 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 6370026..d3a1872 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>  
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>  
>  enum transparent_hugepage_flag {
>   TRANSPARENT_HUGEPAGE_UNSUPPORTED,
> @@ -114,6 +115,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  #define HPAGE_PUD_MASK   (~(HPAGE_PUD_SIZE - 1))
>  #define HPAGE_PUD_SIZE   ((1UL) << HPAGE_PUD_SHIFT)
>  
> +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  
>  extern unsigned long transparent_hugepage_flags;
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935..c465694 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
>  enum rmap_level {
>   RMAP_LEVEL_PTE = 0,
>   RMAP_LEVEL_PMD,
> + RMAP_LEVEL_PUD,
>  };
>  
>  static inline void __folio_rmap_sanity_checks(struct folio *folio,
> @@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(struct 
> folio *folio,
>   VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
>   VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
>   break;
> + case RMAP_LEVEL_PUD:
> + /*
> +	 * Assume that we are creating a single "entire" mapping of the
> +  * folio.
> +  */
> + VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
> + VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
> + break;
>   default:
>   VM_WARN_ON_ONCE(true);
>   }
> @@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct 
> page *, int nr_pages,
>   folio_add_file_rmap_ptes(folio, page, 1, vma)
>  void folio_add_file_rmap_pmd(struct folio *, struct page *,
>   struct vm_area_struct *);
> +void folio_add_file_rmap_pud(struct folio *, struct page *,
> + struct vm_area_struct *);
>  void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>   struct vm_area_struct *);
>  #define folio_remove_rmap_pte(folio, page, vma) \
>   folio_remove_rmap_ptes(folio, page, 1, vma)
>  void folio_remove_rmap_pmd(struct folio *, struct page *,
>   struct vm_area_struct *);
> +void folio_remove_rmap_pud(struct folio *, struct page *,
> + struct vm_area_struct *);
>  
>  void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>   unsigned long address, rmap_t flags);
> @@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct 
> folio *folio,
>   atomic_add(orig_nr_pages, &folio->_large_mapcount);
>   break;
>   case RMAP_LEVEL_PMD:
> + case RMAP_LEVEL_PUD:
>   atomic_inc(&folio->_entire_mapcount);
>   atomic_inc(&folio->_large_mapcount);
>   break;
> @@ -437,6 +451,7 @@ static __always_inline int 
> __folio_try_dup_anon_rmap(struct folio *folio,
>   atomic_add(orig_nr_pages, &folio->_large_mapcount);
>   break;
>   case RMAP_LEVEL_PMD:
> + case RMAP_LEVEL_PUD:
>   if (PageAnonExclusive(page)) {
>   if (unlikely(maybe_pinned))
>   return -EBUSY;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4b45ad..e8985a4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1336,21 +1336,19 @@ static void insert_pfn_pud(struct vm_area_struct 
> *vma, unsigned long addr,
>   struct mm_struct *mm = vma->vm_mm;
>   pgprot_t prot = vma->vm_page_prot;
>   pud_t entry;
> - spinlock_t *ptl;
>  
> - ptl = pud

Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn

2024-09-21 Thread Dan Williams
[ add s390 folks to comment on CONFIG_FS_DAX_LIMITED ]

Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
> 
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce dax_insert_pfn. This will take a reference on the
> underlying page much the same as vmf_insert_page, except it also
> permits upgrading an existing mapping to be writable if
> requested/possible.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> Updates from v1:
> 
>  - Re-arrange code in insert_page_into_pte_locked() based on comments
>from Jan Kara.
> 
>  - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> ---
>  include/linux/mm.h |  1 +-
>  mm/memory.c| 83 ++-
>  2 files changed, 76 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b0ff06d..ae6d713 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3463,6 +3463,7 @@ int vm_map_pages(struct vm_area_struct *vma, struct 
> page **pages,
>   unsigned long num);
>  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
>   unsigned long num);
> +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write);
>  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
>  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long 
> addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index d2785fb..368e15d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2039,19 +2039,47 @@ static int validate_page_before_insert(struct 
> vm_area_struct *vma,
>  }
>  
>  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t 
> *pte,
> - unsigned long addr, struct page *page, pgprot_t prot)
> + unsigned long addr, struct page *page,
> + pgprot_t prot, bool mkwrite)

This upgrade of insert_page_into_pte_locked() to handle write faults
deserves to be its own patch with rationale along the lines of:

"In preparation for using insert_page() for DAX, enhance
insert_page_into_pte_locked() to handle establishing writable mappings.
Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE, which
bypasses the typical set_pte_range() in finish_fault()."


>  {
>   struct folio *folio = page_folio(page);
> + pte_t entry = ptep_get(pte);
>   pte_t pteval;
>  
> - if (!pte_none(ptep_get(pte)))
> - return -EBUSY;
> + if (!pte_none(entry)) {
> + if (!mkwrite)
> + return -EBUSY;
> +
> + /*
> +  * For read faults on private mappings the PFN passed in may not
> +  * match the PFN we have mapped if the mapped PFN is a writeable
> +  * COW page.  In the mkwrite case we are creating a writable PTE
> +  * for a shared mapping and we expect the PFNs to match. If they
> +  * don't match, we are likely racing with block allocation and
> +  * mapping invalidation so just skip the update.
> +  */
> + if (pte_pfn(entry) != page_to_pfn(page)) {
> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
> + return -EFAULT;
> + }
> + entry = maybe_mkwrite(entry, vma);
> + entry = pte_mkyoung(entry);
> + if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> + update_mmu_cache(vma, addr, pte);

I was going to say that this should be creating a shared helper with
insert_pfn(), but on closer look the mkwrite case in insert_pfn() is now
dead code (almost, *grumbles about dcssblk*). So I would just mention
that in the changelog for this standalone patch and then we can follow
on with a cleanup like the patch at the bottom of this mail (untested).

> + return 0;
> + }
> +
>   /* Ok, finally just insert the thing.. */
>   pteval = mk_pte(page, prot);
>   if (unlikely(is_zero_folio(folio))) {
>   pteval = pte_mkspecial(pteval);
>   } else {
>   folio_get(folio);
> + entry = mk_pte(page, prot);
> + if (mkwrite) {
> + entry = pte_mkyoung(entry);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + }
>   inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
>   folio_add_file_rmap_pte(folio, page, vma);
>   }
> @@ -2060,7 +2088,7 @@ static int insert_page_into_pte_locked(struct 
> vm_area_struct *vma, pte_t *pte,
>  }
>  
>  static int in