Re: [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page
Alistair Popple wrote: > PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the > check in __gup_device_huge() is redundant. Remove it > > Signed-off-by: Alistair Popple > Reviewed-by: Jason Gunthorpe > Acked-by: David Hildenbrand > --- > mm/gup.c | 5 - > 1 file changed, 5 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index d19884e..5d2fc9a 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2954,11 +2954,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, > unsigned long addr, > break; > } > > - if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) { > - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); > - break; > - } > - Looks good. Reviewed-by: Dan Williams
Re: [PATCH 04/12] mm: Allow compound zone device pages
Alistair Popple wrote: > Zone device pages are used to represent various type of device memory > managed by device drivers. Currently compound zone device pages are > not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only > user of higher order zone device pages and have their own page > reference counting. > > A future change will unify FS DAX reference counting with normal page > reference counting rules and remove the special FS DAX reference > counting. Supporting that requires compound zone device pages. > > Supporting compound zone device pages requires compound_head() to > distinguish between head and tail pages whilst still preserving the > special struct page fields that are specific to zone device pages. > > A tail page is distinguished by having bit zero being set in > page->compound_head, with the remaining bits pointing to the head > page. For zone device pages page->compound_head is shared with > page->pgmap. > > The page->pgmap field is common to all pages within a memory section. > Therefore pgmap is the same for both head and tail pages and can be > moved into the folio and we can use the standard scheme to find > compound_head from a tail page. > > Signed-off-by: Alistair Popple > Reviewed-by: Jason Gunthorpe > > --- > > Changes since v1: > > - Move pgmap to the folio as suggested by Matthew Wilcox > --- > drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++- > drivers/pci/p2pdma.c | 6 +++--- > include/linux/memremap.h | 6 +++--- > include/linux/migrate.h| 4 ++-- > include/linux/mm_types.h | 9 +++-- > include/linux/mmzone.h | 8 +++- > lib/test_hmm.c | 3 ++- > mm/hmm.c | 2 +- > mm/memory.c| 4 +++- > mm/memremap.c | 14 +++--- > mm/migrate_device.c| 7 +-- > mm/mm_init.c | 2 +- > 12 files changed, 43 insertions(+), 25 deletions(-) > > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c > b/drivers/gpu/drm/nouveau/nouveau_dmem.c > index 6fb65b0..58d308c 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c > @@ -88,7 +88,8 @@ struct nouveau_dmem { > > static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page) > { > - return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap); > + return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, page_dev_pagemap() feels like a mouthful. I would be ok with page_pgmap() since that is the most common identifier for struct dev_pagemap instances.
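Roughly this, i.e. an untested sketch of the accessor I have in mind, assuming the pgmap pointer ends up in struct folio as the changelog describes (not copied from the patch):

static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
	/* Only ZONE_DEVICE pages carry a pgmap. */
	VM_WARN_ON_ONCE(!is_zone_device_page(page));
	return page_folio(page)->pgmap;
}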
> + pagemap); > } > > static struct nouveau_drm *page_to_drm(struct page *page) > diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c > index 210b9f4..a58f2c1 100644 > --- a/drivers/pci/p2pdma.c > +++ b/drivers/pci/p2pdma.c > @@ -199,7 +199,7 @@ static const struct attribute_group p2pmem_group = { > > static void p2pdma_page_free(struct page *page) > { > - struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap); > + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page)); > /* safe to dereference while a reference is held to the percpu ref */ > struct pci_p2pdma *p2pdma = > rcu_dereference_protected(pgmap->provider->p2pdma, 1); > @@ -1022,8 +1022,8 @@ enum pci_p2pdma_map_type > pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device > *dev, > struct scatterlist *sg) > { > - if (state->pgmap != sg_page(sg)->pgmap) { > - state->pgmap = sg_page(sg)->pgmap; > + if (state->pgmap != page_dev_pagemap(sg_page(sg))) { > + state->pgmap = page_dev_pagemap(sg_page(sg)); > state->map = pci_p2pdma_map_type(state->pgmap, dev); > state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; > } > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 3f7143a..14273e6 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct > page *page) > { > return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && > is_zone_device_page(page) && > - page->pgmap->type == MEMORY_DEVICE_PRIVATE; > + page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE; > } > > static inline bool folio_is_device_private(const struct folio *folio) > @@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page > *page) > { > return IS_ENABLED(CONFIG_PCI_P2PDMA) && > is_zone_device_page(page) && > - page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; > + page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA; > } > > static inline bool is_device_coherent_page(const struct page
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
Alistair Popple wrote: > The reference counts for ZONE_DEVICE private pages should be > initialised by the driver when the page is actually allocated by the > driver allocator, not when they are first created. This is currently > the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages > but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up. > > Signed-off-by: Alistair Popple > --- > drivers/pci/p2pdma.c | 6 ++ > mm/memremap.c| 17 + > mm/mm_init.c | 22 ++ > 3 files changed, 37 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c > index 4f47a13..210b9f4 100644 > --- a/drivers/pci/p2pdma.c > +++ b/drivers/pci/p2pdma.c > @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct > kobject *kobj, > } > > /* > + * Initialise the refcount for the freshly allocated page. As we have > + * just allocated the page no one else should be using it. > + */ > + set_page_count(virt_to_page(kaddr), 1); Perhaps VM_WARN_ONCE to back up that assumption? I also notice that there are multiple virt_to_page() lookups in this routine, so maybe time for a local @page variable. > + > + /* >* vm_insert_page() can sleep, so a reference is taken to mapping >* such that rcu_read_unlock() can be done before inserting the >* pages > diff --git a/mm/memremap.c b/mm/memremap.c > index 40d4547..07bbe0e 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio) > folio->mapping = NULL; > folio->page.pgmap->ops->page_free(folio_page(folio, 0)); > > - if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE && > - folio->page.pgmap->type != MEMORY_DEVICE_COHERENT) > + switch (folio->page.pgmap->type) { > + case MEMORY_DEVICE_PRIVATE: > + case MEMORY_DEVICE_COHERENT: > + put_dev_pagemap(folio->page.pgmap); > + break; > + > + case MEMORY_DEVICE_FS_DAX: > + case MEMORY_DEVICE_GENERIC: > /* >* Reset the refcount to 1 to prepare for handing out the page >* again. >*/ > folio_set_count(folio, 1); > - else > - put_dev_pagemap(folio->page.pgmap); > + break; > + > + case MEMORY_DEVICE_PCI_P2PDMA: > + break; A follow on cleanup is that either all implementations should be put_dev_pagemap(), or none of them. Put the onus on the implementation to track how many pages it has handed out in the implementation allocator. > + } > } > > void zone_device_page_init(struct page *page) > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 4ba5607..0489820 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -1015,12 +1015,26 @@ static void __ref __init_zone_device_page(struct page > *page, unsigned long pfn, > } > > /* > - * ZONE_DEVICE pages are released directly to the driver page allocator > - * which will set the page count to 1 when allocating the page. > + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and > + * MEMORY_TYPE_FS_DAX pages are released directly to the driver page > + * allocator which will set the page count to 1 when allocating the > + * page. > + * > + * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have > + * their refcount reset to one whenever they are freed (ie. after > + * their refcount drops to 0). I'll send some follow on patches to clean up device-dax. For this one: Reviewed-by: Dan Williams
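Going back to the set_page_count() hunk above, an untested sketch of what I mean, with the lookup hoisted into a local @page; the helper name is made up, only virt_to_page()/set_page_count() come from the patch:

static void p2p_page_init_refcount(void *kaddr)
{
	struct page *page = virt_to_page(kaddr);

	/* Freshly allocated, so nobody else should hold a reference yet. */
	VM_WARN_ONCE(page_count(page),
		     "%s: newly allocated p2pdma page has users\n", __func__);
	set_page_count(page, 1);
}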
Re: [PATCH 03/12] fs/dax: Refactor wait for dax idle page
Alistair Popple wrote: > A FS DAX page is considered idle when its refcount drops to one. This > is currently open-coded in all file systems supporting FS DAX. Move > the idle detection to a common function to make future changes easier. > > Signed-off-by: Alistair Popple > Reviewed-by: Jan Kara > Reviewed-by: Christoph Hellwig Looks good to me: Reviewed-by: Dan Williams
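To spell out the invariant being centralised here: a FS DAX page counts as idle once its refcount is back down to one, i.e. no GUP/DMA references remain. Conceptually (the name is made up, this is not the helper the patch adds):

static inline bool fs_dax_page_idle(const struct page *page)
{
	/* A single remaining reference means no outstanding users. */
	return page_ref_count(page) == 1;
}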
Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages
Alistair Popple wrote: > Currently DAX folio/page reference counts are managed differently to > normal pages. To allow these to be managed the same as normal pages > introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio > and take references as it would for a normally mapped page. > > This is distinct from the current mechanism, vmf_insert_pfn_pud, which > simply inserts a special devmap PUD entry into the page table without > holding a reference to the page for the mapping. This is missing some description or comment in the code about the differences. More questions below: > Signed-off-by: Alistair Popple > --- > include/linux/huge_mm.h | 4 ++- > include/linux/rmap.h| 15 +++- > mm/huge_memory.c| 93 -- > mm/rmap.c | 49 ++- > 4 files changed, 149 insertions(+), 12 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 6370026..d3a1872 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, > > vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); > vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); > +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); > > enum transparent_hugepage_flag { > TRANSPARENT_HUGEPAGE_UNSUPPORTED, > @@ -114,6 +115,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; > #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1)) > #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT) > > +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT) > +#define HPAGE_PUD_NR (1< + > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > extern unsigned long transparent_hugepage_flags; > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 91b5935..c465694 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t; > enum rmap_level { > RMAP_LEVEL_PTE = 0, > RMAP_LEVEL_PMD, > + RMAP_LEVEL_PUD, > }; > > static inline void __folio_rmap_sanity_checks(struct folio *folio, > @@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(struct > folio *folio, > VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio); > VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio); > break; > + case RMAP_LEVEL_PUD: > + /* > + * Asume that we are creating * a single "entire" mapping of the > + * folio. 
> + */ > + VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio); > + VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio); > + break; > default: > VM_WARN_ON_ONCE(true); > } > @@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct > page *, int nr_pages, > folio_add_file_rmap_ptes(folio, page, 1, vma) > void folio_add_file_rmap_pmd(struct folio *, struct page *, > struct vm_area_struct *); > +void folio_add_file_rmap_pud(struct folio *, struct page *, > + struct vm_area_struct *); > void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages, > struct vm_area_struct *); > #define folio_remove_rmap_pte(folio, page, vma) \ > folio_remove_rmap_ptes(folio, page, 1, vma) > void folio_remove_rmap_pmd(struct folio *, struct page *, > struct vm_area_struct *); > +void folio_remove_rmap_pud(struct folio *, struct page *, > + struct vm_area_struct *); > > void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *, > unsigned long address, rmap_t flags); > @@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct > folio *folio, > atomic_add(orig_nr_pages, &folio->_large_mapcount); > break; > case RMAP_LEVEL_PMD: > + case RMAP_LEVEL_PUD: > atomic_inc(&folio->_entire_mapcount); > atomic_inc(&folio->_large_mapcount); > break; > @@ -437,6 +451,7 @@ static __always_inline int > __folio_try_dup_anon_rmap(struct folio *folio, > atomic_add(orig_nr_pages, &folio->_large_mapcount); > break; > case RMAP_LEVEL_PMD: > + case RMAP_LEVEL_PUD: > if (PageAnonExclusive(page)) { > if (unlikely(maybe_pinned)) > return -EBUSY; > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c4b45ad..e8985a4 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1336,21 +1336,19 @@ static void insert_pfn_pud(struct vm_area_struct > *vma, unsigned long addr, > struct mm_struct *mm = vma->vm_mm; > pgprot_t prot = vma->vm_page_prot; > pud_t entry; > - spinlock_t *ptl; > > - ptl = pud
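To state my reading of the difference as code, an untested sketch rather than the patch itself: the dax path should account the mapping like a normal file-backed page, whereas vmf_insert_pfn_pud() only installs a special devmap entry and never takes a folio reference. The wrapper below is made up; folio_add_file_rmap_pud() and HPAGE_PUD_NR are from the hunks above:

static void dax_pud_account_sketch(struct vm_area_struct *vma,
				   unsigned long addr, pud_t *pud,
				   struct folio *folio, pfn_t pfn, bool write)
{
	pud_t entry = pud_mkhuge(pfn_t_pud(pfn, vma->vm_page_prot));

	if (write)
		/* the real code would also do the maybe-mkwrite handling */
		entry = pud_mkyoung(pud_mkdirty(entry));

	/* Take a reference and rmap the folio, as for any file-backed PUD. */
	folio_get(folio);
	folio_add_file_rmap_pud(folio, &folio->page, vma);
	add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PUD_NR);
	set_pud_at(vma->vm_mm, addr, pud, entry);
}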
Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn
[ add s390 folks to comment on CONFIG_FS_DAX_LIMITED ] Alistair Popple wrote: > Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This > creates a special devmap PTE entry for the pfn but does not take a > reference on the underlying struct page for the mapping. This is > because DAX page refcounts are treated specially, as indicated by the > presence of a devmap entry. > > To allow DAX page refcounts to be managed the same as normal page > refcounts introduce dax_insert_pfn. This will take a reference on the > underlying page much the same as vmf_insert_page, except it also > permits upgrading an existing mapping to be writable if > requested/possible. > > Signed-off-by: Alistair Popple > > --- > > Updates from v1: > > - Re-arrange code in insert_page_into_pte_locked() based on comments >from Jan Kara. > > - Call mkdrity/mkyoung for the mkwrite case, also suggested by Jan. > --- > include/linux/mm.h | 1 +- > mm/memory.c| 83 ++- > 2 files changed, 76 insertions(+), 8 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index b0ff06d..ae6d713 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3463,6 +3463,7 @@ int vm_map_pages(struct vm_area_struct *vma, struct > page **pages, > unsigned long num); > int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, > unsigned long num); > +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write); > vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, > unsigned long pfn); > vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long > addr, > diff --git a/mm/memory.c b/mm/memory.c > index d2785fb..368e15d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2039,19 +2039,47 @@ static int validate_page_before_insert(struct > vm_area_struct *vma, > } > > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t > *pte, > - unsigned long addr, struct page *page, pgprot_t prot) > + unsigned long addr, struct page *page, > + pgprot_t prot, bool mkwrite) This upgrade of insert_page_into_pte_locked() to handle write faults deserves to be its own patch with rationale along the lines of: "In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault." > { > struct folio *folio = page_folio(page); > + pte_t entry = ptep_get(pte); > pte_t pteval; > > - if (!pte_none(ptep_get(pte))) > - return -EBUSY; > + if (!pte_none(entry)) { > + if (!mkwrite) > + return -EBUSY; > + > + /* > + * For read faults on private mappings the PFN passed in may not > + * match the PFN we have mapped if the mapped PFN is a writeable > + * COW page. In the mkwrite case we are creating a writable PTE > + * for a shared mapping and we expect the PFNs to match. If they > + * don't match, we are likely racing with block allocation and > + * mapping invalidation so just skip the update. > + */ > + if (pte_pfn(entry) != page_to_pfn(page)) { > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); > + return -EFAULT; > + } > + entry = maybe_mkwrite(entry, vma); > + entry = pte_mkyoung(entry); > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > + update_mmu_cache(vma, addr, pte); I was going to say that this should be creating a shared helper with insert_pfn(), but on closer look the mkwrite case in insert_pfn() is now dead code (almost, *grumbles about dcssblk*). 
So I would just mention that in the changelog for this standalone patch and then we can follow on with a cleanup like the patch at the bottom of this mail (untested). > + return 0; > + } > + > /* Ok, finally just insert the thing.. */ > pteval = mk_pte(page, prot); > if (unlikely(is_zero_folio(folio))) { > pteval = pte_mkspecial(pteval); > } else { > folio_get(folio); > + entry = mk_pte(page, prot); > + if (mkwrite) { > + entry = pte_mkyoung(entry); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + } > inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); > folio_add_file_rmap_pte(folio, page, vma); > } > @@ -2060,7 +2088,7 @@ static int insert_page_into_pte_locked(struct > vm_area_struct *vma, pte_t *pte, > } > > static int in
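This is not the cleanup patch mentioned above (that is omitted here); just an untested sketch of how a fault handler is expected to consume the new entry point. The surrounding function is made up, only dax_insert_pfn() comes from this patch:

static vm_fault_t dax_fault_example(struct vm_fault *vmf, pfn_t pfn)
{
	bool write = vmf->flags & FAULT_FLAG_WRITE;

	/*
	 * Unlike vmf_insert_pfn(), this takes a reference on the page
	 * backing @pfn and can upgrade an existing read-only PTE to
	 * writable when @write is set.
	 */
	return dax_insert_pfn(vmf, pfn, write);
}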