Re: [PATCH v2 05/63] stddef: Introduce struct_group() helper macro
> memcpy(&dst.thing, &src.thing, length); > do_something(dst.three); > > There are some rare cases where the resulting struct_group() needs > attributes added, so struct_group_attr() is also introduced to allow > for specifying struct attributes (e.g. __align(x) or __packed). > Additionally, there are places where such declarations would like to > have the struct be typed, so struct_group_typed() is added. > > Given there is a need for a handful of UAPI uses too, the underlying > __struct_group() macro has been defined in UAPI so it can be used there > too. > > Co-developed-by: Keith Packard > Signed-off-by: Keith Packard > Signed-off-by: Kees Cook > Acked-by: Gustavo A. R. Silva > Link: https://lore.kernel.org/lkml/20210728023217.GC35706@embeddedor > Enhanced-by: Rasmus Villemoes > Link: > https://lore.kernel.org/lkml/41183a98-bdb9-4ad6-7eab-5a7292a6d...@rasmusvillemoes.dk > Enhanced-by: Dan Williams Acked-by: Dan Williams
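For reference, a trimmed sketch of the underlying helper as it was merged into include/uapi/linux/stddef.h, followed by an illustrative use; the struct and member names in the example are hypothetical, not from the patch:

#define __struct_group(TAG, NAME, ATTRS, MEMBERS...) \
	union { \
		struct { MEMBERS } ATTRS; \
		struct TAG { MEMBERS } ATTRS NAME; \
	}

/* Hypothetical example: "header" can be addressed as one object, so a
 * FORTIFY_SOURCE-enabled memcpy() can bounds-check the whole group:
 *
 *	memcpy(&dst->header, &src->header, sizeof(dst->header));
 */
struct packet {
	u8 type;
	struct_group(header,
		u16 len;
		u16 flags;
	);
	u8 payload[];
};

The union keeps the memory layout identical whether the members are referenced individually or through the group name.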
Re: [PATCH v2 06/63] cxl/core: Replace unions with struct_group()
On Tue, Aug 17, 2021 at 11:06 PM Kees Cook wrote: > > Use the newly introduced struct_group_typed() macro to clean up the > declaration of struct cxl_regs. > > Cc: Alison Schofield > Cc: Vishal Verma > Cc: Ira Weiny > Cc: Ben Widawsky > Cc: linux-...@vger.kernel.org > Suggested-by: Dan Williams Looks good and tests ok here: Reviewed-by: Dan Williams
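For context, a hedged sketch of the shape of this cleanup (the member list is abbreviated, not the full struct cxl_regs; the helper was later renamed struct_group_tagged() upstream):

/* Before (shape illustrative): members mirrored via an anonymous union so
 * the register pointers can be used individually or as a block. */
struct cxl_regs {
	union {
		struct {
			void __iomem *status;
			void __iomem *mbox;
			void __iomem *memdev;
		};
		struct cxl_device_regs device_regs;
	};
};

/* After: one member list; "struct cxl_device_regs" remains a real type. */
struct cxl_regs {
	struct_group_typed(cxl_device_regs, device_regs,
		void __iomem *status;
		void __iomem *mbox;
		void __iomem *memdev;
	);
};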
Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas wrote: > > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote: > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs > > files, and the old proc interface. Two check against > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM, > > this starts to matter, since we don't want random userspace having > > access to PCI BARs while a driver is loaded and using it. > > > > Fix this by adding the same iomem_is_exclusive() check we already have > > on the sysfs side in pci_mmap_resource(). > > > > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges") > > Signed-off-by: Daniel Vetter > > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently > only used in a few places: > > e1000_probe() calls pci_request_selected_regions_exclusive(), > ne_pci_probe() calls pci_request_regions_exclusive(), > vmbus_allocate_mmio() calls request_mem_region_exclusive() > > which raises the question of whether it's worth keeping > IORESOURCE_EXCLUSIVE at all. I'm totally fine with removing it > completely. Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd be in favor of removing it as well.
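The check being added, sketched (the helper name is hypothetical; the test mirrors what pci_mmap_resource() already does on the sysfs side, applied once the requested BAR index i has been validated):

static bool pci_bar_is_exclusive(struct pci_dev *dev, int bar)
{
	struct resource *res = &dev->resource[bar];

	return (res->flags & IORESOURCE_MEM) && iomem_is_exclusive(res->start);
}

/* in proc_bus_pci_mmap():
 *
 *	if (pci_bar_is_exclusive(dev, i))
 *		return -EINVAL;
 */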
Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
On Wed, Nov 4, 2020 at 8:50 AM Bjorn Helgaas wrote: > > On Wed, Nov 04, 2020 at 09:44:04AM +0100, Daniel Vetter wrote: > > On Tue, Nov 3, 2020 at 11:09 PM Dan Williams > > wrote: > > > On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas wrote: > > > > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote: > > > > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs > > > > > files, and the old proc interface. Two check against > > > > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM, > > > > > this starts to matter, since we don't want random userspace having > > > > > access to PCI BARs while a driver is loaded and using it. > > > > > > > > > > Fix this by adding the same iomem_is_exclusive() check we already have > > > > > on the sysfs side in pci_mmap_resource(). > > > > > > > > > > References: 90a545e98126 ("restrict /dev/mem to idle io memory > > > > > ranges") > > > > > Signed-off-by: Daniel Vetter > > > > > > > > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently > > > > only used in a few places: > > > > > > > > e1000_probe() calls pci_request_selected_regions_exclusive(), > > > > ne_pci_probe() calls pci_request_regions_exclusive(), > > > > vmbus_allocate_mmio() calls request_mem_region_exclusive() > > > > > > > > which raises the question of whether it's worth keeping > > > > IORESOURCE_EXCLUSIVE at all. I'm totally fine with removing it > > > > completely. > > > > > > Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to > > > IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd > > > be in favor of removing it as well. > > > > Still has some value since it enforces exclusive access even if the > > config isn't enabled, and iirc e1000 had some fun with userspace tools > > clobbering the firmware and bricking the chip. > > There's *some* value; I'm just skeptical since only three drivers use > it. > > IORESOURCE_EXCLUSIVE is from e8de1481fd71 ("resource: allow MMIO > exclusivity for device drivers"), and the commit message says this is > only active when CONFIG_STRICT_DEVMEM is set. I didn't check to see > whether that's still true. > > That commit adds a bunch of wrappers and "__"-prefixed functions to > pass the IORESOURCE_EXCLUSIVE flag around. That's a fair bit of > uglification for three drivers. > > > Another thing I kinda wondered, since pci maintainer is here: At least > > in drivers/gpu I see very few drivers explicitly requestion regions > > (this might be a historical artifact due to the shadow attach stuff > > before we had real modesetting drivers). And pci core doesn't do that > > either, even when a driver is bound. Is this intentional, or > > should/could we do better? Since drivers work happily without > > reserving regions I don't think "the drivers need to remember to do > > this" will ever really work out well. > > You're right, many drivers don't call pci_request_regions(). Maybe we > could do better, but I haven't looked into that recently. There is a > related note in Documentation/PCI/pci.rst that's been there for a long > time (it refers to "pci_request_resources()", which has never existed > AFAICT). I'm certainly open to proposals. It seems a bug that the kernel permits MMIO regions with side effects to be ioremap()'ed without request_mem_region() on the resource. I wonder how much log spam would happen if ioremap() reported whenever a non-IORESOURCE_BUSY range was passed to it? 
The current state of affairs, trusting *remap users to have claimed their remap target, seems too ingrained to unwind now.
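For reference, the claim-then-map discipline the thread is circling: a driver is expected to mark its MMIO range busy before remapping it, so conflicts show up in /proc/iomem, even though ioremap() itself never enforces the ordering. A minimal sketch with a hypothetical driver name:

#include <linux/io.h>
#include <linux/ioport.h>

static void __iomem *mydrv_map_bar(struct resource *res)
{
	resource_size_t size = resource_size(res);
	void __iomem *regs;

	/* Claim first: sets IORESOURCE_BUSY, fails if another owner exists. */
	if (!request_mem_region(res->start, size, "mydrv"))
		return NULL;

	/* ioremap() would have happily succeeded without the claim above. */
	regs = ioremap(res->start, size);
	if (!regs)
		release_mem_region(res->start, size);
	return regs;
}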
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Thu, Sep 2, 2021 at 1:18 AM Christoph Hellwig wrote: > > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: > > >>> It looks like I'm totally misunderstanding what you are adding here > > >>> then. Why do we need any special treatment at all for memory that > > >>> has normal struct pages and is part of the direct kernel map? > > >> The pages are like normal memory for purposes of mapping them in CPU > > >> page tables and for coherent access from the CPU. > > > That's the user page tables. What about the kernel direct map? > > > If there is a normal kernel struct page backing there really should > > > be no need for the pgmap. > > > > I'm not sure. The physical address ranges are in the UEFI system address > > map as special-purpose memory. Does Linux create the struct pages and > > kernel direct map for that without a pgmap call? I didn't see that last > > time I went digging through that code. > > So doing some googling finds a patch from Dan that claims to hand EFI > special purpose memory to the device dax driver. But when I try to > follow the version that got merged it looks it is treated simply as an > MMIO region to be claimed by drivers, which would not get a struct page. > > Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works? The original implementation of "soft reserve" support depended on the combination of the EFI special purpose memory type and the ACPI HMAT to define the device ranges. The requirement for ACPI HMAT was relaxed later with commit: 5ccac54f3e12 ACPI: HMAT: attach a device for each soft-reserved range The expectation is that system software policy can then either use the device interface, assign a portion of the reservation back to the page allocator, or ignore the reservation altogether. Is this discussion asking for a way to assign this memory to the GPU driver to manage? device-dax already knows how to hand off to the page allocator, so it seems reasonable for it to be able to hand off to another driver.
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 14, 2021 at 11:45 AM Matthew Wilcox wrote: > > > It would probably help if you cc'd Dan on this. Thanks. [..] > > On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: > > > From: Ralph Campbell > > > > > > ZONE_DEVICE struct pages have an extra reference count that complicates > > > the > > > code for put_page() and several places in the kernel that need to check > > > the > > > reference count to see that a page is not being used (gup, compaction, > > > migration, etc.). Clean up the code so the reference count doesn't need to > > > be treated specially for ZONE_DEVICE. > > > > > > Signed-off-by: Ralph Campbell > > > Signed-off-by: Alex Sierra > > > Reviewed-by: Christoph Hellwig > > > --- > > > v2: > > > AS: merged this patch in linux 5.11 version > > > > > > v5: > > > AS: add condition at try_grab_page to check for the zone device type, > > > while > > > page ref counter is checked less/equal to zero. In case of device zone, > > > pages > > > ref counter are initialized to zero. > > > > > > v7: > > > AS: fix condition at try_grab_page added at v5, is invalid. It supposed > > > to fix xfstests/generic/413 test, however, there's a known issue on > > > this test where DAX mapped area DIO to non-DAX expect to fail. > > > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com > > > This condition was removed after rebase over patch series > > > https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com > > > --- > > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- > > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- > > > fs/dax.c | 4 +- > > > include/linux/dax.h| 2 +- > > > include/linux/memremap.h | 7 +-- > > > include/linux/mm.h | 11 > > > lib/test_hmm.c | 2 +- > > > mm/internal.h | 8 +++ > > > mm/memcontrol.c| 6 +-- > > > mm/memremap.c | 69 +++--- > > > mm/migrate.c | 5 -- > > > mm/page_alloc.c| 3 ++ > > > mm/swap.c | 45 ++--- > > > 13 files changed, 46 insertions(+), 120 deletions(-) > > > > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax > > backed memory still work? > > > > What refcount value does the struct pages have when they are installed > > in the PTEs? Remember a 0 refcount will make all the get_user_pages() > > fail. > > > > I'm looking at the call path starting in ext4_punch_hole() and I would > > expect to see something manipulating the page ref count before > > the ext4_break_layouts() call path gets to the dax_page_unused() test. > > > > All I see is we go into unmap_mapping_pages() - that would normally > > put back the page references held by PTEs but insert_pfn() has this: > > > > if (pfn_t_devmap(pfn)) > > entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); > > > > And: > > > > static inline pte_t pte_mkdevmap(pte_t pte) > > { > > return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); > > } > > > > Which interacts with vm_normal_page(): > > > > if (pte_devmap(pte)) > > return NULL; > > > > To disable that refcounting? > > > > So... I have a feeling this will have PTEs pointing to 0 refcount > > pages? Unless FSDAX is !pte_devmap which is not the case, right? > > > > This seems further confirmed by this comment: > > > > /* > >* If we race get_user_pages_fast() here either we'll see the > >* elevated page count in the iteration and wait, or > >* get_user_pages_fast() will see that the page it took a reference > >* against is no longer mapped in the page tables and bail to the > >* get_user_pages() slow path. 
The slow path is protected by > >* pte_lock() and pmd_lock(). New references are not taken without > >* holding those locks, and unmap_mapping_pages() will not zero the > >* pte or pmd without holding the respective lock, so we are > >* guaranteed to either see new references or prevent new > >* references from being established. > >*/ > > > > Which seems to explain this scheme relies on unmap_mapping_pages() to > > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should > > stop. > > > > This seems like it would be properly fixed by using normal page > > refcounting for PTEs - ie stop using special for these pages? > > > > Does anyone know why devmap is pte_special anyhow? It does not need to be special as mentioned here: https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/ The refcount dependencies also go away after this... https://lore.kernel.org/all/161604050866.1463742.7759521510383551
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe wrote: > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > > Does anyone know why devmap is pte_special anyhow? > > > > It does not need to be special as mentioned here: > > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/ > > I added a remark there > > Not special means more to me, it means devmap should do the refcounts > properly like normal memory pages. > > It means vm_normal_page should return !NULL and it means insert_page, > not insert_pfn should be used to install them in the PTE. VMAs should > not be MIXED MAP, but normal struct page maps. > > I think this change alone would fix all the refcount problems > everwhere in DAX and devmap. > > > The refcount dependencies also go away after this... > > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com/ > > > > ...but you can see that patches 1 and 2 in that series depend on being > > able to guarantee that all mappings are invalidated when the undelying > > device that owns the pgmap goes away. > > If I have put everything together right this is because of what I > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and > expecting that to work sanely. > > This means the page map cannot be removed until all the PTEs are fully > flushed, which buggily doesn't happen because of the missing unplug. > > However, this is all because nobody incrd a refcount to represent the > reference in the PTE and since this ment that 0 refcount pages were > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to > unbreak GUP? > > So.. Is there some reason why devmap pages are trying so hard to avoid > sane refcounting??? I wouldn't put it that way. It's more that the original sin of ZONE_DEVICE, which sought to reuse the lru field space by never having a zero refcount, got layered upon and calcified in malignant ways. In the meantime the surrounding infrastructure got decrustified. Work like the 'struct page' cleanup, among other things, made it clearer and clearer over time that the original design choice needed to be fixed. > If the PTE itself holds the refcount (by not being special) then there > is no need for the pagemap stuff in GUP. pagemap already waits for > refs to go to 0 so the missing shootdown during nvdimm unplug will > cause pagemap to block until the address spaces are invalidated. IMHO > this is already better than the current buggy situation of allowing > continued PTE reference to memory that is now removed from the system. > > > For that to happen there needs to be communication back to the FS for > > device-gone / failure events. That work is in progress via this > > series: > > > > https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.f...@fujitsu.com/ > > This is fine, but I don't think it should block fixing the mm side - > the end result here still cannot be 0 ref count pages installed in > PTEs. > > Fixing that does not depend on shootdown during device removal, right? > > It requires holding refcounts while pages are installed into address > spaces - and this lack is a direct cause of making the PTEs all > special and using insert_pfn and MIXED_MAP. The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but now that we have page-available DAX, yes, we can skip the FS notification and just rely on typical refcounting and hanging until the FS has a chance to uninstall the PTEs. 
You're right, the FS notification is an improvement to the conversion, not a requirement. However, there still needs to be something in the gup-fast path to indicate that GUP_LONGTERM is not possible because the PTE represents a pfn that cannot support typical page-cache behavior for truncate, which is to just disconnect the page from the file and keep the page pinned indefinitely. I think the "no longterm" caveat would be the only remaining utility of PTE_DEVMAP after the above conversion to use typical page refcounts throughout DAX.
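For reference, gup-fast already has that bail-out shape for FOLL_LONGTERM (from commit 7af75561e171), trimmed:

/* mm/gup.c, gup_pte_range(): a devmap PTE cannot satisfy a FOLL_LONGTERM
 * pin in the fast path, so fall back to the slow path, which can consult
 * the VMA and pgmap before deciding. */
if (pte_devmap(pte)) {
	if (unlikely(flags & FOLL_LONGTERM))
		goto pte_unmap;

	pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
	if (unlikely(!pgmap))
		goto pte_unmap;
}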
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Sat, Oct 16, 2021 at 9:39 AM Matthew Wilcox wrote: > > On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote: > > Assuming changing FSDAX is hard.. How would DAX people feel about just > > deleting the PUD/PMD support until it can be done with compound pages? > > I think there are customers who would find that an unacceptable answer :-) No, not given the number of end users that ask for help debugging PMD support.
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Sat, Oct 16, 2021 at 8:45 AM Jason Gunthorpe wrote: > > On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote: > > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe wrote: > > > > > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote: > > > > > > Does anyone know why devmap is pte_special anyhow? > > > > > > > > It does not need to be special as mentioned here: > > > > > > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/ > > > > > > I added a remark there > > > > > > Not special means more to me, it means devmap should do the refcounts > > > properly like normal memory pages. > > > > > > It means vm_normal_page should return !NULL and it means insert_page, > > > not insert_pfn should be used to install them in the PTE. VMAs should > > > not be MIXED MAP, but normal struct page maps. > > > > > > I think this change alone would fix all the refcount problems > > > everwhere in DAX and devmap. > > > > > > > The refcount dependencies also go away after this... > > > > > > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com/ > > > > > > > > ...but you can see that patches 1 and 2 in that series depend on being > > > > able to guarantee that all mappings are invalidated when the undelying > > > > device that owns the pgmap goes away. > > > > > > If I have put everything together right this is because of what I > > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and > > > expecting that to work sanely. > > > > > > This means the page map cannot be removed until all the PTEs are fully > > > flushed, which buggily doesn't happen because of the missing unplug. > > > > > > However, this is all because nobody incrd a refcount to represent the > > > reference in the PTE and since this ment that 0 refcount pages were > > > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to > > > unbreak GUP? > > > > > > So.. Is there some reason why devmap pages are trying so hard to avoid > > > sane refcounting??? > > > > I wouldn't put it that way. It's more that the original sin of > > ZONE_DEVICE that sought to reuse the lru field space, by never having > > a zero recount, then got layered upon and calcified in malignant ways. > > In the meantime surrounding infrastructure got decrustified. Work like > > the 'struct page' cleanup among other things, made it clearer and > > clearer over time that the original design choice needed to be fixed. > > So, there used to be some reason, but with the current code > arrangement it is not the case? This is why it looks so strange when > reading it.. > > AFIACT we are good on the LRU stuff now? Eg release_pages() does not > use page->lru for is_zone_device_page()? Yes, the lru collision was fixed by the 'struct page' cleanup. > > > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but > > now that we have page-available DAX, yes, we can skip the FS > > notification and just rely on typical refcounting and hanging until > > the FS has a chance to uninstall the PTEs. You're right, the FS > > notification is an improvement to the conversion, not a requirement. > > It all sounds so simple. I looked at this for a good long time and the > first major roadblock is huge pages. > > The mm side is designed around THP which puts a single high order page > into the PUD/PMD such that the mm only has to juggle one page. This a > very sane and reasonable thing to do. 
> > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with > THP would make using normal refconting much simpler. I looked at > teaching the mm core to deal with page arrays - it is certainly > doable, but it is quite inefficient and ugly mm code. THP does not support PUD, and neither does FSDAX, so it's only PMDs we need to worry about. > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? > > Joao has a series that does this to device-dax: > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.mart...@oracle.com/ That assumes there's never any need to fracture a huge page which FSDAX could not support unless the filesystem was built with 2MB block size. > TTM is kind of broken already but does have a struct page, and it is > probably already a high order one. Maybe it is OK? I will ask Tho
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Mon, Oct 18, 2021 at 11:26 AM Jason Gunthorpe wrote: > > On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote: > > > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with > > > THP would make using normal refconting much simpler. I looked at > > > teaching the mm core to deal with page arrays - it is certainly > > > doable, but it is quite inefficient and ugly mm code. > > > > THP does not support PUD, and neither does FSDAX, so it's only PMDs we > > need to worry about. > > device-dax uses PUD, along with TTM, they are the only places. I'm not > sure TTM is a real place though. I was setting device-dax aside because it can use Joao's changes to get compound-page support. > > > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find? > > > > > > Joao has a series that does this to device-dax: > > > > > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.mart...@oracle.com/ > > > > That assumes there's never any need to fracture a huge page which > > FSDAX could not support unless the filesystem was built with 2MB block > > size. > > As I understand things, something like FSDAX post-folio should > generate maximal compound pages for extents in the page cache that are > physically contiguous. > > A high order folio can be placed in any lower order in the page > tables, so we never have to fracture it, unless the underlying page > are moved around - which requires an unmap_mapping_range() cycle.. That would be useful to disconnect the compound-page size from the page-table-entry installed for the page. However, don't we need typical compound page fracturing in the near term until folios move ahead? > > > > Assuming changing FSDAX is hard.. How would DAX people feel about just > > > deleting the PUD/PMD support until it can be done with compound pages? > > > > There are end users that would notice the PMD regression, and I think > > FSDAX PMDs with proper compound page metadata is on the same order of > > work as fixing the refcount. > > Hmm, I don't know.. I sketched out the refcount stuff and the code is > OK but ugly and will add a conditional to some THP cases That reminds me that there are several places that do: pmd_devmap(pmd) || pmd_trans_huge(pmd) ...for the common cases where a THP and DEVMAP page are equivalent, but there are a few places where those paths are not shared when the THP path expects that the page came from the page allocator. So while DEVMAP is not needed in GUP after this conversion, there still needs to be an audit of when THP needs to be careful of DAX mappings. > On the other hand, making THP unmap cases a bit slower is probably a > net win compared to making put_page a bit slower.. Considering unmap > is already quite heavy. FSDAX eventually learned how to replace 'struct page' with xarray for several paths, so "how hard could it be" (/famous last words) to replace xarray with 'struct page'? I think the end result will be cleaner, but yes, I expect some dragons in the conversion. > > > > 4) Ask what the pgmap owner wants to do: > > > > > > if (head->pgmap->deny_foll_longterm) > > > return FAIL > > > > The pgmap itself does not know, but the "holder" could specify this > > policy. > > Here I imagine the thing that creates the pgmap would specify the > policy it wants. In most cases the policy is tightly coupled to what > the free function in the the provided dev_pagemap_ops does.. The thing that creates the pgmap is the device-driver, and device-driver does not implement truncate or reclaim. 
It's not until the FS mounts that the pgmap needs to start enforcing pin lifetime guarantees. > > > Which is in line with the 'dax_holder_ops' concept being introduced > > for reverse mapping support. I.e. when the FS claims the dax-device > > it can specify at that point that it wants to forbid longterm. > > Which is a reasonable refinment if we think there are cases where two > nvdim users would want different things. > It's already the case that device-dax does not enforce transient pin lifetimes. > Anyhow, I'm wondering on a way forward. There are many balls in the > air, all linked: > - Joao's compound page support for device_dax and more > - Alex's DEVICE_COHERENT I have not seen these patches. /me notices no MAINTAINERS mention for include/linux/memremap.h > - The refcount normalization > - Removing the pgmap test from GUP > - Removing the need for the PUD/PMD/PTE special bit > - Removing the need for the PUD/PMD/PTE devmap bit It's no
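The shared-path pattern referenced above ("pmd_devmap(pmd) || pmd_trans_huge(pmd)"), roughly as it appears at several mm/ call sites; the audit concern is the minority of THP paths that additionally assume the page came from the page allocator:

/* trimmed shape, e.g. from the huge-pmd copy/change paths: */
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
	/* huge-entry handling shared between THP and DAX/devmap */
}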
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe wrote: > > On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > > On 10/19/21 00:06, Jason Gunthorpe wrote: > > > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > > > > > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not > > >>> sure TTM is a real place though. > > >> > > >> I was setting device-dax aside because it can use Joao's changes to > > >> get compound-page support. > > > > > > Ideally, but that ideas in that patch series have been floating around > > > for a long time now.. > > > > > The current status of the series misses a Rb on patches 6,7,10,12-14. > > Well, patch 8 too should now drop its tag, considering the latest > > discussion. > > > > If it helps moving things forward I could split my series further into: > > > > 1) the compound page introduction (patches 1-7) of my aforementioned series > > 2) vmemmap deduplication for memory gains (patches 9-14) > > 3) gup improvements (patch 8 and gup-slow improvements) > > I would split it, yes.. > > I think we can see a general consensus that making compound_head/etc > work consistently with how THP uses it will provide value and > opportunity for optimization going forward. > > > Whats the benefit between preventing longterm at start > > versus only after mounting the filesystem? Or is the intended future purpose > > to pass more context into an holder potential future callback e.g. nack > > longterm > > pins on a page basis? > > I understood Dan's remark that the device-dax path allows > FOLL_LONGTERM and the FSDAX path does not ? > > Which, IIRC, today is signaled basd on vma properties and in all cases > fast-gup is denied. Yeah, I forgot that 7af75561e171 eliminated any possibility of longterm-gup-fast for device-dax, let's not disturb that status quo. > > Maybe we can start by at least not add any flags and just prevent > > FOLL_LONGTERM on fsdax -- which I guess was the original purpose of > > commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast"). > > This patch (which I can formally send) has a sketch of that (below scissors > > mark): > > > > https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10f...@oracle.com/ > > Yes, basically, whatever test we want for 'deny fast gup foll > longterm' is fine. > > Personally I'd like to see us move toward a set of flag specifying > each special behavior and not a collection of types that imply special > behaviors. > > Eg we have at least: > - Block gup fast on foll_longterm > - Capture the refcount ==1 and use the pgmap free hook >(confusingly called page_is_devmap_managed()) > - Always use a swap entry > - page->index/mapping are used in the usual file based way? > > Probably more things.. Yes, agree with the principle of reducing type-implied special casing.
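A sketch of the flag-per-behavior direction (all flag names hypothetical; today these behaviors are implied by the pgmap's enum memremap type):

/* Hypothetical only -- none of these flags exist upstream. */
#define PGMAP_DENY_FAST_LONGTERM	(1UL << 0)	/* gup-fast rejects FOLL_LONGTERM */
#define PGMAP_MANAGED_PAGE_FREE		(1UL << 1)	/* refcount == 1 -> ->page_free() */
#define PGMAP_SWAP_ENTRY		(1UL << 2)	/* always use a swap entry */
#define PGMAP_FILE_BACKED		(1UL << 3)	/* page->index/mapping in use */

/* Call sites would then test behavior rather than type, e.g. in gup-fast:
 *
 *	if ((gup_flags & FOLL_LONGTERM) &&
 *	    (pgmap->flags & PGMAP_DENY_FAST_LONGTERM))
 *		goto pte_unmap;
 */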
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Wed, Oct 20, 2021 at 10:09 AM Joao Martins wrote: > > On 10/19/21 20:21, Dan Williams wrote: > > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe wrote: > >> > >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > >>> On 10/19/21 00:06, Jason Gunthorpe wrote: > >>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: > >>>> > >>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not > >>>>>> sure TTM is a real place though. > >>>>> > >>>>> I was setting device-dax aside because it can use Joao's changes to > >>>>> get compound-page support. > >>>> > >>>> Ideally, but that ideas in that patch series have been floating around > >>>> for a long time now.. > >>>> > >>> The current status of the series misses a Rb on patches 6,7,10,12-14. > >>> Well, patch 8 too should now drop its tag, considering the latest > >>> discussion. > >>> > >>> If it helps moving things forward I could split my series further into: > >>> > >>> 1) the compound page introduction (patches 1-7) of my aforementioned > >>> series > >>> 2) vmemmap deduplication for memory gains (patches 9-14) > >>> 3) gup improvements (patch 8 and gup-slow improvements) > >> > >> I would split it, yes.. > >> > >> I think we can see a general consensus that making compound_head/etc > >> work consistently with how THP uses it will provide value and > >> opportunity for optimization going forward. > >> > > I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the > dax subsystem patches (6 & 7), or otherwise send them over. > > >>> Whats the benefit between preventing longterm at start > >>> versus only after mounting the filesystem? Or is the intended future > >>> purpose > >>> to pass more context into an holder potential future callback e.g. nack > >>> longterm > >>> pins on a page basis? > >> > >> I understood Dan's remark that the device-dax path allows > >> FOLL_LONGTERM and the FSDAX path does not ? > >> > >> Which, IIRC, today is signaled basd on vma properties and in all cases > >> fast-gup is denied. > > > > Yeah, I forgot that 7af75561e171 eliminated any possibility of > > longterm-gup-fast for device-dax, let's not disturb that status quo. > > > I am slightly confused by this comment -- the status quo is what we are > questioning here -- And we talked about changing that in the past too > (thread below), that longterm-gup-fast was an oversight that that commit > was only applicable to fsdax. [Maybe this is just my english confusion] No, you have it correct. However that "regression" has gone unnoticed, so unless there is data showing that gup-fast on device-dax is critical for longterm page pinning workflows I'm ok for longterm to continue to trigger gup-slow.
Re: [PATCH v4 0/9] mmu notifier provide context informations
On Wed, Jan 23, 2019 at 2:23 PM wrote: > > From: Jérôme Glisse > > Hi Andrew, i see that you still have my event patch in you queue [1]. > This patchset replace that single patch and is broken down in further > step so that it is easier to review and ascertain that no mistake were > made during mechanical changes. Here are the step: > > Patch 1 - add the enum values > Patch 2 - coccinelle semantic patch to convert all call site of > mmu_notifier_range_init to default enum value and also > to passing down the vma when it is available > Patch 3 - update many call site to more accurate enum values > Patch 4 - add the information to the mmu_notifier_range struct > Patch 5 - helper to test if a range is updated to read only > > All the remaining patches are update to various driver to demonstrate > how this new information get use by device driver. I build tested > with make all and make all minus everything that enable mmu notifier > ie building with MMU_NOTIFIER=no. Also tested with some radeon,amd > gpu and intel gpu. > > If they are no objections i believe best plan would be to merge the > the first 5 patches (all mm changes) through your queue for 5.1 and > then to delay driver update to each individual driver tree for 5.2. > This will allow each individual device driver maintainer time to more > thouroughly test this more then my own testing. > > Note that i also intend to use this feature further in nouveau and > HMM down the road. I also expect that other user like KVM might be > interested into leveraging this new information to optimize some of > there secondary page table invalidation. "Down the road" users should introduce the functionality they want to consume. The common concern with preemptively including forward-looking infrastructure is realizing later that the infrastructure is not needed, or needs changing. If it has no current consumer, leave it out. > > Here is an explaination on the rational for this patchset: > > > CPU page table update can happens for many reasons, not only as a result > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also > as a result of kernel activities (memory compression, reclaim, migration, > ...). > > This patch introduce a set of enums that can be associated with each of > the events triggering a mmu notifier. Latter patches take advantages of > those enum values. > > - UNMAP: munmap() or mremap() > - CLEAR: page table is cleared (migration, compaction, reclaim, ...) > - PROTECTION_VMA: change in access protections for the range > - PROTECTION_PAGE: change in access protections for page in the range > - SOFT_DIRTY: soft dirtyness tracking > > Being able to identify munmap() and mremap() from other reasons why the > page table is cleared is important to allow user of mmu notifier to > update their own internal tracking structure accordingly (on munmap or > mremap it is not longer needed to track range of virtual address as it > becomes invalid). The only context information consumed in this patch set is MMU_NOTIFY_PROTECTION_VMA. What is the practical benefit of these "optimize out the case when a range is updated to read only" optimizations? Any numbers to show this is worth the code thrash? 
> > [1] > https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-mmu_notifier-contextual-information-for-event-triggering-invalidation-v2.patch > > Cc: Christian König > Cc: Jan Kara > Cc: Felix Kuehling > Cc: Jason Gunthorpe > Cc: Andrew Morton > Cc: Matthew Wilcox > Cc: Ross Zwisler > Cc: Dan Williams > Cc: Paolo Bonzini > Cc: Radim Krčmář > Cc: Michal Hocko > Cc: Ralph Campbell > Cc: John Hubbard > Cc: k...@vger.kernel.org > Cc: dri-devel@lists.freedesktop.org > Cc: linux-r...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org > Cc: Arnd Bergmann > > Jérôme Glisse (9): > mm/mmu_notifier: contextual information for event enums > mm/mmu_notifier: contextual information for event triggering > invalidation > mm/mmu_notifier: use correct mmu_notifier events for each invalidation > mm/mmu_notifier: pass down vma and reasons why mmu notifier is > happening > mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper > gpu/drm/radeon: optimize out the case when a range is updated to read > only > gpu/drm/amdgpu: optimize out the case when a range is updated to read > only > gpu/drm/i915: optimize out the case when a range is updated to read > only > RDMA/umem_odp: optimize out the case when a range is updated to read > only > > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 13 > drivers/gpu/drm/i915/i915_gem_userptr.c | 16 +++
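For reference, the enum from patch 1 as it landed in include/linux/mmu_notifier.h (later kernels extend the list):

enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP = 0,		/* munmap() or mremap() */
	MMU_NOTIFY_CLEAR,		/* migration, compaction, reclaim, ... */
	MMU_NOTIFY_PROTECTION_VMA,	/* protection change on the range */
	MMU_NOTIFY_PROTECTION_PAGE,	/* protection change on pages in range */
	MMU_NOTIFY_SOFT_DIRTY,		/* soft-dirty tracking */
};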
Re: [PATCH v4 0/9] mmu notifier provide context informations
On Wed, Jan 23, 2019 at 3:05 PM Jerome Glisse wrote: > > On Wed, Jan 23, 2019 at 02:54:40PM -0800, Dan Williams wrote: > > On Wed, Jan 23, 2019 at 2:23 PM wrote: > > > > > > From: Jérôme Glisse > > > > > > Hi Andrew, i see that you still have my event patch in you queue [1]. > > > This patchset replace that single patch and is broken down in further > > > step so that it is easier to review and ascertain that no mistake were > > > made during mechanical changes. Here are the step: > > > > > > Patch 1 - add the enum values > > > Patch 2 - coccinelle semantic patch to convert all call site of > > > mmu_notifier_range_init to default enum value and also > > > to passing down the vma when it is available > > > Patch 3 - update many call site to more accurate enum values > > > Patch 4 - add the information to the mmu_notifier_range struct > > > Patch 5 - helper to test if a range is updated to read only > > > > > > All the remaining patches are update to various driver to demonstrate > > > how this new information get use by device driver. I build tested > > > with make all and make all minus everything that enable mmu notifier > > > ie building with MMU_NOTIFIER=no. Also tested with some radeon,amd > > > gpu and intel gpu. > > > > > > If they are no objections i believe best plan would be to merge the > > > the first 5 patches (all mm changes) through your queue for 5.1 and > > > then to delay driver update to each individual driver tree for 5.2. > > > This will allow each individual device driver maintainer time to more > > > thouroughly test this more then my own testing. > > > > > > Note that i also intend to use this feature further in nouveau and > > > HMM down the road. I also expect that other user like KVM might be > > > interested into leveraging this new information to optimize some of > > > there secondary page table invalidation. > > > > "Down the road" users should introduce the functionality they want to > > consume. The common concern with preemptively including > > forward-looking infrastructure is realizing later that the > > infrastructure is not needed, or needs changing. If it has no current > > consumer, leave it out. > > This patchset already show that this is useful, what more can i do ? > I know i will use this information, in nouveau for memory policy we > allocate our own structure for every vma the GPU ever accessed or that > userspace hinted we should set a policy for. Right now with existing > mmu notifier i _must_ free those structure because i do not know if > the invalidation is an munmap or something else. So i am loosing > important informations and unecessarly free struct that i will have > to re-allocate just couple jiffies latter. That's one way i am using > this. Understood, but that still seems to say stage the core support when the nouveau enabling is ready. > The other way is to optimize GPU page table update just like i > am doing with all the patches to RDMA/ODP and various GPU drivers. Yes. > > > > > Here is an explaination on the rational for this patchset: > > > > > > > > > CPU page table update can happens for many reasons, not only as a result > > > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also > > > as a result of kernel activities (memory compression, reclaim, migration, > > > ...). > > > > > > This patch introduce a set of enums that can be associated with each of > > > the events triggering a mmu notifier. Latter patches take advantages of > > > those enum values. 
> > > > > > - UNMAP: munmap() or mremap() > > > - CLEAR: page table is cleared (migration, compaction, reclaim, ...) > > > - PROTECTION_VMA: change in access protections for the range > > > - PROTECTION_PAGE: change in access protections for page in the range > > > - SOFT_DIRTY: soft dirtyness tracking > > > > > > Being able to identify munmap() and mremap() from other reasons why the > > > page table is cleared is important to allow user of mmu notifier to > > > update their own internal tracking structure accordingly (on munmap or > > > mremap it is not longer needed to track range of virtual address as it > > > becomes invalid). > > > > The only context information consumed in this patch set is > > MMU_NOTIFY_PROTECTION_VMA. > > > > What is the practical benefit of these "optimize out
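For context, the driver-side shape those "optimize out" patches share, sketched (the mapping-state helper is a hypothetical stand-in for each driver's own bookkeeping; mmu_notifier_range_update_to_read_only() is the helper from patch 5):

static int drv_invalidate_range_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range)
{
	/* If the range is only losing write permission and this driver's
	 * mappings are already read-only, there is nothing to invalidate. */
	if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
	    mmu_notifier_range_update_to_read_only(range) &&
	    !drv_has_writable_mapping(mn, range->start, range->end))
		return 0;

	/* ... otherwise tear down the device's mappings as before ... */
	return 0;
}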
Re: [PATCH v5 0/9] mmu notifier provide context informations
On Tue, Feb 19, 2019 at 12:04 PM wrote: > > From: Jérôme Glisse > > Since last version [4] i added the extra bits needed for the change_pte > optimization (which is a KSM thing). Here i am not posting users of > this, they will be posted to the appropriate sub-systems (KVM, GPU, > RDMA, ...) once this serie get upstream. If you want to look at users > of this see [5] [6]. If this gets in 5.1 then i will be submitting > those users for 5.2 (including KVM if KVM folks feel comfortable with > it). The users look small and straightforward. Why not await acks and reviewed-by's for the users like a typical upstream submission and merge them together? Is all of the functionality of this infrastructure consumed by the proposed users? Last time I checked it was only a subset.
Re: [PATCH v5 0/9] mmu notifier provide context informations
On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse wrote: > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote: > > On Tue, Feb 19, 2019 at 12:04 PM wrote: > > > > > > From: Jérôme Glisse > > > > > > Since last version [4] i added the extra bits needed for the change_pte > > > optimization (which is a KSM thing). Here i am not posting users of > > > this, they will be posted to the appropriate sub-systems (KVM, GPU, > > > RDMA, ...) once this serie get upstream. If you want to look at users > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting > > > those users for 5.2 (including KVM if KVM folks feel comfortable with > > > it). > > > > The users look small and straightforward. Why not await acks and > > reviewed-by's for the users like a typical upstream submission and > > merge them together? Is all of the functionality of this > > infrastructure consumed by the proposed users? Last time I checked it > > was only a subset. > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR > vs UNMAP. Both of which i intend to use. The RDMA folks already ack > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks > were ok with it too. I do not want to merge things through Andrew > for all of this we discussed that in the past, merge mm bits through > Andrew in one release and bits that use things in the next release. Ok, I was trying to find the links to the acks on the mailing list, those references would address my concerns. I see no reason to rush SOFT_DIRTY and CLEAR ahead of the upstream user.
Re: [PATCH v5 0/9] mmu notifier provide context informations
On Tue, Feb 19, 2019 at 12:41 PM Jason Gunthorpe wrote: > > On Tue, Feb 19, 2019 at 03:30:33PM -0500, Jerome Glisse wrote: > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote: > > > On Tue, Feb 19, 2019 at 12:04 PM wrote: > > > > > > > > From: Jérôme Glisse > > > > > > > > Since last version [4] i added the extra bits needed for the change_pte > > > > optimization (which is a KSM thing). Here i am not posting users of > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU, > > > > RDMA, ...) once this serie get upstream. If you want to look at users > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting > > > > those users for 5.2 (including KVM if KVM folks feel comfortable with > > > > it). > > > > > > The users look small and straightforward. Why not await acks and > > > reviewed-by's for the users like a typical upstream submission and > > > merge them together? Is all of the functionality of this > > > infrastructure consumed by the proposed users? Last time I checked it > > > was only a subset. > > > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks > > were ok with it too. I do not want to merge things through Andrew > > for all of this we discussed that in the past, merge mm bits through > > Andrew in one release and bits that use things in the next release. > > It is usually cleaner for everyone to split patches like this, for > instance I always prefer to merge RDMA patches via RDMA when > possible. Less conflicts. > > The other somewhat reasonable option is to get acks and send your own > complete PR to Linus next week? That works OK for tree-wide changes. Yes, I'm not proposing that they be merged together; instead I'm just looking for the acked-by / reviewed-by tags even if those patches are targeting the next merge window.
Re: [PATCH v5 0/9] mmu notifier provide context informations
On Tue, Feb 19, 2019 at 12:58 PM Jerome Glisse wrote: > > On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote: > > On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse wrote: > > > > > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote: > > > > On Tue, Feb 19, 2019 at 12:04 PM wrote: > > > > > > > > > > From: Jérôme Glisse > > > > > > > > > > Since last version [4] i added the extra bits needed for the > > > > > change_pte > > > > > optimization (which is a KSM thing). Here i am not posting users of > > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU, > > > > > RDMA, ...) once this serie get upstream. If you want to look at users > > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting > > > > > those users for 5.2 (including KVM if KVM folks feel comfortable with > > > > > it). > > > > > > > > The users look small and straightforward. Why not await acks and > > > > reviewed-by's for the users like a typical upstream submission and > > > > merge them together? Is all of the functionality of this > > > > infrastructure consumed by the proposed users? Last time I checked it > > > > was only a subset. > > > > > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR > > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack > > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks > > > were ok with it too. I do not want to merge things through Andrew > > > for all of this we discussed that in the past, merge mm bits through > > > Andrew in one release and bits that use things in the next release. > > > > Ok, I was trying to find the links to the acks on the mailing list, > > those references would address my concerns. I see no reason to rush > > SOFT_DIRTY and CLEAR ahead of the upstream user. > > I intend to post user for those in next couple weeks for 5.2 HMM bits. > So user for this (CLEAR/UNMAP/SOFTDIRTY) will definitly materialize in > time for 5.2. > > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395 > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473 Nice, thanks! > For KVM Andrea Arcangeli seems to like the whole idea to restore the > change_pte optimization but i have not got ACK from Radim or Paolo, > however given the small performance improvement figure i get with it > i do not see while they would not ACK. Sure, but no need to push ahead without that confirmation, right? At least for the piece that KVM cares about, maybe that's already covered in the infrastructure RDMA and RADEON are using?
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig wrote: > > Hi Dan, Jérôme and Jason, > > below is a series that cleans up the dev_pagemap interface so that > it is more easily usable, which removes the need to wrap it in hmm > and thus allowing to kill a lot of code > > Diffstat: > > 22 files changed, 245 insertions(+), 802 deletions(-) Hooray! > Git tree: > > git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup I just realized this collides with the dev_pagemap release rework in Andrew's tree (commit ids below are from next.git and are not stable) 4422ee8476f0 mm/devm_memremap_pages: fix final page put race 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally af37085de906 lib/genalloc: introduce chunk owners e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages 216475c7eaa8 drivers/base/devres: introduce devm_release_action() CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c CONFLICT (content): Merge conflict in mm/hmm.c CONFLICT (content): Merge conflict in kernel/memremap.c CONFLICT (content): Merge conflict in include/linux/memremap.h CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c CONFLICT (content): Merge conflict in drivers/dax/device.c CONFLICT (content): Merge conflict in drivers/dax/dax-private.h Perhaps we should pull those out and resend them through hmm.git? It also turns out the nvdimm unit tests crash with this signature on that branch where base v5.2-rc3 passes: BUG: kernel NULL pointer dereference, address: 0008 [..] CPU: 15 PID: 1414 Comm: lt-libndctl Tainted: G OE 5.2.0-rc3+ #3399 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:percpu_ref_kill_and_confirm+0x1e/0x180 [..] Call Trace: release_nodes+0x234/0x280 device_release_driver_internal+0xe8/0x1b0 bus_remove_device+0xf2/0x160 device_del+0x166/0x370 unregister_dev_dax+0x23/0x50 release_nodes+0x234/0x280 device_release_driver_internal+0xe8/0x1b0 unbind_store+0x94/0x120 kernfs_fop_write+0xf0/0x1a0 vfs_write+0xb7/0x1b0 ksys_write+0x5c/0xd0 do_syscall_64+0x60/0x240 The crash bisects to: d8cc8bbe108c device-dax: use the dev_pagemap internal refcount
Re: [PATCH 09/22] memremap: lift the devmap_enable manipulation into devm_memremap_pages
On Thu, Jun 13, 2019 at 12:35 PM Jason Gunthorpe wrote: > > On Thu, Jun 13, 2019 at 11:43:12AM +0200, Christoph Hellwig wrote: > > Just check if there is a ->page_free operation set and take care of the > > static key enable, as well as the put using device managed resources. > > diff --git a/mm/hmm.c b/mm/hmm.c > > index c76a1b5defda..6dc769feb2e1 100644 > > +++ b/mm/hmm.c > > @@ -1378,8 +1378,6 @@ struct hmm_devmem *hmm_devmem_add(const struct > > hmm_devmem_ops *ops, > > void *result; > > int ret; > > > > - dev_pagemap_get_ops(); > > - > > Where was the matching dev_pagemap_put_ops() for this hmm case? This > is a bug fix too? > It never existed. HMM turned on the facility and made everyone's put_page() operations slower regardless of whether HMM was in active use. > The nouveau driver is the only one to actually call this hmm function > and it does it as part of a probe function. > > Seems reasonable, however, in the unlikely event that it fails to init > 'dmem' the driver will retain a dev_pagemap_get_ops until it unloads. > This imbalance doesn't seem worth worrying about. Right, unless/until the overhead of checking for put_page() callbacks starts to hurt, leaving pagemap_ops tied to the lifetime of the driver load seems acceptable, because who unbinds their GPU device at runtime? On the other hand it was simple enough for the pmem driver to drop the reference each time a device was unbound just to close the loop. > > Reviewed-by: Christoph Hellwig ...minor typo.
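For reference, the facility in question, roughly as it lived in kernel/memremap.c and include/linux/mm.h at the time (trimmed; the real enable path used an atomic counter plus static_branch_enable()):

DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

void dev_pagemap_get_ops(void)
{
	static_branch_inc(&devmap_managed_key);
}

/* The put_page() fast path only pays for the devmap case once the key is
 * enabled, which is why HMM flipping it on at probe time taxed everyone: */
static inline bool put_devmap_managed_page(struct page *page)
{
	if (!static_branch_unlikely(&devmap_managed_key))
		return false;
	if (!is_zone_device_page(page))
		return false;
	/* ... refcount == 1 hand-off to the pgmap's ->page_free() ... */
	return true;
}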
Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill
On Thu, Jun 13, 2019 at 1:12 PM Logan Gunthorpe wrote: > > > > On 2019-06-13 3:43 a.m., Christoph Hellwig wrote: > > Passing the actual typed structure leads to more understandable code > > vs the actual references. > > Ha, ok, I originally suggested this to Dan when he introduced the > callback[1]. > > Reviewed-by: Logan Gunthorpe Reviewed-by: Dan Williams Reported-by: Logan Gunthorpe :) > > Logan > > [1] > https://lore.kernel.org/lkml/8f0cae82-130f-8a64-cfbd-fda5fd76b...@deltatee.com/T/#u
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 1:18 PM Logan Gunthorpe wrote: > > > > On 2019-06-13 12:27 p.m., Dan Williams wrote: > > On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig wrote: > >> > >> Hi Dan, Jérôme and Jason, > >> > >> below is a series that cleans up the dev_pagemap interface so that > >> it is more easily usable, which removes the need to wrap it in hmm > >> and thus allowing to kill a lot of code > >> > >> Diffstat: > >> > >> 22 files changed, 245 insertions(+), 802 deletions(-) > > > > Hooray! > > > >> Git tree: > >> > >> git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup > > > > I just realized this collides with the dev_pagemap release rework in > > Andrew's tree (commit ids below are from next.git and are not stable) > > > > 4422ee8476f0 mm/devm_memremap_pages: fix final page put race > > 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally > > af37085de906 lib/genalloc: introduce chunk owners > > e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path > > 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages > > 216475c7eaa8 drivers/base/devres: introduce devm_release_action() > > > > CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c > > CONFLICT (content): Merge conflict in mm/hmm.c > > CONFLICT (content): Merge conflict in kernel/memremap.c > > CONFLICT (content): Merge conflict in include/linux/memremap.h > > CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c > > CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c > > CONFLICT (content): Merge conflict in drivers/dax/device.c > > CONFLICT (content): Merge conflict in drivers/dax/dax-private.h > > > > Perhaps we should pull those out and resend them through hmm.git? > > Hmm, I've been waiting for those patches to get in for a little while now ;( Unless Andrew was going to submit them as v5.2-rc fixes, I think I should rebase / submit them on current hmm.git and then throw these cleanups from Christoph on top?
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 11:14 PM Christoph Hellwig wrote: > > On Thu, Jun 13, 2019 at 11:27:39AM -0700, Dan Williams wrote: > > It also turns out the nvdimm unit tests crash with this signature on > > that branch where base v5.2-rc3 passes: > > How do you run that test? This is the unit test suite that gets kicked off by running "make check" from the ndctl source repository. In this case it requires the nfit_test set of modules to create a fake nvdimm environment. The setup instructions are in the README, but feel free to send me branches and I can kick off a test. One of these days we'll get around to making it automated for patch submissions to the linux-nvdimm mailing list. https://github.com/pmem/ndctl/blob/master/README.md
Re: dev_pagemap related cleanups
On Sat, Jun 15, 2019 at 1:34 AM Christoph Hellwig wrote: > > On Fri, Jun 14, 2019 at 06:14:45PM -0700, Dan Williams wrote: > > On Thu, Jun 13, 2019 at 11:14 PM Christoph Hellwig wrote: > > > > > > On Thu, Jun 13, 2019 at 11:27:39AM -0700, Dan Williams wrote: > > > > It also turns out the nvdimm unit tests crash with this signature on > > > > that branch where base v5.2-rc3 passes: > > > > > > How do you run that test? > > > > This is the unit test suite that gets kicked off by running "make > > check" from the ndctl source repository. In this case it requires the > > nfit_test set of modules to create a fake nvdimm environment. > > > > The setup instructions are in the README, but feel free to send me > > branches and I can kick off a test. One of these days we'll get around to > > making it automated for patch submissions to the linux-nvdimm mailing > > list. > > Oh, now I remember, and that was the bummer as anything requiring modules > just does not fit at all into my normal test flows that just inject > kernel images and use otherwise static images. Yeah... although we do have some changes being proposed from non-x86 devs to allow a subset of the tests to run without the nfit_test modules: https://patchwork.kernel.org/patch/10980779/ ...so this prompts me to go review that patch.
Re: [PATCH 06/25] mm: factor out a devm_request_free_mem_region helper
On Mon, Jun 17, 2019 at 5:27 AM Christoph Hellwig wrote: > > Keep the physical address allocation that hmm_add_device does with the > rest of the resource code, and allow future reuse of it without the hmm > wrapper. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Jason Gunthorpe > Reviewed-by: John Hubbard > --- > include/linux/ioport.h | 2 ++ > kernel/resource.c | 39 +++ > mm/hmm.c | 33 - > 3 files changed, 45 insertions(+), 29 deletions(-) > > diff --git a/include/linux/ioport.h b/include/linux/ioport.h > index da0ebaec25f0..76a33ae3bf6c 100644 > --- a/include/linux/ioport.h > +++ b/include/linux/ioport.h > @@ -286,6 +286,8 @@ static inline bool resource_overlaps(struct resource *r1, > struct resource *r2) > return (r1->start <= r2->end && r1->end >= r2->start); > } > > +struct resource *devm_request_free_mem_region(struct device *dev, > + struct resource *base, unsigned long size); This appears to need a 'static inline' helper stub in the CONFIG_DEVICE_PRIVATE=n case, otherwise this compile error triggers: ld: mm/hmm.o: in function `hmm_devmem_add': /home/dwillia2/git/linux/mm/hmm.c:1427: undefined reference to `devm_request_free_mem_region'
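A minimal sketch of the kind of stub being asked for, assuming the CONFIG_DEVICE_PRIVATE=n case should simply report failure (the ERR_PTR(-EINVAL) return is illustrative):

#ifdef CONFIG_DEVICE_PRIVATE
struct resource *devm_request_free_mem_region(struct device *dev,
		struct resource *base, unsigned long size);
#else
static inline struct resource *devm_request_free_mem_region(
		struct device *dev, struct resource *base, unsigned long size)
{
	/* no DEVICE_PRIVATE consumers, so there is no free region to find */
	return ERR_PTR(-EINVAL);
}
#endif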
Re: [PATCH 08/25] memremap: move dev_pagemap callbacks into a separate structure
On Mon, Jun 17, 2019 at 5:27 AM Christoph Hellwig wrote: > > The dev_pagemap is growing too many callbacks. Move them into a > separate ops structure so that they are not duplicated for multiple > instances, and an attacker can't easily overwrite them. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Logan Gunthorpe > Reviewed-by: Jason Gunthorpe > --- > drivers/dax/device.c | 11 ++ > drivers/dax/pmem/core.c | 2 +- > drivers/nvdimm/pmem.c | 19 +--- > drivers/pci/p2pdma.c | 9 +--- > include/linux/memremap.h | 36 +-- > kernel/memremap.c | 18 > mm/hmm.c | 10 ++--- > tools/testing/nvdimm/test/iomap.c | 9 > 8 files changed, 65 insertions(+), 49 deletions(-) > [..] > diff --git a/tools/testing/nvdimm/test/iomap.c > b/tools/testing/nvdimm/test/iomap.c > index 219dd0a1cb08..a667d974155e 100644 > --- a/tools/testing/nvdimm/test/iomap.c > +++ b/tools/testing/nvdimm/test/iomap.c > @@ -106,11 +106,10 @@ EXPORT_SYMBOL(__wrap_devm_memremap); > > static void nfit_test_kill(void *_pgmap) > { > - struct dev_pagemap *pgmap = _pgmap; Whoops, needed to keep this line to avoid: tools/testing/nvdimm/test/iomap.c:109:11: error: ‘pgmap’ undeclared (first use in this function); did you mean ‘_pgmap’?
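With that one-liner restored, the wrapper reads roughly as follows (a reconstruction from the hunks quoted in this thread; the else branch mirrors the core devm_memremap_pages() teardown and is not quoted from the patch itself):

static void nfit_test_kill(void *_pgmap)
{
	/* keep the local: the statements below dereference pgmap */
	struct dev_pagemap *pgmap = _pgmap;

	WARN_ON(!pgmap || !pgmap->ref);

	if (pgmap->ops && pgmap->ops->kill)
		pgmap->ops->kill(pgmap);
	else
		percpu_ref_kill(pgmap->ref);
}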
Re: [PATCH 07/25] memremap: validate the pagemap type passed to devm_memremap_pages
On Mon, Jun 17, 2019 at 5:27 AM Christoph Hellwig wrote: > > Most pgmap types are only supported when certain config options are > enabled. Check for a type that is valid for the current configuration > before setting up the pagemap. > > Signed-off-by: Christoph Hellwig > --- > kernel/memremap.c | 27 +++ > 1 file changed, 27 insertions(+) > > diff --git a/kernel/memremap.c b/kernel/memremap.c > index 6e1970719dc2..6a2dd31a6250 100644 > --- a/kernel/memremap.c > +++ b/kernel/memremap.c > @@ -157,6 +157,33 @@ void *devm_memremap_pages(struct device *dev, struct > dev_pagemap *pgmap) > pgprot_t pgprot = PAGE_KERNEL; > int error, nid, is_ram; > > + switch (pgmap->type) { > + case MEMORY_DEVICE_PRIVATE: > + if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { > + WARN(1, "Device private memory not supported\n"); > + return ERR_PTR(-EINVAL); > + } > + break; > + case MEMORY_DEVICE_PUBLIC: > + if (!IS_ENABLED(CONFIG_DEVICE_PUBLIC)) { > + WARN(1, "Device public memory not supported\n"); > + return ERR_PTR(-EINVAL); > + } > + break; > + case MEMORY_DEVICE_FS_DAX: > + if (!IS_ENABLED(CONFIG_ZONE_DEVICE) || > + IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { > + WARN(1, "File system DAX not supported\n"); > + return ERR_PTR(-EINVAL); > + } > + break; > + case MEMORY_DEVICE_PCI_P2PDMA: Need a lead in patch that introduces MEMORY_DEVICE_DEVDAX, otherwise: Invalid pgmap type 0 WARNING: CPU: 6 PID: 1316 at kernel/memremap.c:183 devm_memremap_pages+0x1d8/0x700 [..] RIP: 0010:devm_memremap_pages+0x1d8/0x700 [..] Call Trace: dev_dax_probe+0xc7/0x1e0 [device_dax] really_probe+0xef/0x390 driver_probe_device+0xb4/0x100 device_driver_attach+0x4f/0x60 ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
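The shape of the requested lead-in change, matching the incremental diff that appears later in this thread, is a dedicated type for device-dax (plus setting dev_dax->pgmap.type at probe time and accepting the new case in the switch above):

enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_PUBLIC,
	MEMORY_DEVICE_FS_DAX,
	MEMORY_DEVICE_PCI_P2PDMA,
	MEMORY_DEVICE_DEVDAX,	/* new: device-dax character-device memory */
};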
Re: [PATCH 10/25] memremap: lift the devmap_enable manipulation into devm_memremap_pages
On Mon, Jun 17, 2019 at 5:28 AM Christoph Hellwig wrote: > > Just check if there is a ->page_free operation set and take care of the > static key enable, as well as the put using device managed resources. > Also check that a ->page_free is provided for the pgmaps types that > require it, and check for a valid type as well while we are at it. > > Note that this also fixes the fact that hmm never called > dev_pagemap_put_ops and thus would leave the slow path enabled forever, > even after a device driver unload or disable. > > Signed-off-by: Christoph Hellwig > --- > drivers/nvdimm/pmem.c | 23 +++-- > include/linux/mm.h| 10 > kernel/memremap.c | 57 ++- > mm/hmm.c | 2 -- > 4 files changed, 39 insertions(+), 53 deletions(-) > [..] > diff --git a/kernel/memremap.c b/kernel/memremap.c > index ba7156bd52d1..7272027fbdd7 100644 > --- a/kernel/memremap.c > +++ b/kernel/memremap.c [..] > @@ -190,6 +219,12 @@ void *devm_memremap_pages(struct device *dev, struct > dev_pagemap *pgmap) > return ERR_PTR(-EINVAL); > } > > + if (pgmap->type != MEMORY_DEVICE_PCI_P2PDMA) { Once we have MEMORY_DEVICE_DEVDAX then this check needs to be fixed up to skip that case as well, otherwise: Missing page_free method WARNING: CPU: 19 PID: 1518 at kernel/memremap.c:33 devm_memremap_pages+0x745/0x7d0 RIP: 0010:devm_memremap_pages+0x745/0x7d0 Call Trace: dev_dax_probe+0xc6/0x1e0 [device_dax] really_probe+0xef/0x390 ? driver_allows_async_probing+0x50/0x50 driver_probe_device+0xb4/0x100 ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 07/25] memremap: validate the pagemap type passed to devm_memremap_pages
On Mon, Jun 17, 2019 at 12:59 PM Christoph Hellwig wrote: > > On Mon, Jun 17, 2019 at 12:02:09PM -0700, Dan Williams wrote: > > Need a lead in patch that introduces MEMORY_DEVICE_DEVDAX, otherwise: > > Or maybe a MEMORY_DEVICE_DEFAULT = 0 shared by fsdax and p2pdma? I thought about that, but it seems is_pci_p2pdma_page() needs the distinction between the 2 types.
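For context, the helper that needs the distinction looks roughly like this in the v5.2-era <linux/mm.h> (paraphrased, not quoted from the series), which is why a shared MEMORY_DEVICE_DEFAULT for fsdax and p2pdma would not work:

static inline bool is_pci_p2pdma_page(const struct page *page)
{
	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
		IS_ENABLED(CONFIG_PCI_P2PDMA) &&
		is_zone_device_page(page) &&
		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
}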
Re: [PATCH 08/25] memremap: move dev_pagemap callbacks into a separate structure
On Mon, Jun 17, 2019 at 12:59 PM Christoph Hellwig wrote: > > On Mon, Jun 17, 2019 at 10:51:35AM -0700, Dan Williams wrote: > > > - struct dev_pagemap *pgmap = _pgmap; > > > > Whoops, needed to keep this line to avoid: > > > > tools/testing/nvdimm/test/iomap.c:109:11: error: ‘pgmap’ undeclared > > (first use in this function); did you mean ‘_pgmap’? > > So I really shouldn't be tripping over this anymore, but can we somehow > fix this mess? > > - at least add it to the normal build system and kconfig deps instead > of stashing it away so that things like buildbot can build it? > - at least allow building it (under COMPILE_TEST) if needed even when > pmem.ko and friends are built in the kernel? Done: https://patchwork.kernel.org/patch/11000477/
Re: [PATCH 15/25] device-dax: use the dev_pagemap internal refcount
On Mon, Jun 17, 2019 at 5:28 AM Christoph Hellwig wrote: > > The functionality is identical to the one currently open coded in > device-dax. > > Signed-off-by: Christoph Hellwig > --- > drivers/dax/dax-private.h | 4 > drivers/dax/device.c | 43 --- > 2 files changed, 47 deletions(-) This needs the mock devm_memremap_pages() to setup the common percpu_ref. Incremental patch attached: From 875e71489c8485448a5b7df2d8a8b2ed77d2b555 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 18 Jun 2019 11:58:24 -0700 Subject: [PATCH] tools/testing/nvdimm: Support the 'internal' ref of dev_pagemap For users of the common percpu-ref implementation, like device-dax, arrange for nfit_test to initialize the common parameters. Signed-off-by: Dan Williams --- tools/testing/nvdimm/test/iomap.c | 41 --- 1 file changed, 32 insertions(+), 9 deletions(-) diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index 3bc1c16c4ef9..9019dd8afbc1 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -108,8 +108,6 @@ static void nfit_test_kill(void *_pgmap) { struct dev_pagemap *pgmap = _pgmap; - WARN_ON(!pgmap || !pgmap->ref); - if (pgmap->ops && pgmap->ops->kill) pgmap->ops->kill(pgmap); else @@ -123,20 +121,45 @@ static void nfit_test_kill(void *_pgmap) } } +static void dev_pagemap_percpu_release(struct percpu_ref *ref) +{ + struct dev_pagemap *pgmap = + container_of(ref, struct dev_pagemap, internal_ref); + + complete(&pgmap->done); +} + void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) { + int error; resource_size_t offset = pgmap->res.start; struct nfit_test_resource *nfit_res = get_nfit_res(offset); - if (nfit_res) { - int rc; + if (!nfit_res) + return devm_memremap_pages(dev, pgmap); - rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap); - if (rc) - return ERR_PTR(rc); - return nfit_res->buf + offset - nfit_res->res.start; + pgmap->dev = dev; + if (!pgmap->ref) { + if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup)) + return ERR_PTR(-EINVAL); + + init_completion(&pgmap->done); + error = percpu_ref_init(&pgmap->internal_ref, +dev_pagemap_percpu_release, 0, GFP_KERNEL); + if (error) + return ERR_PTR(error); + pgmap->ref = &pgmap->internal_ref; + } else { + if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) { + WARN(1, "Missing reference count teardown definition\n"); + return ERR_PTR(-EINVAL); + } } - return devm_memremap_pages(dev, pgmap); + + error = devm_add_action_or_reset(dev, nfit_test_kill, pgmap); + if (error) + return ERR_PTR(error); + return nfit_res->buf + offset - nfit_res->res.start; } EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); -- 2.20.1 ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: dev_pagemap related cleanups v2
On Mon, Jun 17, 2019 at 5:27 AM Christoph Hellwig wrote: > > Hi Dan, Jérôme and Jason, > > below is a series that cleans up the dev_pagemap interface so that > it is more easily usable, which removes the need to wrap it in hmm > and thus allowing to kill a lot of code > > Note: this series is on top of the rdma/hmm branch + the dev_pagemap > releas fix series from Dan that went into 5.2-rc5. > > Git tree: > > git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup.2 > > Gitweb: > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hmm-devmem-cleanup.2 > > Changes since v1: > - rebase > - also switch p2pdma to the internal refcount > - add type checking for pgmap->type > - rename the migrate method to migrate_to_ram > - cleanup the altmap_valid flag > - various tidbits from the reviews Attached is my incremental fixups on top of this series, with those integrated you can add: Tested-by: Dan Williams ...to the patches that touch kernel/memremap.c, drivers/dax, and drivers/nvdimm. You can also add: Reviewed-by: Dan Williams ...for the series. diff --git a/drivers/dax/device.c b/drivers/dax/device.c index a9d7c90ecf1e..1af823b2fe6b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -428,6 +428,7 @@ int dev_dax_probe(struct device *dev) return -EBUSY; } + dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX; addr = devm_memremap_pages(dev, &dev_dax->pgmap); if (IS_ERR(addr)) return PTR_ERR(addr); diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index 54500798f23a..57d3a6c3ac70 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -118,4 +118,15 @@ config NVDIMM_KEYS depends on ENCRYPTED_KEYS depends on (LIBNVDIMM=ENCRYPTED_KEYS) || LIBNVDIMM=m +config NVDIMM_TEST_BUILD + bool "Build the unit test core" + depends on COMPILE_TEST + default COMPILE_TEST + help + Build the core of the unit test infrastructure. The result of + this build is non-functional for unit test execution, but it + otherwise helps catch build errors induced by changes to the + core devm_memremap_pages() implementation and other + infrastructure. + endif diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile index 6f2a088afad6..40080c120363 100644 --- a/drivers/nvdimm/Makefile +++ b/drivers/nvdimm/Makefile @@ -28,3 +28,7 @@ libnvdimm-$(CONFIG_BTT) += btt_devs.o libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o libnvdimm-$(CONFIG_NVDIMM_KEYS) += security.o + +TOOLS := ../../tools +TEST_SRC := $(TOOLS)/testing/nvdimm/test +obj-$(CONFIG_NVDIMM_TEST_BUILD) := $(TEST_SRC)/iomap.o diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 7e0f072ddce7..470de68dabd6 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -55,12 +55,19 @@ struct vmem_altmap { * MEMORY_DEVICE_PCI_P2PDMA: * Device memory residing in a PCI BAR intended for use with Peer-to-Peer * transactions. + * + * MEMORY_DEVICE_DEVDAX: + * Host memory that has similar access semantics as System RAM i.e. DMA + * coherent and supports page pinning. In contrast to + * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax + * character device. 
*/ enum memory_type { MEMORY_DEVICE_PRIVATE = 1, MEMORY_DEVICE_PUBLIC, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_PCI_P2PDMA, + MEMORY_DEVICE_DEVDAX, }; struct dev_pagemap_ops { diff --git a/kernel/memremap.c b/kernel/memremap.c index 60693a1e8e92..52b4968e62cd 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -173,6 +173,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) }; pgprot_t pgprot = PAGE_KERNEL; int error, nid, is_ram; + bool get_ops = true; switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: @@ -199,6 +200,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) } break; case MEMORY_DEVICE_PCI_P2PDMA: + case MEMORY_DEVICE_DEVDAX: + get_ops = false; break; default: WARN(1, "Invalid pgmap type %d\n", pgmap->type); @@ -222,7 +225,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) } } - if (pgmap->type != MEMORY_DEVICE_PCI_P2PDMA) { + if (get_ops) { error = dev_pagemap_get_ops(dev, pgmap); if (error) return ERR_PTR(error); diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index 8cd9b9873a7f..9019dd8afbc1 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -106,7 +106,7 @@ EXPORT_SYMBOL(__wrap_devm_memremap); static void nfit_test_kill(void *_pgmap) { - WARN_ON(!pgmap || !pgmap->ref) + struct dev_pagemap *pgmap = _pgmap; if (pgmap->ops && pgmap->ops->kill) pgmap->ops->kill(pgmap); @@ -121,20 +121,45 @@ static void nfit_test_kill(void *_pgmap) } } +static void dev_pagemap_percpu_
Re: dev_pagemap related cleanups v2
On Wed, Jun 19, 2019 at 9:37 AM Jason Gunthorpe wrote: > > On Wed, Jun 19, 2019 at 11:40:32AM +0200, Christoph Hellwig wrote: > > On Tue, Jun 18, 2019 at 12:47:10PM -0700, Dan Williams wrote: > > > > Git tree: > > > > > > > > git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup.2 > > > > > > > > Gitweb: > > > > > > > > > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hmm-devmem-cleanup.2 > > > > > > > > Attached is my incremental fixups on top of this series, with those > > > integrated you can add: > > > > I've folded your incremental bits in and pushed out a new > > hmm-devmem-cleanup.3 to the repo above. Let me know if I didn't mess > > up anything else. I'll wait for a few more comments and Jason's > > planned rebase of the hmm branch before reposting. > > I said I wouldn't rebase the hmm.git (as it needs to go to DRM, AMD > and RDMA git trees).. > > Instead I will merge v5.2-rc5 to the tree before applying this series. > > I've understood this to be Linus's preferred workflow. > > So, please send the next iteration of this against either > plain v5.2-rc5 or v5.2-rc5 merged with hmm.git and I'll sort it out. Just make sure that when you backmerge v5.2-rc5 you have a clear reason in the merge commit message about why you needed to do it. While needless rebasing is top of the pet peeve list, second place, as I found out, is mystery merges without explanations.
Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken
On Wed, Jun 19, 2019 at 12:42 PM Jason Gunthorpe wrote: > > On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote: > > On 6/13/19 5:43 PM, Ira Weiny wrote: > > > On Thu, Jun 13, 2019 at 07:58:29PM +, Jason Gunthorpe wrote: > > >> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote: > > >>> > > ... > > >> Hum, so the only thing this config does is short circuit here: > > >> > > >> static inline bool is_device_public_page(const struct page *page) > > >> { > > >> return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && > > >> IS_ENABLED(CONFIG_DEVICE_PUBLIC) && > > >> is_zone_device_page(page) && > > >> page->pgmap->type == MEMORY_DEVICE_PUBLIC; > > >> } > > >> > > >> Which is called all over the place.. > > > > > > yes but the earlier patch: > > > > > > [PATCH 03/22] mm: remove hmm_devmem_add_resource > > > > > > Removes the only place type is set to MEMORY_DEVICE_PUBLIC. > > > > > > So I think it is ok. Frankly I was wondering if we should remove the > > > public > > > type altogether but conceptually it seems ok. But I don't see any users > > > of it > > > so... should we get rid of it in the code rather than turning the config > > > off? > > > > > > Ira > > > > That seems reasonable. I recall that the hope was for those IBM Power 9 > > systems to use _PUBLIC, as they have hardware-based coherent device (GPU) > > memory, and so the memory really is visible to the CPU. And the IBM team > > was thinking of taking advantage of it. But I haven't seen anything on > > that front for a while. > > Does anyone know who those people are and can we encourage them to > send some patches? :) I expect marking it CONFIG_BROKEN with the threat of deleting it if no patches show up *is* the encouragement. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 05/22] mm: export alloc_pages_vma
On Thu, Jun 20, 2019 at 12:17 PM Michal Hocko wrote: > > On Thu 13-06-19 11:43:08, Christoph Hellwig wrote: > > noveau is currently using this through an odd hmm wrapper, and I plan > > to switch it to the real thing later in this series. > > > > Signed-off-by: Christoph Hellwig > > --- > > mm/mempolicy.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 01600d80ae01..f9023b5fba37 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -2098,6 +2098,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct > > vm_area_struct *vma, > > out: > > return page; > > } > > +EXPORT_SYMBOL_GPL(alloc_pages_vma); > > All allocator exported symbols are EXPORT_SYMBOL, what is a reason to > have this one special? I asked for this simply because it was not exported historically. In general I want to establish explicit export-type criteria so the community can spend less time debating when to use EXPORT_SYMBOL_GPL [1]. The thought in this instance is that it is not historically exported to modules and it is safer from a maintenance perspective to start with GPL-only for new symbols in case we don't want to maintain that interface long-term for out-of-tree modules. Yes, we always reserve the right to remove / change interfaces regardless of the export type, but history has shown that external pressure to keep an interface stable (contrary to Documentation/process/stable-api-nonsense.rst) tends to be less for GPL-only exports. [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2018-September/005688.html ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 05/22] mm: export alloc_pages_vma
On Tue, Jun 25, 2019 at 8:01 AM Michal Hocko wrote: > > On Tue 25-06-19 09:23:17, Christoph Hellwig wrote: > > On Mon, Jun 24, 2019 at 11:24:48AM -0700, Dan Williams wrote: > > > I asked for this simply because it was not exported historically. In > > > general I want to establish explicit export-type criteria so the > > > community can spend less time debating when to use EXPORT_SYMBOL_GPL > > > [1]. > > > > > > The thought in this instance is that it is not historically exported > > > to modules and it is safer from a maintenance perspective to start > > > with GPL-only for new symbols in case we don't want to maintain that > > > interface long-term for out-of-tree modules. > > > > > > Yes, we always reserve the right to remove / change interfaces > > > regardless of the export type, but history has shown that external > > > pressure to keep an interface stable (contrary to > > > Documentation/process/stable-api-nonsense.rst) tends to be less for > > > GPL-only exports. > > > > Fully agreed. In the end the decision is with the MM maintainers, > > though, although I'd prefer to keep it as in this series. > > I am sorry but I am not really convinced by the above reasoning wrt. to > the allocator API and it has been a subject of many changes over time. I > do not remember a single case where we would be bending the allocator > API because of external modules and I am pretty sure we will push back > heavily if that was the case in the future. This seems to say that you have no direct experience of dealing with changing symbols that that a prominent out-of-tree module needs? GPU drivers and the core-mm are on a path to increase their cooperation on memory management mechanisms over time, and symbol export changes for out-of-tree GPU drivers have been a significant source of friction in the past. > So in this particular case I would go with consistency and export the > same way we do with other functions. Also we do not want people to > reinvent this API and screw that like we have seen in other cases when > external modules try reimplement core functionality themselves. Consistency is a weak argument when the cost to the upstream community is negligible. If the same functionality was available via another / already exported interface *that* would be an argument to maintain the existing export policy. "Consistency" in and of itself is not a precedent we can use more widely in default export-type decisions. Effectively I'm arguing EXPORT_SYMBOL_GPL by default with a later decision to drop the _GPL. Similar to how we are careful to mark sysfs interfaces in Documentation/ABI/ that we are not fully committed to maintaining over time, or are otherwise so new that there is not yet a good read on whether they can be made permanent. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 05/22] mm: export alloc_pages_vma
On Tue, Jun 25, 2019 at 12:01 PM Michal Hocko wrote: > > On Tue 25-06-19 11:03:53, Dan Williams wrote: > > On Tue, Jun 25, 2019 at 8:01 AM Michal Hocko wrote: > > > > > > On Tue 25-06-19 09:23:17, Christoph Hellwig wrote: > > > > On Mon, Jun 24, 2019 at 11:24:48AM -0700, Dan Williams wrote: > > > > > I asked for this simply because it was not exported historically. In > > > > > general I want to establish explicit export-type criteria so the > > > > > community can spend less time debating when to use EXPORT_SYMBOL_GPL > > > > > [1]. > > > > > > > > > > The thought in this instance is that it is not historically exported > > > > > to modules and it is safer from a maintenance perspective to start > > > > > with GPL-only for new symbols in case we don't want to maintain that > > > > > interface long-term for out-of-tree modules. > > > > > > > > > > Yes, we always reserve the right to remove / change interfaces > > > > > regardless of the export type, but history has shown that external > > > > > pressure to keep an interface stable (contrary to > > > > > Documentation/process/stable-api-nonsense.rst) tends to be less for > > > > > GPL-only exports. > > > > > > > > Fully agreed. In the end the decision is with the MM maintainers, > > > > though, although I'd prefer to keep it as in this series. > > > > > > I am sorry but I am not really convinced by the above reasoning wrt. to > > > the allocator API and it has been a subject of many changes over time. I > > > do not remember a single case where we would be bending the allocator > > > API because of external modules and I am pretty sure we will push back > > > heavily if that was the case in the future. > > > > This seems to say that you have no direct experience of dealing with > > changing symbols that that a prominent out-of-tree module needs? GPU > > drivers and the core-mm are on a path to increase their cooperation on > > memory management mechanisms over time, and symbol export changes for > > out-of-tree GPU drivers have been a significant source of friction in > > the past. > > I have an experience e.g. to rework semantic of some gfp flags and that is > something that users usualy get wrong and never heard that an out of > tree code would insist on an old semantic and pushing us to the corner. > > > > So in this particular case I would go with consistency and export the > > > same way we do with other functions. Also we do not want people to > > > reinvent this API and screw that like we have seen in other cases when > > > external modules try reimplement core functionality themselves. > > > > Consistency is a weak argument when the cost to the upstream community > > is negligible. If the same functionality was available via another / > > already exported interface *that* would be an argument to maintain the > > existing export policy. "Consistency" in and of itself is not a > > precedent we can use more widely in default export-type decisions. > > > > Effectively I'm arguing EXPORT_SYMBOL_GPL by default with a later > > decision to drop the _GPL. Similar to how we are careful to mark sysfs > > interfaces in Documentation/ABI/ that we are not fully committed to > > maintaining over time, or are otherwise so new that there is not yet a > > good read on whether they can be made permanent. 
> > Documentation/process/stable-api-nonsense.rst That document has failed to preclude symbol export fights in the past and there is a reasonable argument to try not to retract functionality that had been previously exported regardless of that document. > Really. If you want to play with GPL vs. EXPORT_SYMBOL else this is up > to you but I do not see any technical argument to make this particular > interface to the page allocator any different from all others that are > exported to modules. I'm failing to find any practical substance to your argument, but in the end I agree with Christoph, it's up to MM maintainers.
Re: [PATCH 04/25] mm: remove MEMORY_DEVICE_PUBLIC support
[ add Ira ] On Wed, Jun 26, 2019 at 5:27 AM Christoph Hellwig wrote: > > The code hasn't been used since it was added to the tree, and doesn't > appear to actually be usable. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Jason Gunthorpe > Acked-by: Michal Hocko [..] > diff --git a/mm/swap.c b/mm/swap.c > index 7ede3eddc12a..83107410d29f 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -740,17 +740,6 @@ void release_pages(struct page **pages, int nr) > if (is_huge_zero_page(page)) > continue; > > - /* Device public page can not be huge page */ > - if (is_device_public_page(page)) { > - if (locked_pgdat) { > - > spin_unlock_irqrestore(&locked_pgdat->lru_lock, > - flags); > - locked_pgdat = NULL; > - } > - put_devmap_managed_page(page); > - continue; > - } > - This collides with Ira's bug fix [1]. The MEMORY_DEVICE_FSDAX case needs this to be converted to be independent of "public" pages. Perhaps it should be pulled out of -mm and incorporated in this series. [1]: https://lore.kernel.org/lkml/20190605214922.17684-1-ira.we...@intel.com/ ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 05/22] mm: export alloc_pages_vma
On Tue, Jun 25, 2019 at 10:46 PM Michal Hocko wrote: > > On Tue 25-06-19 12:52:18, Dan Williams wrote: > [...] > > > Documentation/process/stable-api-nonsense.rst > > > > That document has failed to preclude symbol export fights in the past > > and there is a reasonable argument to try not to retract functionality > > that had been previously exported regardless of that document. > > Can you point me to any specific example where this would be the case > for the core kernel symbols please? The most recent example that comes to mind was the thrash around __kernel_fpu_{begin,end} [1]. I referenced that when debating _GPL symbol policy with Jérôme [2]. [1]: https://lore.kernel.org/lkml/20190522100959.ga15...@kroah.com/ [2]: https://lore.kernel.org/lkml/CAPcyv4gb+r==rikfxkvz7ggdnke62ybmz7xoa4ubbbyhnk9...@mail.gmail.com/
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe wrote: > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote: > > The functionality is identical to the one currently open coded in > > device-dax. > > > > Signed-off-by: Christoph Hellwig > > Reviewed-by: Ira Weiny > > --- > > drivers/dax/dax-private.h | 4 > > drivers/dax/device.c | 43 --- > > 2 files changed, 47 deletions(-) > > DanW: I think this series has reached enough review, did you want > to ack/test any further? > > This needs to land in hmm.git soon to make the merge window. I was awaiting a decision about resolving the collision with Ira's patch before testing the final result again [1]. You can go ahead and add my reviewed-by for the series, but my tested-by should be on the final state of the series. [1]: https://lore.kernel.org/lkml/CAPcyv4gTOf+EWzSGrFrh2GrTZt5HVR=e+xicukepiy57px8...@mail.gmail.com/ ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 10:02 AM Jason Gunthorpe wrote: > > On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote: > > On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe wrote: > > > > > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote: > > > > The functionality is identical to the one currently open coded in > > > > device-dax. > > > > > > > > Signed-off-by: Christoph Hellwig > > > > Reviewed-by: Ira Weiny > > > > drivers/dax/dax-private.h | 4 > > > > drivers/dax/device.c | 43 --- > > > > 2 files changed, 47 deletions(-) > > > > > > DanW: I think this series has reached enough review, did you want > > > to ack/test any further? > > > > > > This needs to land in hmm.git soon to make the merge window. > > > > I was awaiting a decision about resolving the collision with Ira's > > patch before testing the final result again [1]. You can go ahead and > > add my reviewed-by for the series, but my tested-by should be on the > > final state of the series. > > The conflict looks OK to me, I think we can let Andrew and Linus > resolve it. > Andrew's tree effectively always rebases since it's a quilt series. I'd recommend pulling Ira's patch out of -mm and applying it with the rest of hmm reworks. Any other git tree I'd agree with just doing the late conflict resolution, but I'm not clear on what's the best practice when conflicting with -mm. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 10:08 AM Dan Williams wrote: > > On Fri, Jun 28, 2019 at 10:02 AM Jason Gunthorpe wrote: > > > > On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote: > > > On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe wrote: > > > > > > > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote: > > > > > The functionality is identical to the one currently open coded in > > > > > device-dax. > > > > > > > > > > Signed-off-by: Christoph Hellwig > > > > > Reviewed-by: Ira Weiny > > > > > drivers/dax/dax-private.h | 4 > > > > > drivers/dax/device.c | 43 > > > > > --- > > > > > 2 files changed, 47 deletions(-) > > > > > > > > DanW: I think this series has reached enough review, did you want > > > > to ack/test any further? > > > > > > > > This needs to land in hmm.git soon to make the merge window. > > > > > > I was awaiting a decision about resolving the collision with Ira's > > > patch before testing the final result again [1]. You can go ahead and > > > add my reviewed-by for the series, but my tested-by should be on the > > > final state of the series. > > > > The conflict looks OK to me, I think we can let Andrew and Linus > > resolve it. > > > > Andrew's tree effectively always rebases since it's a quilt series. > I'd recommend pulling Ira's patch out of -mm and applying it with the > rest of hmm reworks. Any other git tree I'd agree with just doing the > late conflict resolution, but I'm not clear on what's the best > practice when conflicting with -mm. Regardless the patch is buggy. If you want to do the conflict resolution it should be because the DEVICE_PUBLIC removal effectively does the same fix otherwise we're knowingly leaving a broken point in the history. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 11:29 AM Jason Gunthorpe wrote: > > On Fri, Jun 28, 2019 at 10:10:12AM -0700, Dan Williams wrote: > > On Fri, Jun 28, 2019 at 10:08 AM Dan Williams > > wrote: > > > > > > On Fri, Jun 28, 2019 at 10:02 AM Jason Gunthorpe > > > wrote: > > > > > > > > On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote: > > > > > On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe > > > > > wrote: > > > > > > > > > > > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote: > > > > > > > The functionality is identical to the one currently open coded in > > > > > > > device-dax. > > > > > > > > > > > > > > Signed-off-by: Christoph Hellwig > > > > > > > Reviewed-by: Ira Weiny > > > > > > > drivers/dax/dax-private.h | 4 > > > > > > > drivers/dax/device.c | 43 > > > > > > > --- > > > > > > > 2 files changed, 47 deletions(-) > > > > > > > > > > > > DanW: I think this series has reached enough review, did you want > > > > > > to ack/test any further? > > > > > > > > > > > > This needs to land in hmm.git soon to make the merge window. > > > > > > > > > > I was awaiting a decision about resolving the collision with Ira's > > > > > patch before testing the final result again [1]. You can go ahead and > > > > > add my reviewed-by for the series, but my tested-by should be on the > > > > > final state of the series. > > > > > > > > The conflict looks OK to me, I think we can let Andrew and Linus > > > > resolve it. > > > > > > Andrew's tree effectively always rebases since it's a quilt series. > > > I'd recommend pulling Ira's patch out of -mm and applying it with the > > > rest of hmm reworks. Any other git tree I'd agree with just doing the > > > late conflict resolution, but I'm not clear on what's the best > > > practice when conflicting with -mm. > > What happens depends on timing as things arrive to Linus. I promised > to send hmm.git early, so I understand that Andrew will quilt rebase > his tree to Linus's and fix the conflict in Ira's patch before he > sends it. > > > Regardless the patch is buggy. If you want to do the conflict > > resolution it should be because the DEVICE_PUBLIC removal effectively > > does the same fix otherwise we're knowingly leaving a broken point in > > the history. > > I'm not sure I understand your concern, is there something wrong with > CH's series as it stands? hmm is a non-rebasing git tree, so as long > as the series is correct *when I apply it* there is no broken history. > > I assumed the conflict resolution for Ira's patch was to simply take > the deletion of the if block from CH's series - right? > > If we do need to take Ira's patch into hmm.git it will go after CH's > series (and Ira will have to rebase/repost it), so I think there is > nothing to do at this moment - unless you are saying there is a > problem with the series in CH's git tree? There is a problem with the series in CH's tree. It removes the ->page_free() callback from the release_pages() path because it goes too far and removes the put_devmap_managed_page() call. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 11:52 AM Christoph Hellwig wrote: > > On Fri, Jun 28, 2019 at 11:44:35AM -0700, Dan Williams wrote: > > There is a problem with the series in CH's tree. It removes the > > ->page_free() callback from the release_pages() path because it goes > > too far and removes the put_devmap_managed_page() call. > > release_pages only called put_devmap_managed_page for device public > pages. So I can't see how that is in any way a problem. It's a bug that the call to put_devmap_managed_page() was gated by MEMORY_DEVICE_PUBLIC in release_pages(). That path is also applicable to MEMORY_DEVICE_FSDAX because it needs to trigger the ->page_free() callback to wake up wait_on_var() via fsdax_pagefree(). So I guess you could argue that the MEMORY_DEVICE_PUBLIC removal patch left the original bug in place. In that sense we're no worse off, but since we know about the bug and the fix, and the patches have not been applied yet, why not fix it now?
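A sketch of the shape of that fix (Ira's patch linked earlier in the thread): key the release_pages() loop off devmap-managed pages generally rather than the public type, so fsdax pages also reach their ->page_free():

	/* inside the release_pages() loop */
	if (is_zone_device_page(page)) {
		if (locked_pgdat) {
			spin_unlock_irqrestore(&locked_pgdat->lru_lock,
					       flags);
			locked_pgdat = NULL;
		}
		/*
		 * fsdax and device-private pages hand the final put to
		 * ->page_free(); other ZONE_DEVICE pages fall through to
		 * the regular put_page_testzero() path.
		 */
		if (put_devmap_managed_page(page))
			continue;
	}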
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 12:02 PM Christoph Hellwig wrote: > > On Fri, Jun 28, 2019 at 11:59:19AM -0700, Dan Williams wrote: > > It's a bug that the call to put_devmap_managed_page() was gated by > > MEMORY_DEVICE_PUBLIC in release_pages(). That path is also applicable > > to MEMORY_DEVICE_FSDAX because it needs to trigger the ->page_free() > > callback to wake up wait_on_var() via fsdax_pagefree(). > > > > So I guess you could argue that the MEMORY_DEVICE_PUBLIC removal patch > > left the original bug in place. In that sense we're no worse off, but > > since we know about the bug and the fix, and the patches have not been > > applied yet, why not fix it now? > > The 'fix it now' option would simply be to apply Ira's original patch, but > given that we are at -rc6 is this really a good time? And if we don't > apply it now based on the quilt based -mm workflow it just seems a lot > easier to apply it after my series. Unless we want to include it in > the series, in which case I can do a quick rebase, we'd just need to > make sure Andrew pulls it from -mm. I believe -mm auto drops patches when they appear in the -next baseline. So it should "just work" to pull it into the series and send it along for -next inclusion.
Re: dev_pagemap related cleanups v4
On Tue, Jul 2, 2019 at 11:42 AM Jason Gunthorpe wrote: > > On Mon, Jul 01, 2019 at 10:25:17AM +0200, Christoph Hellwig wrote: > > And I've demonstrated that I can't send patch series.. While this > > has all the right patches, it also has the extra patches already > > in the hmm tree, and four extra patches I wanted to send once > > this series is merged. I'll give up for now, please use the git > > url for anything serious, as it contains the right thing. > > Okay, I sorted it all out and temporarily put it here: > > https://github.com/jgunthorpe/linux/commits/hmm > > Bit involved job: > - Took Ira's v4 patch into hmm.git and confirmed it matches what > Andrew has in linux-next after all the fixups > - Checked your github v4 and the v3 that hit the mailing list were > substantially similar (I never did get a clean v4) and largely > went with the github version > - Based CH's v4 series on -rc7 and put back the removal hunk in swap.c > so it compiles > - Merge'd CH's series to hmm.git and fixed all the conflicts with Ira > and Ralph's patches (such that swap.c remains unchanged) > - Added Dan's ack's and tested-by's Looks good. Test merge (with some collisions, see below) also passes my test suite. > > I think this fairly closely follows what was posted to the mailing > list. > > As it was more than a simple 'git am', I'll let it sit on github until > I hear OK's then I'll move it to kernel.org's hmm.git and it will hit > linux-next. 0-day should also run on this whole thing from my github. > > What I know is outstanding: > - The conflicting ARM patches, I understand Andrew will handle these >post-linux-next > - The conflict with AMD GPU in -next, I am waiting to hear from AMD Just a heads up that this also collides with the "sub-section" patches in Andrew's tree. The resolution is straightforward, mostly just colliding updates to arch_{add,remove}_memory() call sites in kernel/memremap.c and collisions with pgmap_altmap() usage. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v5 13/29] compat_ioctl: move more drivers to compat_ptr_ioctl
On Tue, Jul 30, 2019 at 12:59 PM Arnd Bergmann wrote: > > The .ioctl and .compat_ioctl file operations have the same prototype so > they can both point to the same function, which works great almost all > the time when all the commands are compatible. > > One exception is the s390 architecture, where a compat pointer is only > 31 bit wide, and converting it into a 64-bit pointer requires calling > compat_ptr(). Most drivers here will never run in s390, but since we now > have a generic helper for it, it's easy enough to use it consistently. > > I double-checked all these drivers to ensure that all ioctl arguments > are used as pointers or are ignored, but are not interpreted as integer > values. > > Acked-by: Jason Gunthorpe > Acked-by: Daniel Vetter > Acked-by: Mauro Carvalho Chehab > Acked-by: Greg Kroah-Hartman > Acked-by: David Sterba > Acked-by: Darren Hart (VMware) > Acked-by: Jonathan Cameron > Acked-by: Bjorn Andersson > Signed-off-by: Arnd Bergmann > --- > drivers/nvdimm/bus.c| 4 ++-- [..] > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c > index 798c5c4aea9c..6ca142d833ab 100644 > --- a/drivers/nvdimm/bus.c > +++ b/drivers/nvdimm/bus.c > @@ -1229,7 +1229,7 @@ static const struct file_operations nvdimm_bus_fops = { > .owner = THIS_MODULE, > .open = nd_open, > .unlocked_ioctl = bus_ioctl, > - .compat_ioctl = bus_ioctl, > + .compat_ioctl = compat_ptr_ioctl, > .llseek = noop_llseek, > }; > > @@ -1237,7 +1237,7 @@ static const struct file_operations nvdimm_fops = { > .owner = THIS_MODULE, > .open = nd_open, > .unlocked_ioctl = dimm_ioctl, > - .compat_ioctl = dimm_ioctl, > + .compat_ioctl = compat_ptr_ioctl, > .llseek = noop_llseek, > }; Acked-by: Dan Williams ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
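For reference, the generic helper these file_operations now point at is essentially a one-call wrapper; a sketch of the fs/ioctl.c implementation this series adds:

long compat_ptr_ioctl(struct file *file, unsigned int cmd,
		      unsigned long arg)
{
	if (!file->f_op->unlocked_ioctl)
		return -ENOIOCTLCMD;

	/* compat_ptr() correctly widens the 31-bit s390 compat pointer */
	return file->f_op->unlocked_ioctl(file, cmd,
					  (unsigned long)compat_ptr(arg));
}
EXPORT_SYMBOL(compat_ptr_ioctl);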
Re: [PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove > the superfluous extra check. Looks good to me. Reviewed-by: Dan Williams
Re: [PATCH 2/8] mm: remove the __KERNEL__ guard from <linux/mm.h>
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > __KERNEL__ ifdefs don't make sense outside of include/uapi/. Yes. Reviewed-by: Dan Williams
Re: [PATCH 4/8] mm: move free_devmap_managed_page to memremap.c
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > free_devmap_managed_page has nothing to do with the code in swap.c, > move it to live with the rest of the code for devmap handling. > Looks good. Reviewed-by: Dan Williams
Re: [PATCH 5/8] mm: simplify freeing of devmap managed pages
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > Make put_devmap_managed_page return whether it took charge of the page, > and remove the separate page_is_devmap_managed helper. Looks good to me: Reviewed-by: Dan Williams
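The resulting calling convention, roughly (a sketch of the post-series put_page() path, not quoted from the patch):

static inline void put_page(struct page *page)
{
	struct folio *folio = page_folio(page);

	/* devmap-managed pages may take over handling of the final put */
	if (put_devmap_managed_page(&folio->page))
		return;
	folio_put(folio);
}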
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > Move the check for the actual pgmap types that need the free at refcount > one behavior into the out of line helper, and thus avoid the need to > pull memremap.h into mm.h. Looks good to me assuming the compile bots agree. Reviewed-by: Dan Williams > > Signed-off-by: Christoph Hellwig > --- > arch/arm64/mm/mmu.c| 1 + > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + > drivers/gpu/drm/drm_cache.c| 2 +- > drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + > drivers/gpu/drm/nouveau/nouveau_svm.c | 1 + > drivers/infiniband/core/rw.c | 1 + > drivers/nvdimm/pmem.h | 1 + > drivers/nvme/host/pci.c| 1 + > drivers/nvme/target/io-cmd-bdev.c | 1 + > fs/fuse/virtio_fs.c| 1 + > include/linux/memremap.h | 18 ++ > include/linux/mm.h | 20 > lib/test_hmm.c | 1 + > mm/memremap.c | 6 +- > 14 files changed, 34 insertions(+), 22 deletions(-) > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index acfae9b41cc8c9..580abae6c0b93f 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > index ea68f3b3a4e9cb..6d643b4b791d87 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > @@ -25,6 +25,7 @@ > > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c > index f19d9acbe95936..50b8a088f763a6 100644 > --- a/drivers/gpu/drm/drm_cache.c > +++ b/drivers/gpu/drm/drm_cache.c > @@ -27,11 +27,11 @@ > /* > * Authors: Thomas Hellström > */ > - > #include > #include > #include > #include > +#include > #include > > #include > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c > b/drivers/gpu/drm/nouveau/nouveau_dmem.c > index e886a3b9e08c7d..a5cdfbe32b5e54 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c > @@ -39,6 +39,7 @@ > > #include > #include > +#include > #include > > /* > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c > b/drivers/gpu/drm/nouveau/nouveau_svm.c > index 266809e511e2c1..090b9b47708cca 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c > @@ -35,6 +35,7 @@ > #include > #include > #include > +#include > #include > > struct nouveau_svm { > diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c > index 5a3bd41b331c93..4d98f931a13ddd 100644 > --- a/drivers/infiniband/core/rw.c > +++ b/drivers/infiniband/core/rw.c > @@ -2,6 +2,7 @@ > /* > * Copyright (c) 2016 HGST, a Western Digital Company. 
> */ > +#include > #include > #include > #include > diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h > index 59cfe13ea8a85c..1f51a23614299b 100644 > --- a/drivers/nvdimm/pmem.h > +++ b/drivers/nvdimm/pmem.h > @@ -3,6 +3,7 @@ > #define __NVDIMM_PMEM_H__ > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 6a99ed68091589..ab15bc72710dbe 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/nvme/target/io-cmd-bdev.c > b/drivers/nvme/target/io-cmd-bdev.c > index 70ca9dfc1771a9..a141446db1bea3 100644 > --- a/drivers/nvme/target/io-cmd-bdev.c > +++ b/drivers/nvme/target/io-cmd-bdev.c > @@ -6,6 +6,7 @@ > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > #include > #include > +#include > #include > #include "nvmet.h" > > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > index 9d737904d07c0b..86b7dbb6a0d43e 100644 > --- a/fs/fuse/virtio_fs.c > +++ b/fs/fuse/virtio_fs.c > @@ -8,6 +8,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 1fafcc38acbad6..514ab46f597e5c 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -1,6 +1,8
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Mon, Feb 7, 2022 at 3:49 PM Dan Williams wrote: > > On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: > > > > Move the check for the actual pgmap types that need the free at refcount > > one behavior into the out of line helper, and thus avoid the need to > > pull memremap.h into mm.h. > > Looks good to me assuming the compile bots agree. > > Reviewed-by: Dan Williams Yeah, same as Logan: mm/memcontrol.c: In function ‘get_mctgt_type’: mm/memcontrol.c:5724:29: error: implicit declaration of function ‘is_device_private_page’; did you mean ‘is_device_private_entry’? [-Werror=implicit-function-declaration] 5724 | if (is_device_private_page(page)) | ^~ | is_device_private_entry ...needs: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d1e97a54ae53..0ac7515c85f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -62,6 +62,7 @@ #include #include #include +#include <linux/memremap.h> #include "internal.h" #include #include
Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig wrote: [..] > @@ -500,28 +482,27 @@ void free_devmap_managed_page(struct page *page) > */ > page->mapping = NULL; > page->pgmap->ops->page_free(page); > + > + /* > +* Reset the page count to 1 to prepare for handing out the page > again. > +*/ > + set_page_count(page, 1); Interesting. I had expected that, to really fix the refcount problem, fs/dax.c would need to start taking real page references as pages were added to a mapping, just like the page cache. This looks ok to me, and passes my tests. So given I'm still working my way back to fixing the references properly I'm ok for this hack to replace the more broken hack that is there presently. Reviewed-by: Dan Williams
RE: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members
Gustavo A. R. Silva wrote: > There is a regular need in the kernel to provide a way to declare > having a dynamically sized set of trailing elements in a structure. > Kernel code should always use “flexible array members”[1] for these > cases. The older style of one-element or zero-length arrays should > no longer be used[2]. > > This code was transformed with the help of Coccinelle: > (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file > script.cocci --include-headers --dir . > output.patch) > > @@ > identifier S, member, array; > type T1, T2; > @@ > > struct S { > ... > T1 member; > T2 array[ > - 0 > ]; > }; > > -fstrict-flex-arrays=3 is coming and we need to land these changes > to prevent issues like these in the short future: > > ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination > buffer has size 0, > but the source string has length 2 (including NUL byte) [-Wfortify-source] > strcpy(de3->name, "."); > ^ > > Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If > this breaks anything, we can use a union with a new member name. > > [1] https://en.wikipedia.org/wiki/Flexible_array_member > [2] > https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays > > Link: https://github.com/KSPP/linux/issues/78 > Build-tested-by: > https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/ > Signed-off-by: Gustavo A. R. Silva > --- > Hi all! > > JFYI: I'm adding this to my -next tree. :) > [..] > include/uapi/linux/ndctl.h| 10 +-- For ndctl.h Acked-by: Dan Williams
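An illustrative before/after of the transformation (the struct and field names here are hypothetical):

/* before: GNU zero-length trailing array */
struct example_cmd_old {
	__u32 size;
	__u8 payload[0];
};

/* after: C99 flexible array member, identical layout */
struct example_cmd {
	__u32 size;
	__u8 payload[];
};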
Re: [PATCH v2 4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf
Christoph Hellwig wrote: > On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote: > > Yes, you said that, and I said that when the AMD driver first merged > > it - but it went in anyhow and now people are using it in a bunch of > > places. > > drm folks made up their own weird rules, if they internally stick > to it they have to listen to it given that they ignore review comments, > but it violates the scatterlist API and has no business anywhere > else in the kernel. And yes, there probably is a reason or two why > the drm code is unusually error prone. > > > > Why would small BARs be problematic for the pages? The pages are more > > > a problem for gigantic BARs due to the memory overhead. > > > > How do I get a struct page * for a 4k BAR in vfio? > > I guess we have different definitions of small then :) > > But unless my understanding of the code is out of date, > memremap_pages just requires the (virtual) start address to be 2MB > aligned, not the size. Adding Dan for comments. The minimum granularity for sparse_add_section() that memremap_pages uses internally is 2MB, so start and end need to be 2MB aligned. Details here: ba72b4c8cf60 mm/sparsemem: support sub-section hotplug
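Put differently, since commit ba72b4c8cf60 the constraint is sub-section alignment of both ends of the range; a sketch of the check, assuming the generic SUBSECTION_SHIFT of 21 (2MB) and an illustrative helper name:

#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(1UL << SUBSECTION_SHIFT)

/* both the start and the end of the mapped range must be 2MB aligned */
static bool pagemap_range_is_valid(u64 start, u64 size)
{
	return IS_ALIGNED(start, SUBSECTION_SIZE) &&
	       IS_ALIGNED(start + size, SUBSECTION_SIZE);
}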
Re: [PATCH 2/7] mm: Free device private pages have zero refcount
Alistair Popple wrote: > > Jason Gunthorpe writes: > > > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote: > >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page > >> refcount") device private pages have no longer had an extra reference > >> count when the page is in use. However before handing them back to the > >> owning device driver we add an extra reference count such that free > >> pages have a reference count of one. > >> > >> This makes it difficult to tell if a page is free or not because both > >> free and in use pages will have a non-zero refcount. Instead we should > >> return pages to the drivers page allocator with a zero reference count. > >> Kernel code can then safely use kernel functions such as > >> get_page_unless_zero(). > >> > >> Signed-off-by: Alistair Popple > >> --- > >> arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + > >> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + > >> drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + > >> lib/test_hmm.c | 1 + > >> mm/memremap.c| 5 - > >> mm/page_alloc.c | 6 ++ > >> 6 files changed, 10 insertions(+), 5 deletions(-) > > > > I think this is a great idea, but I'm surprised no dax stuff is > > touched here? > > free_zone_device_page() shouldn't be called for pgmap->type == > MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX > there. Except that the folio code looks like it might have introduced a > bug. AFAICT put_page() always calls > put_devmap_managed_page(&folio->page) but folio_put() does not (although > folios_put() does!). So it seems folio_put() won't end up calling > __put_devmap_managed_page_refs() as I think it should. > > I think you're right about the change to __init_zone_device_page() - I > should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to > look at Dan's patch series more closely as I suspect it might be better > to rebase this patch on top of that. Apologies for the delay I was travelling the past few days. Yes, I think this patch slots in nicely to avoid the introduction of an init_mode [1]: https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/ Mind if I steal it into my series?
Re: [PATCH 2/7] mm: Free device private pages have zero refcount
Alistair Popple wrote: > > Dan Williams writes: > > > Alistair Popple wrote: > >> > >> Jason Gunthorpe writes: > >> > >> > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote: > >> >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page > >> >> refcount") device private pages have no longer had an extra reference > >> >> count when the page is in use. However before handing them back to the > >> >> owning device driver we add an extra reference count such that free > >> >> pages have a reference count of one. > >> >> > >> >> This makes it difficult to tell if a page is free or not because both > >> >> free and in use pages will have a non-zero refcount. Instead we should > >> >> return pages to the drivers page allocator with a zero reference count. > >> >> Kernel code can then safely use kernel functions such as > >> >> get_page_unless_zero(). > >> >> > >> >> Signed-off-by: Alistair Popple > >> >> --- > >> >> arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + > >> >> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + > >> >> drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + > >> >> lib/test_hmm.c | 1 + > >> >> mm/memremap.c| 5 - > >> >> mm/page_alloc.c | 6 ++ > >> >> 6 files changed, 10 insertions(+), 5 deletions(-) > >> > > >> > I think this is a great idea, but I'm surprised no dax stuff is > >> > touched here? > >> > >> free_zone_device_page() shouldn't be called for pgmap->type == > >> MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX > >> there. Except that the folio code looks like it might have introduced a > >> bug. AFAICT put_page() always calls > >> put_devmap_managed_page(&folio->page) but folio_put() does not (although > >> folios_put() does!). So it seems folio_put() won't end up calling > >> __put_devmap_managed_page_refs() as I think it should. > >> > >> I think you're right about the change to __init_zone_device_page() - I > >> should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to > >> look at Dan's patch series more closely as I suspect it might be better > >> to rebase this patch on top of that. > > > > Apologies for the delay I was travelling the past few days. Yes, I think > > this patch slots in nicely to avoid the introduction of an init_mode > > [1]: > > > > https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/ > > > > Mind if I steal it into my series? > > No problem, although I notice Andrew has already merged it into > mm-unstable. If you end up rebasing your series on top of mine I think > all that's needed is a patch somewhere in your series to drop the > various `if (pgmap->type == MEMORY_DEVICE_*)` I added to (hopefully) > avoid breaking DAX. Assuming DAX takes a pagemap reference on struct > page allocation something like below. Yeah, I'll go that route and rebase on top of -mm. Thanks again.
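The gating Alistair describes might look like this inside __init_zone_device_page() (a sketch of the idea, not the merged hunk):

	/* Only device-private/coherent pages are handed to the driver
	 * at refcount zero; fs-dax keeps its existing lifecycle. */
	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
	    pgmap->type == MEMORY_DEVICE_COHERENT)
		set_page_count(page, 0);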
RE: [PATCH v2 16/28] resource: Introduce alloc_free_mem_region()
[ add dri-devel and nouveau ] Dan Williams wrote: > The core of devm_request_free_mem_region() is a helper that searches for > free space in iomem_resource and performs __request_region_locked() on > the result of that search. The policy choices of the implementation > conform to what CONFIG_DEVICE_PRIVATE users want, which is memory that is > immediately marked busy, and a preference to search for the first-fit > free range in descending order from the top of the physical address > space. > > CXL has a need for a similar allocator, but with the following tweaks: > > 1/ Search for free space in ascending order > > 2/ Search for free space relative to a given CXL window > > 3/ 'insert' rather than 'request' the new resource given downstream >drivers from the CXL Region driver (like the pmem or dax drivers) are >responsible for request_mem_region() when they activate the memory >range. > > Rework __request_free_mem_region() into get_free_mem_region() which > takes a set of GFR_* (Get Free Region) flags to control the allocation > policy (ascending vs descending), and "busy" policy (insert_resource() > vs request_region()). > > As part of the consolidation of the legacy GFR_REQUEST_REGION case with > the new default of just inserting a new resource into the free space > some minor cleanups like not checking for NULL before calling > devres_free() (which does its own check) are included. > > Suggested-by: Jason Gunthorpe > Link: https://lore.kernel.org/linux-cxl/20220420143406.gy2120...@nvidia.com/ > Cc: Matthew Wilcox > Cc: Christoph Hellwig Jason, Christoph, anyone else who depends on CONFIG_DEVICE_PRIVATE, Just a heads up that with Jonathan's review I am going to proceed with pushing this change to linux-next. Please holler if CONFIG_DEVICE_PRIVATE starts misbehaving, or if you have other feedback, and I will get it fixed up. 
> Signed-off-by: Dan Williams > --- > include/linux/ioport.h |2 + > kernel/resource.c | 178 > +++- > mm/Kconfig |5 + > 3 files changed, 150 insertions(+), 35 deletions(-) > > diff --git a/include/linux/ioport.h b/include/linux/ioport.h > index 79d1ad6d6275..616b683563a9 100644 > --- a/include/linux/ioport.h > +++ b/include/linux/ioport.h > @@ -330,6 +330,8 @@ struct resource *devm_request_free_mem_region(struct > device *dev, > struct resource *base, unsigned long size); > struct resource *request_free_mem_region(struct resource *base, > unsigned long size, const char *name); > +struct resource *alloc_free_mem_region(struct resource *base, > + unsigned long size, unsigned long align, const char *name); > > static inline void irqresource_disabled(struct resource *res, u32 irq) > { > diff --git a/kernel/resource.c b/kernel/resource.c > index 53a534db350e..4c5e80b92f2f 100644 > --- a/kernel/resource.c > +++ b/kernel/resource.c > @@ -489,8 +489,9 @@ int __weak page_is_ram(unsigned long pfn) > } > EXPORT_SYMBOL_GPL(page_is_ram); > > -static int __region_intersects(resource_size_t start, size_t size, > - unsigned long flags, unsigned long desc) > +static int __region_intersects(struct resource *parent, resource_size_t > start, > +size_t size, unsigned long flags, > +unsigned long desc) > { > struct resource res; > int type = 0; int other = 0; > @@ -499,7 +500,7 @@ static int __region_intersects(resource_size_t start, > size_t size, > res.start = start; > res.end = start + size - 1; > > - for (p = iomem_resource.child; p ; p = p->sibling) { > + for (p = parent->child; p ; p = p->sibling) { > bool is_type = (((p->flags & flags) == flags) && > ((desc == IORES_DESC_NONE) || >(desc == p->desc))); > @@ -543,7 +544,7 @@ int region_intersects(resource_size_t start, size_t size, > unsigned long flags, > int ret; > > read_lock(&resource_lock); > - ret = __region_intersects(start, size, flags, desc); > + ret = __region_intersects(&iomem_resource, start, size, flags, desc); > read_unlock(&resource_lock); > > return ret; > @@ -1780,62 +1781,139 @@ void resource_list_free(struct list_head *head) > } > EXPORT_SYMBOL(resource_list_free); > > -#ifdef CONFIG_DEVICE_PRIVATE > -static struct resource *__request_free_mem_region(struct device *dev, > - struct resource *base, unsigned long size, const char *name) > +#ifdef CONFIG_GET_FREE_REGION > +#defi
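A usage sketch of the new helper from the CXL region side (cxl_window, region_size, and the SZ_256M alignment are illustrative, not from the patch):

	struct resource *res;

	/* First-fit, ascending search inside a CXL window; the result is
	 * only inserted, not marked busy, so the downstream pmem/dax
	 * driver can request_mem_region() it when it attaches. */
	res = alloc_free_mem_region(cxl_window, region_size, SZ_256M,
				    "region0");
	if (IS_ERR(res))
		return PTR_ERR(res);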
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Mon, Nov 11, 2019 at 4:07 PM John Hubbard wrote: > > As it says in the updated comment in gup.c: current FOLL_LONGTERM > behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the > FS DAX check requirement on vmas. > > However, the corresponding restriction in get_user_pages_remote() was > slightly stricter than is actually required: it forbade all > FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers > that do not set the "locked" arg. > > Update the code and comments accordingly, and update the VFIO caller > to take advantage of this, fixing a bug as a result: the VFIO caller > is logically a FOLL_LONGTERM user. > > Thanks to Jason Gunthorpe for pointing out a clean way to fix this. > > Suggested-by: Jason Gunthorpe > Cc: Jerome Glisse > Cc: Ira Weiny > Signed-off-by: John Hubbard > --- > drivers/vfio/vfio_iommu_type1.c | 30 +- > mm/gup.c| 13 - > 2 files changed, 21 insertions(+), 22 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index d864277ea16f..017689b7c32b 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned > long vaddr, > flags |= FOLL_WRITE; > > down_read(&mm->mmap_sem); > - if (mm == current->mm) { > - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, > -vmas); > - } else { > - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, > - vmas, NULL); > - /* > -* The lifetime of a vaddr_get_pfn() page pin is > -* userspace-controlled. In the fs-dax case this could > -* lead to indefinite stalls in filesystem operations. > -* Disallow attempts to pin fs-dax pages via this > -* interface. > -*/ > - if (ret > 0 && vma_is_fsdax(vmas[0])) { > - ret = -EOPNOTSUPP; > - put_page(page[0]); > - } > + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM, > + page, vmas, NULL); Hmm, what's the point of passing FOLL_LONGTERM to get_user_pages_remote() if get_user_pages_remote() is not going to check the vma? I think we got to this code state because the get_user_pages() vs get_user_pages_remote() split predated the introduction of FOLL_LONGTERM. I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) && vma_is_fsdax()) check and that would also remove the need for __gup_longterm_locked.
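A sketch of the check_vma_flags() idea (illustrative, not an actual patch): the fsdax restriction becomes part of the common VMA validation, so every gup variant inherits it and callers like vfio stop open-coding it:

static bool vma_forbids_longterm_pin(struct vm_area_struct *vma,
				     unsigned int gup_flags)
{
	/* Long-term pins of fs-dax pages can stall filesystem
	 * operations indefinitely, so refuse them centrally instead
	 * of in each caller. */
	return (gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_GET) &&
	       vma_is_fsdax(vma);
}

check_vma_flags() would then return -EOPNOTSUPP when this fires.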
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Tue, Nov 12, 2019 at 2:24 PM John Hubbard wrote: > > On 11/12/19 1:57 PM, Dan Williams wrote: > ... > >> diff --git a/drivers/vfio/vfio_iommu_type1.c > >> b/drivers/vfio/vfio_iommu_type1.c > >> index d864277ea16f..017689b7c32b 100644 > >> --- a/drivers/vfio/vfio_iommu_type1.c > >> +++ b/drivers/vfio/vfio_iommu_type1.c > >> @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, > >> unsigned long vaddr, > >> flags |= FOLL_WRITE; > >> > >> down_read(&mm->mmap_sem); > >> - if (mm == current->mm) { > >> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, > >> -vmas); > >> - } else { > >> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, > >> page, > >> - vmas, NULL); > >> - /* > >> -* The lifetime of a vaddr_get_pfn() page pin is > >> -* userspace-controlled. In the fs-dax case this could > >> -* lead to indefinite stalls in filesystem operations. > >> -* Disallow attempts to pin fs-dax pages via this > >> -* interface. > >> -*/ > >> - if (ret > 0 && vma_is_fsdax(vmas[0])) { > >> - ret = -EOPNOTSUPP; > >> - put_page(page[0]); > >> - } > >> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | > >> FOLL_LONGTERM, > >> + page, vmas, NULL); > > > > Hmm, what's the point of passing FOLL_LONGTERM to > > get_user_pages_remote() if get_user_pages_remote() is not going to > > check the vma? I think we got to this code state because the > > FOLL_LONGTERM is short-lived in this location, because patch 23 > ("mm/gup: remove support for gup(FOLL_LONGTERM)") removes it, after > callers are changed over to pin_longterm_pages*(). > > So FOLL_LONGTERM is not doing much now, but it is basically a marker for > "change gup(FOLL_LONGTERM) to pin_longterm_pages()", and patch 18 > actually makes that change. > > And then pin_longterm_pages*() is, in turn, a way to mark all the > places that need file system and/or user space interactions (layout > leases, etc), as per "Case 2: RDMA" in the new > Documentation/vm/pin_user_pages.rst. Ah, sorry. This was the first time I had looked at this series and jumped in without reading the background. Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM warning in get_user_pages_remote in another patch? > > > get_user_pages() vs get_user_pages_remote() split predated the > > introduction of FOLL_LONGTERM. > > Yes. And I do want to clean this up as I go, so we don't end up with > stale concepts lingering in gup.c... > > > > > I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) && > > vma_is_fsdax()) check and that would also remove the need for > > __gup_longterm_locked. > > > > Good idea, but there is still the call to check_and_migrate_cma_pages(), > inside __gup_longterm_locked(). So it's a little more involved and > we can't trivially delete __gup_longterm_locked() yet, right? [ add Aneesh ] Yes, you're right. I had overlooked that it had snuck in there. That to me similarly needs to be pushed down into the core with its own FOLL flag, or it needs to be an explicit fixup that each caller does after get_user_pages. The fact that migration silently happens as a side effect of gup is too magical for my taste.
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Tue, Nov 12, 2019 at 2:43 PM John Hubbard wrote: > > On 11/12/19 12:43 PM, Jason Gunthorpe wrote: > ... > >> -} > >> +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM, > >> +page, vmas, NULL); > >> +/* > >> + * The lifetime of a vaddr_get_pfn() page pin is > >> + * userspace-controlled. In the fs-dax case this could > >> + * lead to indefinite stalls in filesystem operations. > >> + * Disallow attempts to pin fs-dax pages via this > >> + * interface. > >> + */ > >> +if (ret > 0 && vma_is_fsdax(vmas[0])) { > >> +ret = -EOPNOTSUPP; > >> +put_page(page[0]); > >> } > > > > AFAIK this chunk is redundant now as it is some hack to emulate > > FOLL_LONGTERM? So vmas can be deleted too. > > Let me first make sure I understand what Dan has in mind for the vma > checking, in the other thread... It's not redundant relative to upstream which does not do anything with FOLL_LONGTERM in the gup-slow path... but I have not looked at patches 1-7 to see if something there made it redundant.
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Tue, Nov 12, 2019 at 3:08 PM John Hubbard wrote: > > On 11/12/19 2:43 PM, Dan Williams wrote: > ... > > Ah, sorry. This was the first time I had looked at this series and > > jumped in without reading the background. > > > > Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM > > warning in get_user_pages_remote in another patch? > > > > Actually, I haven't gone quite that far. This patch is the last > change to that function. Therefore, at the end of this patchset, > get_user_pages_remote() ends up with this check in it which > is a less-restrictive version of the warning: > > /* > * Current FOLL_LONGTERM behavior is incompatible with > * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on > * vmas. However, this only comes up if locked is set, and there are > * callers that do request FOLL_LONGTERM, but do not set locked. So, > * allow what we can. > */ > if (gup_flags & FOLL_LONGTERM) { > if (WARN_ON_ONCE(locked)) > return -EINVAL; > } > > Is that OK, or did you want to go further (possibly in a follow-up > patchset, as I'm hoping to get this one in soon)? That looks ok. Something to maybe push down into the core in a future cleanup, but not something that needs to be done now. > ... > >>> I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) && > >>> vma_is_fsdax()) check and that would also remove the need for > >>> __gup_longterm_locked. > >>> > >> > >> Good idea, but there is still the call to check_and_migrate_cma_pages(), > >> inside __gup_longterm_locked(). So it's a little more involved and > >> we can't trivially delete __gup_longterm_locked() yet, right? > > > > [ add Aneesh ] > > > > Yes, you're right. I had overlooked that it had snuck in there. That to > > me similarly needs to be pushed down into the core with its own FOLL > > flag, or it needs to be an explicit fixup that each caller does after > > get_user_pages. The fact that migration silently happens as a side > > effect of gup is too magical for my taste. > > > > Yes. It's an intrusive side effect that is surprising, and not in a > "happy surprise" way. :) Fixing up the CMA pages by splitting that > functionality into separate function calls sounds like an improvement > worth exploring. Right, future work.
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Tue, Nov 12, 2019 at 3:43 PM Jason Gunthorpe wrote: > > On Tue, Nov 12, 2019 at 02:45:51PM -0800, Dan Williams wrote: > > On Tue, Nov 12, 2019 at 2:43 PM John Hubbard wrote: > > > > > > On 11/12/19 12:43 PM, Jason Gunthorpe wrote: > > > ... > > > >> -} > > > >> +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | > > > >> FOLL_LONGTERM, > > > >> +page, vmas, NULL); > > > >> +/* > > > >> + * The lifetime of a vaddr_get_pfn() page pin is > > > >> + * userspace-controlled. In the fs-dax case this could > > > >> + * lead to indefinite stalls in filesystem operations. > > > >> + * Disallow attempts to pin fs-dax pages via this > > > >> + * interface. > > > >> + */ > > > >> +if (ret > 0 && vma_is_fsdax(vmas[0])) { > > > >> +ret = -EOPNOTSUPP; > > > >> +put_page(page[0]); > > > >> } > > > > > > > > AFAIK this chunk is redundant now as it is some hack to emulate > > > > FOLL_LONGTERM? So vmas can be deleted too. > > > > > > Let me first make sure I understand what Dan has in mind for the vma > > > checking, in the other thread... > > > > It's not redundant relative to upstream which does not do anything with > > FOLL_LONGTERM in the gup-slow path... but I have not looked at patches > > 1-7 to see if something there made it redundant. > > Oh, the hunk John had below for get_user_pages_remote() also needs to > call __gup_longterm_locked() when FOLL_LONGTERM is specified, then > that calls check_dax_vmas() which duplicates the vma_is_fsdax() check > above. Oh true, good eye. It is redundant if it does additionally call __gup_longterm_locked(), and it needs to do that otherwise it undoes the CMA migration magic that Aneesh added. > > Certainly no caller of FOLL_LONGTERM should have to do dax specific > VMA checking. Agree, that was my comment about cleaning up the vma_is_fsdax() check to be internal to the gup core.
Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
On Tue, Nov 12, 2019 at 5:08 PM John Hubbard wrote: > > On 11/12/19 4:58 PM, Dan Williams wrote: > ... > >>> It's not redundant relative to upstream which does not do anything with > >>> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches > >>> 1-7 to see if something there made it redundant. > >> > >> Oh, the hunk John had below for get_user_pages_remote() also needs to > >> call __gup_longterm_locked() when FOLL_LONGTERM is specified, then > >> that calls check_dax_vmas() which duplicates the vma_is_fsdax() check > >> above. > > > > Oh true, good eye. It is redundant if it does additionally call > > __gup_longterm_locked(), and it needs to do that otherwise it undoes > > the CMA migration magic that Aneesh added. > > > > OK. So just to be clear, I'll be removing this from the patch: > > /* > * The lifetime of a vaddr_get_pfn() page pin is > * userspace-controlled. In the fs-dax case this could > * lead to indefinite stalls in filesystem operations. > * Disallow attempts to pin fs-dax pages via this > * interface. > */ > if (ret > 0 && vma_is_fsdax(vmas[0])) { > ret = -EOPNOTSUPP; > put_page(page[0]); > } > > (and the declaration of "vmas", as well). ...and add a call to __gup_longterm_locked() internal to get_user_pages_remote(), right?
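A rough sketch of where the thread lands, against the gup internals of that era (the exact flag plumbing is illustrative): get_user_pages_remote() routes FOLL_LONGTERM through the longterm-aware slow path, so the fsdax VMA check and the CMA migration fixup both happen in the core:

long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   unsigned int gup_flags, struct page **pages,
			   struct vm_area_struct **vmas, int *locked)
{
	/*
	 * FOLL_LONGTERM is incompatible with the FAULT_FLAG_ALLOW_RETRY
	 * ("locked") protocol, so reject that combination up front.
	 */
	if (gup_flags & FOLL_LONGTERM) {
		if (WARN_ON_ONCE(locked))
			return -EINVAL;
		return __gup_longterm_locked(tsk, mm, start, nr_pages,
					     pages, vmas,
					     gup_flags | FOLL_TOUCH |
					     FOLL_REMOTE);
	}
	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
				       vmas, locked,
				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
}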
Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN
On Wed, Nov 13, 2019 at 2:43 AM Jan Kara wrote: > > On Tue 12-11-19 20:26:56, John Hubbard wrote: > > Introduce pin_user_pages*() variations of get_user_pages*() calls, > > and also pin_longterm_pages*() variations. > > > > These variants all set FOLL_PIN, which is also introduced, and > > thoroughly documented. > > > > The pin_longterm*() variants also set FOLL_LONGTERM, in addition > > to FOLL_PIN: > > > > pin_user_pages() > > pin_user_pages_remote() > > pin_user_pages_fast() > > > > pin_longterm_pages() > > pin_longterm_pages_remote() > > pin_longterm_pages_fast() > > > > All pages that are pinned via the above calls, must be unpinned via > > put_user_page(). > > > > The underlying rules are: > > > > * These are gup-internal flags, so the call sites should not directly > > set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with > > assertions, for the new FOLL_PIN flag. However, for the pre-existing > > FOLL_LONGTERM flag, which has some call sites that still directly > > set FOLL_LONGTERM, there is no assertion yet. > > > > * Call sites that want to indicate that they are going to do DirectIO > > ("DIO") or something with similar characteristics, should call a > > get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers > > will: > > * Start with "pin_user_pages" instead of "get_user_pages". That > > makes it easy to find and audit the call sites. > > * Set FOLL_PIN > > > > * For pages that are received via FOLL_PIN, those pages must be returned > > via put_user_page(). > > > > Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases > > in this documentation. (I've reworded it and expanded upon it.) > > > > Reviewed-by: Mike Rapoport # Documentation > > Reviewed-by: Jérôme Glisse > > Cc: Jonathan Corbet > > Cc: Ira Weiny > > Signed-off-by: John Hubbard > > Thanks for the documentation. It looks great! > > > diff --git a/mm/gup.c b/mm/gup.c > > index 83702b2e86c8..4409e84dff51 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -201,6 +201,10 @@ static struct page *follow_page_pte(struct > > vm_area_struct *vma, > > spinlock_t *ptl; > > pte_t *ptep, pte; > > > > + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ > > + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == > > + (FOLL_PIN | FOLL_GET))) > > + return ERR_PTR(-EINVAL); > > retry: > > if (unlikely(pmd_bad(*pmd))) > > return no_page_table(vma, flags); > > How does FOLL_PIN result in grabbing (at least normal, for now) page > reference? > I didn't find that anywhere in this patch but it is a prerequisite to > converting any user to pin_user_pages() interface, right? > > > +/** > > + * pin_user_pages_fast() - pin user pages in memory without taking locks > > + * > > + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. > > See > > + * get_user_pages_fast() for documentation on the function arguments, > > because > > + * the arguments here are identical. > > + * > > + * FOLL_PIN means that the pages must be released via put_user_page(). > > Please > > + * see Documentation/vm/pin_user_pages.rst for further details. > > + * > > + * This is intended for Case 1 (DIO) in > > Documentation/vm/pin_user_pages.rst. It > > + * is NOT intended for Case 2 (RDMA: long-term pins). > > + */ > > +int pin_user_pages_fast(unsigned long start, int nr_pages, > > + unsigned int gup_flags, struct page **pages) > > +{ > > + /* FOLL_GET and FOLL_PIN are mutually exclusive. 
*/ > > + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > > + return -EINVAL; > > + > > + gup_flags |= FOLL_PIN; > > + return internal_get_user_pages_fast(start, nr_pages, gup_flags, > > pages); > > +} > > +EXPORT_SYMBOL_GPL(pin_user_pages_fast); > > I was somewhat wondering about the number of functions you add here. So we > have: > > pin_user_pages() > pin_user_pages_fast() > pin_user_pages_remote() > > and then longterm variants: > > pin_longterm_pages() > pin_longterm_pages_fast() > pin_longterm_pages_remote() > > and obviously we have gup like: > get_user_pages() > get_user_pages_fast() > get_user_pages_remote() > ... and some other gup variants ... > > I think we really should have pin_* vs get_* variants as they are very > different in terms of guarantees and after conversion, any use of get_* > variant in non-mm code should be closely scrutinized. OTOH pin_longterm_* > don't look *that* useful to me and just using pin_* instead with > FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of > functions which is already large enough? What do people think? I don't feel > too strongly about this but wanted to bring this up. I'd vote that FOLL_LONGTERM should obviate the need for {get,pin}_user_pages_longterm(). It's a property that is passed by the call site, not an internal flag.
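For orientation, a sketch of the calling convention the series establishes, using the names that were ultimately merged: DIO-style users pin with the pin_* API and release with unpin_user_page(), never put_page(), and long-term users pass FOLL_LONGTERM rather than calling a separate pin_longterm_* entry point, as Jan and Dan suggest above:

static int pin_one_page_example(unsigned long uaddr)
{
	struct page *page;
	int ret;

	ret = pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM,
				  &page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* ... hand the page to hardware for long-lived DMA ... */

	unpin_user_page(page);
	return 0;
}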
Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Tue, Nov 12, 2019 at 8:27 PM John Hubbard wrote: > > An upcoming patch changes and complicates the refcounting and > especially the "put page" aspects of it. In order to keep > everything clean, refactor the devmap page release routines: > > * Rename put_devmap_managed_page() to page_is_devmap_managed(), > and limit the functionality to "read only": return a bool, > with no side effects. > > * Add a new routine, put_devmap_managed_page(), to handle checking > what kind of page it is, and what kind of refcount handling it > requires. > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(), > and limit the functionality to unconditionally freeing a devmap > page. > > This is originally based on a separate patch by Ira Weiny, which > applied to an early version of the put_user_page() experiments. > Since then, Jérôme Glisse suggested the refactoring described above. > > Suggested-by: Jérôme Glisse > Signed-off-by: Ira Weiny > Signed-off-by: John Hubbard > --- > include/linux/mm.h | 27 --- > mm/memremap.c | 67 -- > 2 files changed, 53 insertions(+), 41 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index a2adf95b3f9c..96228376139c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page > *page) > #endif > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void __put_devmap_managed_page(struct page *page); > +void free_devmap_managed_page(struct page *page); > DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > -static inline bool put_devmap_managed_page(struct page *page) > + > +static inline bool page_is_devmap_managed(struct page *page) > { > if (!static_branch_unlikely(&devmap_managed_key)) > return false; > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page > *page) > switch (page->pgmap->type) { > case MEMORY_DEVICE_PRIVATE: > case MEMORY_DEVICE_FS_DAX: > - __put_devmap_managed_page(page); > return true; > default: > break; > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page > *page) > return false; > } > > +static inline bool put_devmap_managed_page(struct page *page) > +{ > + bool is_devmap = page_is_devmap_managed(page); > + > + if (is_devmap) { > + int count = page_ref_dec_return(page); > + > + /* > +* devmap page refcounts are 1-based, rather than 0-based: if > +* refcount is 1, then the page is free and the refcount is > +* stable because nobody holds a reference on the page. > +*/ > + if (count == 1) > + free_devmap_managed_page(page); > + else if (!count) > + __put_page(page); > + } > + > + return is_devmap; > +} > + > #else /* CONFIG_DEV_PAGEMAP_OPS */ > static inline bool put_devmap_managed_page(struct page *page) > { > diff --git a/mm/memremap.c b/mm/memremap.c > index 03ccbdfeb697..bc7e2a27d025 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void __put_devmap_managed_page(struct page *page) > +void free_devmap_managed_page(struct page *page) > { > - int count = page_ref_dec_return(page); > + /* Clear Active bit in case of parallel mark_page_accessed */ > + __ClearPageActive(page); > + __ClearPageWaiters(page); > + > + mem_cgroup_uncharge(page); Ugh, when did all this HMM specific manipulation sneak into the generic ZONE_DEVICE path? It used to be gated by pgmap type with its own put_zone_device_private_page(). 
For example, it's certainly unnecessary and might be broken (would need to check) to call mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a monolith and the HMM use case leaks pages into code paths that DAX explicitly avoids.
Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Wed, Nov 13, 2019 at 11:23 AM Dan Williams wrote: > > On Tue, Nov 12, 2019 at 8:27 PM John Hubbard wrote: > > > > An upcoming patch changes and complicates the refcounting and > > especially the "put page" aspects of it. In order to keep > > everything clean, refactor the devmap page release routines: > > > > * Rename put_devmap_managed_page() to page_is_devmap_managed(), > > and limit the functionality to "read only": return a bool, > > with no side effects. > > > > * Add a new routine, put_devmap_managed_page(), to handle checking > > what kind of page it is, and what kind of refcount handling it > > requires. > > > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(), > > and limit the functionality to unconditionally freeing a devmap > > page. > > > > This is originally based on a separate patch by Ira Weiny, which > > applied to an early version of the put_user_page() experiments. > > Since then, Jérôme Glisse suggested the refactoring described above. > > > > Suggested-by: Jérôme Glisse > > Signed-off-by: Ira Weiny > > Signed-off-by: John Hubbard > > --- > > include/linux/mm.h | 27 --- > > mm/memremap.c | 67 -- > > 2 files changed, 53 insertions(+), 41 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index a2adf95b3f9c..96228376139c 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct > > page *page) > > #endif > > > > #ifdef CONFIG_DEV_PAGEMAP_OPS > > -void __put_devmap_managed_page(struct page *page); > > +void free_devmap_managed_page(struct page *page); > > DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > > -static inline bool put_devmap_managed_page(struct page *page) > > + > > +static inline bool page_is_devmap_managed(struct page *page) > > { > > if (!static_branch_unlikely(&devmap_managed_key)) > > return false; > > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page > > *page) > > switch (page->pgmap->type) { > > case MEMORY_DEVICE_PRIVATE: > > case MEMORY_DEVICE_FS_DAX: > > - __put_devmap_managed_page(page); > > return true; > > default: > > break; > > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page > > *page) > > return false; > > } > > > > +static inline bool put_devmap_managed_page(struct page *page) > > +{ > > + bool is_devmap = page_is_devmap_managed(page); > > + > > + if (is_devmap) { > > + int count = page_ref_dec_return(page); > > + > > + /* > > +* devmap page refcounts are 1-based, rather than 0-based: > > if > > +* refcount is 1, then the page is free and the refcount is > > +* stable because nobody holds a reference on the page. 
> > +*/ > > + if (count == 1) > > + free_devmap_managed_page(page); > > + else if (!count) > > + __put_page(page); > > + } > > + > > + return is_devmap; > > +} > > + > > #else /* CONFIG_DEV_PAGEMAP_OPS */ > > static inline bool put_devmap_managed_page(struct page *page) > > { > > diff --git a/mm/memremap.c b/mm/memremap.c > > index 03ccbdfeb697..bc7e2a27d025 100644 > > --- a/mm/memremap.c > > +++ b/mm/memremap.c > > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > > > #ifdef CONFIG_DEV_PAGEMAP_OPS > > -void __put_devmap_managed_page(struct page *page) > > +void free_devmap_managed_page(struct page *page) > > { > > - int count = page_ref_dec_return(page); > > + /* Clear Active bit in case of parallel mark_page_accessed */ > > + __ClearPageActive(page); > > + __ClearPageWaiters(page); > > + > > + mem_cgroup_uncharge(page); > > Ugh, when did all this HMM specific manipulation sneak into the > generic ZONE_DEVICE path? It used to be gated by pgmap type with its > own put_zone_device_private_page(). For example it's certainly > unnecessary and might be broken (would need to check) to call > mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a monolith and the HMM use case leaks pages into code paths that DAX explicitly avoids. It's been this way for a while and I did not react previously, apologies for that. I think __ClearPageActive, __ClearPageWaiters, and mem_cgroup_uncharge, belong behind a device-private conditional. The history here is: Move some, but not all HMM specifics to hmm_devmem_free(): 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put Remove the clearing of mapping since no upstream consumers needed it: b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free Add it back in once an upstream consumer arrived: 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse We're now almost entirely free of ->page_free callbacks except for that weird nouveau case, can that FIXME in nouveau_dmem_page_free() also result in killing the ->page_free() callback altogether? In the meantime I'm proposing a cleanup like this:
Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Wed, Nov 13, 2019 at 2:49 PM John Hubbard wrote: > > On 11/13/19 2:00 PM, Dan Williams wrote: > ... > >> Ugh, when did all this HMM specific manipulation sneak into the > >> generic ZONE_DEVICE path? It used to be gated by pgmap type with its > >> own put_zone_device_private_page(). For example it's certainly > >> unnecessary and might be broken (would need to check) to call > >> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a > >> monolith and the HMM use case leaks pages into code paths that DAX > >> explicitly avoids. > > > > It's been this way for a while and I did not react previously, > > apologies for that. I think __ClearPageActive, __ClearPageWaiters, and > > mem_cgroup_uncharge, belong behind a device-private conditional. The > > history here is: > > > > Move some, but not all HMM specifics to hmm_devmem_free(): > > 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put > > > > Remove the clearing of mapping since no upstream consumers needed it: > > b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free > > > > Add it back in once an upstream consumer arrived: > > 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse > > > > We're now almost entirely free of ->page_free callbacks except for > > that weird nouveau case, can that FIXME in nouveau_dmem_page_free() > > also result in killing the ->page_free() callback altogether? In the > > meantime I'm proposing a cleanup like this: > > > OK, assuming this is acceptable (no obvious problems jump out at me, > and we can also test it with HMM), then how would you like to proceed, as > far as patches go: add such a patch as part of this series here, or as a > stand-alone patch either before or after this series? Or something else? > And did you plan on sending it out as such? I think it makes sense to include it in your series since you're looking to refactor the implementation. I can send you one based on current linux-next as a lead-in cleanup before the refactor. Does that work for you? > > Also, the diffs didn't quite make it through intact to my "git apply", so > I'm re-posting the diff in hopes that this time it survives: Apologies for that. For quick "how about this" patch examples, I just copy and paste into gmail and it sometimes clobbers it. > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index f9f76f6ba07b..21db1ce8c0ae 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -338,13 +338,7 @@ static void pmem_release_disk(void *__pmem) > put_disk(pmem->disk); > } > > -static void pmem_pagemap_page_free(struct page *page) > -{ > - wake_up_var(&page->_refcount); > -} > - > static const struct dev_pagemap_ops fsdax_pagemap_ops = { > - .page_free = pmem_pagemap_page_free, > .kill = pmem_pagemap_kill, > .cleanup= pmem_pagemap_cleanup, > }; > diff --git a/mm/memremap.c b/mm/memremap.c > index 03ccbdfeb697..157edb8f7cf8 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page) > * holds a reference on the page. > */ > if (count == 1) { > - /* Clear Active bit in case of parallel mark_page_accessed */ > - __ClearPageActive(page); > - __ClearPageWaiters(page); > - > - mem_cgroup_uncharge(page); > - > /* > * When a device_private page is freed, the page->mapping > field > * may still contain a (stale) mapping value. For example, the > @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page) > * handled differently or not done at all, so there is no need > * to clear page->mapping. 
> */ > - if (is_device_private_page(page)) > - page->mapping = NULL; > + if (is_device_private_page(page)) { > + /* Clear Active bit in case of parallel > mark_page_accessed */ > + __ClearPageActive(page); > + __ClearPageWaiters(page); > > - page->pgmap->ops->page_free(page); > + mem_cgroup_uncharge(page); > + > + page->mapping = NULL; > + page->pgmap->ops->page_free(page); > + } else > + wake_up_var(&page->_refcount);
Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Mon, Dec 16, 2019 at 2:26 PM John Hubbard wrote: > > An upcoming patch changes and complicates the refcounting and > especially the "put page" aspects of it. In order to keep > everything clean, refactor the devmap page release routines: > > * Rename put_devmap_managed_page() to page_is_devmap_managed(), > and limit the functionality to "read only": return a bool, > with no side effects. > > * Add a new routine, put_devmap_managed_page(), to handle checking > what kind of page it is, and what kind of refcount handling it > requires. > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(), > and limit the functionality to unconditionally freeing a devmap > page. > > This is originally based on a separate patch by Ira Weiny, which > applied to an early version of the put_user_page() experiments. > Since then, Jérôme Glisse suggested the refactoring described above. > > Cc: Christoph Hellwig > Suggested-by: Jérôme Glisse > Reviewed-by: Dan Williams > Reviewed-by: Jan Kara > Signed-off-by: Ira Weiny > Signed-off-by: John Hubbard > --- > include/linux/mm.h | 17 + > mm/memremap.c | 16 ++-- > mm/swap.c | 24 > 3 files changed, 39 insertions(+), 18 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index c97ea3b694e6..77a4df06c8a7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page > *page) > #endif > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void __put_devmap_managed_page(struct page *page); > +void free_devmap_managed_page(struct page *page); > DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > -static inline bool put_devmap_managed_page(struct page *page) > + > +static inline bool page_is_devmap_managed(struct page *page) > { > if (!static_branch_unlikely(&devmap_managed_key)) > return false; > @@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page > *page) > switch (page->pgmap->type) { > case MEMORY_DEVICE_PRIVATE: > case MEMORY_DEVICE_FS_DAX: > - __put_devmap_managed_page(page); > return true; > default: > break; > @@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page > *page) > return false; > } > > +bool put_devmap_managed_page(struct page *page); > + > #else /* CONFIG_DEV_PAGEMAP_OPS */ > +static inline bool page_is_devmap_managed(struct page *page) > +{ > + return false; > +} > + > static inline bool put_devmap_managed_page(struct page *page) > { > return false; > @@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page) > * need to inform the device driver through callback. See > * include/linux/memremap.h and HMM for details. 
*/ > - if (put_devmap_managed_page(page)) > + if (page_is_devmap_managed(page)) { > + put_devmap_managed_page(page); > return; > + } > > if (put_page_testzero(page)) > __put_page(page); > diff --git a/mm/memremap.c b/mm/memremap.c > index e899fa876a62..2ba773859031 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void __put_devmap_managed_page(struct page *page) > +void free_devmap_managed_page(struct page *page) > { > - int count = page_ref_dec_return(page); > - > - /* still busy */ > - if (count > 1) > - return; > - > - /* only triggered by the dev_pagemap shutdown path */ > - if (count == 0) { > - __put_page(page); > - return; > - } > - > /* notify page idle for dax */ > if (!is_device_private_page(page)) { > wake_up_var(&page->_refcount); > @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page) > page->mapping = NULL; > page->pgmap->ops->page_free(page); > } > -EXPORT_SYMBOL(__put_devmap_managed_page); > +EXPORT_SYMBOL(free_devmap_managed_page); This patch does not have a module consumer for free_devmap_managed_page(), so the export should move to the patch that needs the new export. Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL instead of EXPORT_SYMBOL_GPL is that there was no practical way to hide the devmap details from every module in the kernel that did put_page(). I would expect free_devmap_managed_page() to EXPORT_SYMBOL_GPL if it is not inlined into an existing exported static inline api.
Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Wed, Dec 18, 2019 at 9:51 PM John Hubbard wrote: > > On 12/18/19 9:27 PM, Dan Williams wrote: > ... > >> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page) > >> page->mapping = NULL; > >> page->pgmap->ops->page_free(page); > >> } > >> -EXPORT_SYMBOL(__put_devmap_managed_page); > >> +EXPORT_SYMBOL(free_devmap_managed_page); > > > > This patch does not have a module consumer for > > free_devmap_managed_page(), so the export should move to the patch > > that needs the new export. > > Hi Dan, > > OK, I know that's a policy--although it seems quite pointless here given > that this is definitely going to need an EXPORT. > > At the moment, the series doesn't use it in any module at all, so I'll just > delete the EXPORT for now. > > > > > Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL > > instead of EXPORT_SYMBOL_GPL is that there was no practical way to > > hide the devmap details from every module in the kernel that did > > put_page(). I would expect free_devmap_managed_page() to > > EXPORT_SYMBOL_GPL if it is not inlined into an existing exported > > static inline api. > > > > Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have > to be careful that we don't shut out normal put_page() types of callers, > but...glancing through the current callers, that doesn't look to be a problem. > Good. So it should be OK to do EXPORT_SYMBOL_GPL here. > > Are you *sure* you don't want to just pre-emptively EXPORT now, and save > looking at it again? I'm positive. There is enough history for "trust me the consumer is coming" turning out not to be true to justify the hassle in my mind. I do trust you, but things happen.
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe wrote: > > On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote: > > On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > > > Hi, > > > > > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN > > > > > > pages. It > > > > > > extends that tracking to a few select subsystems. More subsystems > > > > > > will > > > > > > be added in follow up work. > > > > > > > > > > Hi John, > > > > > > > > > > The patchset generates kernel panics in our IB testing. In our tests, > > > > > we > > > > > allocated single memory block and registered multiple MRs using the > > > > > single > > > > > block. > > > > > > > > > > The possible bad flow is: > > > > >ib_umem_get() -> > > > > > pin_user_pages_fast(FOLL_WRITE) -> > > > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > > > gup_pgd_range() -> > > > > >gup_huge_pd() -> > > > > > gup_hugepte() -> > > > > > try_grab_compound_head() -> > > > > Hi Leon, > > > > Thanks very much for the detailed report! So we're overflowing... > > > > At first look, this seems likely to be hitting a weak point in the > > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be > > > > deferred > > > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > > > with huge pages if the pages have a *lot* of subpages. > > > > > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > > > > > Considering that establishing these pins is entirely under user > > > control, we can't have a limit here. > > > > There's already a limit, it's just a much larger one. :) What does "no > > limit" > > really mean, numerically, to you in this case? > > I guess I mean 'hidden limit' - hitting the limit and failing would > be manageable. > > I think 7 is probably too low though, but we are not using 1GB huge > pages, only 2M.. What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?
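The arithmetic behind John's "about 7 pins" figure, assuming the series' scheme where pinning adds GUP_PIN_COUNTING_BIAS (1024) per 4KB subpage to the compound head's 32-bit _refcount (max_full_pins() is an illustrative helper):

#define GUP_PIN_COUNTING_BIAS	(1U << 10)	/* 1024, per the series */

static unsigned long max_full_pins(unsigned long huge_size,
				   unsigned long subpage_size)
{
	/* 1GB / 4KB = 262144 subpages; one full-range pin costs
	 * 262144 * 1024 = 2^28 references, and _refcount can only
	 * count to 2^31, so 8 pins fit -- about 7 once the page's
	 * baseline references are accounted for. */
	unsigned long refs_per_pin = (huge_size / subpage_size) *
				     GUP_PIN_COUNTING_BIAS;

	return (1UL << 31) / refs_per_pin;
}

For 2MB huge pages the same math gives 2^31 / (512 * 1024) = 4096 pins, which is why Jason's 2M workloads have headroom while 1GB mappings are much tighter.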
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 1:22 AM Jan Kara wrote: > > On Thu 19-12-19 12:30:31, John Hubbard wrote: > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > Hi, > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > extends that tracking to a few select subsystems. More subsystems will > > > > be added in follow up work. > > > > > > Hi John, > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > allocated single memory block and registered multiple MRs using the single > > > block. > > > > > > The possible bad flow is: > > > ib_umem_get() -> > > >pin_user_pages_fast(FOLL_WRITE) -> > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > gup_pgd_range() -> > > > gup_huge_pd() -> > > >gup_hugepte() -> > > > try_grab_compound_head() -> > > > > Hi Leon, > > > > Thanks very much for the detailed report! So we're overflowing... > > > > At first look, this seems likely to be hitting a weak point in the > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > with huge pages if the pages have a *lot* of subpages. > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > Do you have any idea how many pins (repeated pins on the same page, which > > it sounds like you have) might be involved in your test case, > > and the huge page and system page sizes? That would allow calculating > > if we're likely overflowing for that reason. > > > > So, ideas and next steps: > > > > 1. Assuming that you *are* hitting this, I think I may have to fall back to > > implementing the "deferred" part of this design, as part of this series, > > after > > all. That means: > > > > For the pin/unpin calls at least, stop treating all pages as if they are > > a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. > > That's not how it works now, and the need to hand back a huge array of > > subpages is part of the problem. This affects the callers too, so it's not > > a super quick change to make. (I was really hoping not to have to do this > > yet.) > > Does that mean that you would need to make all GUP users huge page aware? > Otherwise I don't see how what you suggest would work... And I don't think > making all GUP users huge page aware is realistic (effort-wise) or even > wanted (maintenance overhead in all those places). > > I believe there might also be a different solution for this: For > transparent huge pages, we could find a space in 'struct page' of the > second page in the huge page for proper pin counter and just account pins > there so we'd have full width of 32-bits for it. That would require THP accounting for dax pages. It is something that was probably going to be needed, but this would seem to force the issue.
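A sketch of Jan's idea with an invented field name (struct page has no such member; this is purely illustrative): tail pages of a compound page have unused struct-page space, so a tail page could carry a dedicated full-width pin counter instead of biasing the head's _refcount:

/* hypothetical: a pin counter parked in the first tail page */
static atomic_t *huge_page_pincount_ptr(struct page *head)
{
	return &head[1].pincount;	/* invented field, for illustration */
}

static void huge_page_pin(struct page *head)
{
	atomic_inc(huge_page_pincount_ptr(head));
}

Overflow would then require 2^31 pins of the same huge page, independent of how many subpages it has.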
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 4:41 PM John Hubbard wrote: > > On 12/20/19 4:33 PM, Dan Williams wrote: > ... > >> I believe there might also be a different solution for this: For > >> transparent huge pages, we could find a space in 'struct page' of the > >> second page in the huge page for proper pin counter and just account pins > >> there so we'd have full width of 32-bits for it. > > > > That would require THP accounting for dax pages. It is something that > > was probably going to be needed, but this would seem to force the > > issue. > > > > Thanks for mentioning that, it wasn't obvious to me yet. > > How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?) > test setup? I'd hate to go into this without having that coverage up > and running. It's been sketchy enough as it is. :) You too can have the power of the gods for the low low price of a kernel command line parameter, or a qemu setup. Details here: https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system https://nvdimm.wiki.kernel.org/pmem_in_qemu
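For instance (illustrative addresses; see the first wiki page above for picking values that fit your e820 map), booting with:

	memmap=4G!12G

reserves the 4GB of RAM starting at the 12GB physical offset as emulated persistent memory, which the nvdimm core then exposes as a /dev/pmem device that can be configured into fsdax or devdax mode for testing.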
Re: [PATCH v7 05/24] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
On Thu, Nov 21, 2019 at 12:57 AM John Hubbard wrote: > > On 11/21/19 12:05 AM, Christoph Hellwig wrote: > > So while this looks correct, I still really don't see the major > > benefit of the new code organization, especially as it bloats all > > put_page callers. > > > > I'd love to see code size change stats for an allyesconfig on this > > commit. > > > > Right, I'm running that now, will post the results. (btw, if there is > a script and/or standard format I should use, I'm all ears. I'll dig > through lwn...) > Just run: size vmlinux
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Wed, Aug 7, 2019 at 10:45 AM Jason Gunthorpe wrote: > > On Tue, Aug 06, 2019 at 07:05:42PM +0300, Christoph Hellwig wrote: > > There is only a single place where the pgmap is passed over a function > > call, so replace it with local variables in the places where we deal > > with the pgmap. > > > > Signed-off-by: Christoph Hellwig > > mm/hmm.c | 62 > > 1 file changed, 27 insertions(+), 35 deletions(-) > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index 9a908902e4cc..d66fa29b42e0 100644 > > +++ b/mm/hmm.c > > @@ -278,7 +278,6 @@ EXPORT_SYMBOL(hmm_mirror_unregister); > > > > struct hmm_vma_walk { > > struct hmm_range*range; > > - struct dev_pagemap *pgmap; > > unsigned long last; > > unsigned intflags; > > }; > > @@ -475,6 +474,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > struct hmm_vma_walk *hmm_vma_walk = walk->private; > > struct hmm_range *range = hmm_vma_walk->range; > > + struct dev_pagemap *pgmap = NULL; > > unsigned long pfn, npages, i; > > bool fault, write_fault; > > uint64_t cpu_flags; > > @@ -490,17 +490,14 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, > > pfn = pmd_pfn(pmd) + pte_index(addr); > > for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { > > if (pmd_devmap(pmd)) { > > - hmm_vma_walk->pgmap = get_dev_pagemap(pfn, > > - hmm_vma_walk->pgmap); > > - if (unlikely(!hmm_vma_walk->pgmap)) > > + pgmap = get_dev_pagemap(pfn, pgmap); > > + if (unlikely(!pgmap)) > > return -EBUSY; > > Unrelated to this patch, but what is the point of getting and checking > that the pgmap exists for the page and then immediately releasing it? > This code has this pattern in several places. > > It feels racy Agree, not sure what the intent is here. The only other reason to call get_dev_pagemap() is to just check in general if the pfn is indeed owned by some ZONE_DEVICE instance, but if the intent is to make sure the device is still attached/enabled that check is invalidated at put_dev_pagemap(). If it's the former case, validating ZONE_DEVICE pfns, I imagine we can do something cheaper with a helper that is on the order of the same cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag or something similar. > > > } > > pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags; > > } > > - if (hmm_vma_walk->pgmap) { > > - put_dev_pagemap(hmm_vma_walk->pgmap); > > - hmm_vma_walk->pgmap = NULL; > > Putting the value in the hmm_vma_walk would have made some sense to me > if the pgmap was not set to NULL all over the place. Then most xa_loads > would be eliminated, as I would expect the pgmap tends to be > mostly uniform for these use cases. > > Is there some reason the pgmap ref can't be held across > faulting/sleeping? ie like below. No restriction on holding refs over faulting / sleeping. > > Anyhow, I looked over this pretty carefully and the change looks > functionally OK, I just don't know why the code is like this in the first place. 
> > Reviewed-by: Jason Gunthorpe > > diff --git a/mm/hmm.c b/mm/hmm.c > index 9a908902e4cc38..4e30128c23a505 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -497,10 +497,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, > } > pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags; > } > - if (hmm_vma_walk->pgmap) { > - put_dev_pagemap(hmm_vma_walk->pgmap); > - hmm_vma_walk->pgmap = NULL; > - } > hmm_vma_walk->last = end; > return 0; > #else > @@ -604,10 +600,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, > unsigned long addr, > return 0; > > fault: > - if (hmm_vma_walk->pgmap) { > - put_dev_pagemap(hmm_vma_walk->pgmap); > - hmm_vma_walk->pgmap = NULL; > - } > pte_unmap(ptep); > /* Fault any virtual address we were asked to fault */ > return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk); > @@ -690,16 +682,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, > return r; > } > } > - if (hmm_vma_walk->pgmap) { > - /* > -* We do put_dev_pagemap() here and not in > hmm_vma_handle_pte() > -* so that we can leverage get_dev_pagemap() optimization > which > -* will not re-take a reference on a pgmap if we already have > -* one. > -*/ > - put_dev_pagemap(hmm_vma_walk->pgmap); > - hmm_vma_walk->pgmap = NULL; > - } > pte_unmap(ptep - 1); > >
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Wed, Aug 7, 2019 at 11:59 PM Christoph Hellwig wrote: > > On Wed, Aug 07, 2019 at 11:47:22AM -0700, Dan Williams wrote: > > > Unrelated to this patch, but what is the point of getting and checking > > > that the pgmap exists for the page and then immediately releasing it? > > > This code has this pattern in several places. > > > > > > It feels racy > > > > Agree, not sure what the intent is here. The only other reason to call > > get_dev_pagemap() is to just check in general if the pfn is indeed > > owned by some ZONE_DEVICE instance, but if the intent is to make sure > > the device is still attached/enabled that check is invalidated at > > put_dev_pagemap(). > > > > If it's the former case, validating ZONE_DEVICE pfns, I imagine we can > > do something cheaper with a helper that is on the order of the same > > cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag > > or something similar. > > The hmm literally never dereferences the pgmap, so validity checking is > the only explanation for it. > > > > + /* > > > +* We do put_dev_pagemap() here so that we can leverage > > > +* get_dev_pagemap() optimization which will not re-take a > > > +* reference on a pgmap if we already have one. > > > +*/ > > > + if (hmm_vma_walk->pgmap) > > > + put_dev_pagemap(hmm_vma_walk->pgmap); > > > + > > > > Seems ok, but only if the caller is guaranteeing that the range does > > not span outside of a single pagemap instance. If that guarantee is > > met why not just have the caller pass in a pinned pagemap? If that > > guarantee is not met, then I think we're back to your race concern. > > It iterates over multiple ptes in a non-huge pmd. Is there any kind of > limitation on different pgmap instances inside a pmd? I can't think > of one, so this might actually be a bug. Section alignment constraints somewhat save us here. The only example I can think of a PMD not containing a uniform pgmap association for each pte is the case when the pgmap overlaps normal dram, i.e. shares the same 'struct memory_section' for a given span. Otherwise, distinct pgmaps arrange to manage their own exclusive sections (and now subsections as of v5.3). Otherwise the implementation could not guarantee different mapping lifetimes. That said, this seems to want a better mechanism to determine "pfn is ZONE_DEVICE".
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe wrote:
>
> On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > Section alignment constraints somewhat save us here. The only example
> > > I can think of where a PMD does not contain a uniform pgmap
> > > association for each pte is the case where the pgmap overlaps normal
> > > DRAM, i.e. shares the same 'struct memory_section' for a given span.
> > > Otherwise, distinct pgmaps arrange to manage their own exclusive
> > > sections (and now subsections as of v5.3); without that, the
> > > implementation could not guarantee distinct mapping lifetimes.
> > >
> > > That said, this seems to want a better mechanism to determine "pfn is
> > > ZONE_DEVICE".
> >
> > So I guess this patch is fine for now, and once you provide a better
> > mechanism we can switch over to it?
>
> What about the version I sent to just get rid of all the strange
> put_dev_pagemap() calls while scanning? Odds are good we will work with
> only a single pagemap, so it makes some sense to cache it once we find
> it?

Yes, if the scan is over a single pmd then caching it makes sense.
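The cached-lookup version Jason mentions leans on get_dev_pagemap()'s
second argument: when the pfn falls inside the cached pgmap the existing
reference is reused, otherwise the old reference is dropped and a new
lookup is performed. A sketch of the pattern (not the literal patch):

/*
 * Sketch: amortize pgmap lookups across a pte scan, taking one
 * reference per pgmap instead of one per pte.
 */
static int scan_ptes(pte_t *ptep, unsigned long npages)
{
	struct dev_pagemap *pgmap = NULL;
	unsigned long i;

	for (i = 0; i < npages; i++, ptep++) {
		if (!pte_devmap(*ptep))
			continue;
		pgmap = get_dev_pagemap(pte_pfn(*ptep), pgmap);
		if (!pgmap)
			return -EBUSY;	/* pgmap died under us */
	}
	if (pgmap)
		put_dev_pagemap(pgmap);	/* single put for the whole scan */
	return 0;
}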
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse wrote:
>
> On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe wrote:
> > >
> > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > Section alignment constraints somewhat save us here. The only
> > > > > example I can think of where a PMD does not contain a uniform
> > > > > pgmap association for each pte is the case where the pgmap
> > > > > overlaps normal DRAM, i.e. shares the same 'struct memory_section'
> > > > > for a given span. Otherwise, distinct pgmaps arrange to manage
> > > > > their own exclusive sections (and now subsections as of v5.3);
> > > > > without that, the implementation could not guarantee distinct
> > > > > mapping lifetimes.
> > > > >
> > > > > That said, this seems to want a better mechanism to determine
> > > > > "pfn is ZONE_DEVICE".
> > > >
> > > > So I guess this patch is fine for now, and once you provide a better
> > > > mechanism we can switch over to it?
> > >
> > > What about the version I sent to just get rid of all the strange
> > > put_dev_pagemap() calls while scanning? Odds are good we will work
> > > with only a single pagemap, so it makes some sense to cache it once
> > > we find it?
> >
> > Yes, if the scan is over a single pmd then caching it makes sense.
>
> Quite frankly an easier and better solution is to remove the pagemap
> lookup: HMM users abide by mmu notifiers, which means we will not make
> use of or dereference the struct page, so we are safe from any racing
> hotunplug of dax memory (as long as device drivers using hmm do not
> have a bug).

Yes, as long as the driver remove is synchronized against HMM
operations via another mechanism then there is no need to take pagemap
references. Can you briefly describe what that other mechanism is?
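The contract Jerome is pointing at is the mmu-notifier sequence
pattern: results from hmm_range_fault() may only be consumed after
re-validating under the same driver lock that the invalidate callback
takes. A sketch of the flow using today's mmu_interval_notifier API
(the thread predates it; mmap locking and pfn programming are elided,
and "dlock" is a stand-in for the driver's own lock):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

/*
 * Sketch: only trust hmm_range_fault() results after the notifier
 * sequence count is re-checked under the driver lock that the
 * invalidate callback also takes.
 */
static int mirror_range(struct hmm_range *range, struct mutex *dlock)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);
	ret = hmm_range_fault(range);
	if (ret)
		return ret;

	mutex_lock(dlock);
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		mutex_unlock(dlock);
		goto again;	/* an invalidation raced with the fault */
	}
	/* ... program device page tables from range->hmm_pfns ... */
	mutex_unlock(dlock);
	return 0;
}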
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Thu, Aug 15, 2019 at 12:44 PM Jerome Glisse wrote:
>
> On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> > On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse wrote:
> > >
> > > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe wrote:
> > > > >
> > > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > > Section alignment constraints somewhat save us here. The only
> > > > > > > example I can think of where a PMD does not contain a uniform
> > > > > > > pgmap association for each pte is the case where the pgmap
> > > > > > > overlaps normal DRAM, i.e. shares the same 'struct
> > > > > > > memory_section' for a given span. Otherwise, distinct pgmaps
> > > > > > > arrange to manage their own exclusive sections (and now
> > > > > > > subsections as of v5.3); without that, the implementation
> > > > > > > could not guarantee distinct mapping lifetimes.
> > > > > > >
> > > > > > > That said, this seems to want a better mechanism to determine
> > > > > > > "pfn is ZONE_DEVICE".
> > > > > >
> > > > > > So I guess this patch is fine for now, and once you provide a
> > > > > > better mechanism we can switch over to it?
> > > > >
> > > > > What about the version I sent to just get rid of all the strange
> > > > > put_dev_pagemap() calls while scanning? Odds are good we will
> > > > > work with only a single pagemap, so it makes some sense to cache
> > > > > it once we find it?
> > > >
> > > > Yes, if the scan is over a single pmd then caching it makes sense.
> > >
> > > Quite frankly an easier and better solution is to remove the pagemap
> > > lookup: HMM users abide by mmu notifiers, which means we will not
> > > make use of or dereference the struct page, so we are safe from any
> > > racing hotunplug of dax memory (as long as device drivers using hmm
> > > do not have a bug).
> >
> > Yes, as long as the driver remove is synchronized against HMM
> > operations via another mechanism then there is no need to take pagemap
> > references. Can you briefly describe what that other mechanism is?
>
> So if you hotunplug some dax memory I assume that this can only
> happen once all the pages are unmapped (as they must have a
> zero refcount, well 1 because of the bias) and any unmap will
> trigger a mmu notifier callback. Users of the hmm mirror abiding by
> the API will never make use of information they get through the
> fault or snapshot functions until checking for a racing notifier
> under lock.

Hmm, that first assumption is not guaranteed by the dev_pagemap core.
The dev_pagemap end-of-life model is "disable, invalidate, drain", so
it's possible to call devm_memunmap_pages() while pages are still
mapped; it just won't complete the teardown of the pagemap until the
last reference is dropped. New references are blocked during this
teardown.

However, if the driver is validating the liveness of the mapping in
the mmu-notifier path and blocking new references it sounds like it
should be ok. Might there be GPU driver unit tests that cover this
racing teardown case?
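The "disable, invalidate, drain" model maps roughly onto the following
sequence. This is a schematic, not the literal memremap core; field
names follow the current 'struct dev_pagemap', which embeds the
percpu-ref and a completion:

/*
 * Schematic pgmap end-of-life, per the description above. The real
 * sequencing lives in the memunmap path of the memremap core.
 */
static void pgmap_teardown(struct dev_pagemap *pgmap)
{
	/* 1) disable: fail all new get_dev_pagemap() / tryget attempts */
	percpu_ref_kill(&pgmap->ref);

	/*
	 * 2) invalidate: tear down any remaining mappings so the next
	 *    access faults rather than touching dying pages (driver or
	 *    filesystem specific, e.g. unmap_mapping_range())
	 */

	/* 3) drain: wait for the last page ref to drop the last pgmap ref */
	wait_for_completion(&pgmap->done);
}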
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Thu, Aug 15, 2019 at 1:41 PM Jason Gunthorpe wrote:
>
> On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
>
> > So neither HMM nor the driver should dereference the struct page (I do
> > not think any iommu driver would either),
>
> Er, they do technically deref the struct page:
>
> nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
> 		struct hmm_range *range)
> 	struct page *page;
> 	page = hmm_pfn_to_page(range, range->pfns[i]);
> 	if (!nouveau_dmem_page(drm, page)) {
>
> nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> {
> 	return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
>
> Which does touch 'page->pgmap'
>
> Is this OK without having a get_dev_pagemap() ?
>
> Noting that the collision-retry scheme doesn't protect anything here
> as we can have a concurrent invalidation while doing the above deref.

As long as take_driver_page_table_lock() in Jerome's flow can replace
percpu_ref_tryget_live() on the pagemap reference. It seems
nouveau_dmem_convert_pfn() happens after:

	mutex_lock(&svmm->mutex);
	if (!nouveau_range_done(&range)) {

...so I would expect that to be functionally equivalent to validating
the reference count.
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Thu, Aug 15, 2019 at 5:41 PM Jason Gunthorpe wrote:
>
> On Thu, Aug 15, 2019 at 01:47:12PM -0700, Dan Williams wrote:
> > On Thu, Aug 15, 2019 at 1:41 PM Jason Gunthorpe wrote:
> > >
> > > On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> > >
> > > > So neither HMM nor the driver should dereference the struct page (I
> > > > do not think any iommu driver would either),
> > >
> > > Er, they do technically deref the struct page:
> > >
> > > nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
> > > 		struct hmm_range *range)
> > > 	struct page *page;
> > > 	page = hmm_pfn_to_page(range, range->pfns[i]);
> > > 	if (!nouveau_dmem_page(drm, page)) {
> > >
> > > nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> > > {
> > > 	return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
> > >
> > > Which does touch 'page->pgmap'
> > >
> > > Is this OK without having a get_dev_pagemap() ?
> > >
> > > Noting that the collision-retry scheme doesn't protect anything here
> > > as we can have a concurrent invalidation while doing the above deref.
> >
> > As long as take_driver_page_table_lock() in Jerome's flow can replace
> > percpu_ref_tryget_live() on the pagemap reference. It seems
> > nouveau_dmem_convert_pfn() happens after:
> >
> > 	mutex_lock(&svmm->mutex);
> > 	if (!nouveau_range_done(&range)) {
> >
> > ...so I would expect that to be functionally equivalent to validating
> > the reference count.
>
> Yes, OK, that makes sense, I was mostly surprised by the statement the
> driver doesn't touch the struct page...
>
> I suppose "doesn't touch the struct page outside the driver lock" is
> the case.
>
> However, this means we cannot do any processing of ZONE_DEVICE pages
> outside the driver lock, so eg, doing any DMA map that might rely on
> MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
> a bit unfortunate.

Wouldn't P2PDMA use page pins? Not needing to hold a lock over
ZONE_DEVICE page operations was one of the motivations for plumbing
get_dev_pagemap() with a percpu-ref.
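The percpu-ref plumbing Dan refers to is what allows a lock-free
liveness check on the pgmap for the duration of an operation. A sketch,
assuming the caller already resolved the page (field layout per current
kernels, where the ref is embedded in 'struct dev_pagemap'):

/*
 * Sketch: lock-free liveness check via the pgmap percpu-ref. While
 * the tryget succeeds the pgmap (and its pages) cannot finish
 * teardown, so the page may be used without holding a driver lock.
 */
static int use_zone_device_page(struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	if (!percpu_ref_tryget_live(&pgmap->ref))
		return -ENXIO;		/* pgmap is shutting down */

	/* ... operate on the page, e.g. submit DMA ... */

	percpu_ref_put(&pgmap->ref);
	return 0;
}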
Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk
On Fri, Aug 16, 2019 at 5:24 AM Jason Gunthorpe wrote:
>
> On Thu, Aug 15, 2019 at 08:54:46PM -0700, Dan Williams wrote:
> > > However, this means we cannot do any processing of ZONE_DEVICE pages
> > > outside the driver lock, so eg, doing any DMA map that might rely on
> > > MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
> > > a bit unfortunate.
> >
> > Wouldn't P2PDMA use page pins? Not needing to hold a lock over
> > ZONE_DEVICE page operations was one of the motivations for plumbing
> > get_dev_pagemap() with a percpu-ref.
>
> hmm_range_fault() doesn't use page pins at all, so if a ZONE_DEVICE
> page comes out of it then it needs to use another locking pattern.
>
> If I follow it all right:
>
> We can do a get_dev_pagemap inside the page_walk and touch the pgmap,
> or we can do the 'device mutex && retry' pattern and touch the pgmap
> in the driver, under that lock.
>
> However in all cases the current get_dev_pagemap()'s in the page walk
> are not necessary, and we can delete them.

Yes, as long as 'struct page' instances resulting from that lookup are
not passed outside of that lock.
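For the pin-based alternative mentioned above, each pinned ZONE_DEVICE
page holds its pgmap alive until unpin, so pages can safely outlive the
driver lock. A sketch using today's pin API (the thread predates
pin_user_pages(); note that P2PDMA pages additionally require an
explicit FOLL_PCI_P2PDMA opt-in in current kernels):

/*
 * Sketch: page pins instead of locks. Each pinned page pins its
 * pgmap, so pgmap teardown cannot complete until unpin_user_pages().
 */
static long dma_with_pins(unsigned long uaddr, int npages,
			  struct page **pages)
{
	long pinned = pin_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);

	if (pinned < 0)
		return pinned;

	/* ... dma_map_*() and device I/O against the pinned pages ... */

	unpin_user_pages(pages, pinned);
	return 0;
}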
RE: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
Jason Gunthorpe via Lsf-pc wrote:
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
>
> https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/
>
> I have become interested in this with some immediacy because of
> IOMMUFD and this other discussion with Christoph:
>
> https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/

I think this is a worthwhile discussion. My main hangup with 'struct
page' elimination in general is that if anything needs to be allocated
to describe a physical address for other parts of the kernel to operate
on it, why not a 'struct page'? There are of course several
difficulties allocating a 'struct page' array, but I look at subsection
support and the tail page space optimization work as evidence that some
of the pain can be mitigated; what more needs to be done?

I also think this is somewhat of a separate consideration from
replacing a bio_vec with phyr, where that has value independent of the
mechanism used to manage phys_addr_t => dma_addr_t.

> Which results in, more or less, we have no way to do P2P DMA
> operations without struct page - and from the RDMA side solving this
> well at the DMA API means advancing at least some part of the physr
> idea.
>
> So - my objective is to enable the DMA API to "DMA map" something that
> is not a scatterlist, may or may not contain struct pages, but can
> still contain P2P DMA data. From there I would move RDMA MR's to use
> this new API, modify DMABUF to export it, complete the above VFIO
> series, and finally, use all of this to add back P2P support to VFIO
> when working with IOMMUFD by allowing IOMMUFD to obtain a safe
> reference to the VFIO memory using DMABUF. From there we'd want to see
> pin_user_pages optimized, and that also will need some discussion of
> how best to structure it.
>
> I also have several ideas on how something like physr can optimize the
> iommu driver ops when working with dma-iommu.c and IOMMUFD.
>
> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.
>
> Providing a kernel-wide alternative to scatterlist is something that
> has general interest across all the driver subsystems. I've started to
> view the general problem rather like xarray where the main focus is to
> create the appropriate abstraction and then go about transforming
> users to take advantage of the cleaner abstraction. scatterlist
> suffers here because it has an incredibly leaky API, a huge number of
> (often sketchy driver) users, and has historically been very difficult
> to improve.

When I read "general interest across all the driver subsystems" it is
hard not to ask "have all possible avenues to enable 'struct page' been
exhausted?"

> The session would quickly go over the current state of whatever the
> mailing list discussion evolves into and an open discussion around the
> different ideas.

Sounds good to me.
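For context, the "phyr" under discussion is essentially a
physical-address-range analog of a bio_vec with the struct page pointer
dropped. A rough sketch of the concept; the naming and field widths
here are illustrative, not a settled API:

#include <linux/types.h>

/*
 * Today's bio_vec (see <linux/bvec.h>) describes memory via a page
 * pointer:
 *
 *	struct bio_vec {
 *		struct page	*bv_page;
 *		unsigned int	bv_len;
 *		unsigned int	bv_offset;
 *	};
 *
 * A phyr would name physical address space directly, so P2P MMIO
 * ranges with no struct page still get a first-class representation:
 */
struct phyr {
	phys_addr_t	addr;	/* start of the physical range */
	size_t		len;	/* length in bytes */
};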
Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> > Jason Gunthorpe via Lsf-pc wrote:
> > > I would like to have a session at LSF to talk about Matthew's
> > > physr discussion starter:
> > >
> > > https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/
> > >
> > > I have become interested in this with some immediacy because of
> > > IOMMUFD and this other discussion with Christoph:
> > >
> > > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/
> >
> > I think this is a worthwhile discussion. My main hangup with 'struct
> > page' elimination in general is that if anything needs to be allocated
>
> You're the first one to bring up struct page elimination. Neither Jason
> nor I have that as our motivation.

Oh, ok, then maybe I misread the concern in the vfio discussion. I
thought the summary there was a debate over the ongoing requirement for
'struct page' for P2PDMA?
[PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
pages. The pgmap is a reference counted object that serves a similar
role as a 'struct request_queue'. Live references are obtained for each
in flight request / page, and once a page's reference count drops to
zero the associated pin of the pgmap is dropped as well. While a page
is idle nothing should be accessing it because that is effectively a
use-after-free situation. Unfortunately, all current ZONE_DEVICE
implementations deploy a layering violation to manage requests to
activate pages owned by a pgmap. Specifically, they take steps like
walking the pfns that were previously assigned at memremap_pages() time
and using pfn_to_page() to recall metadata like page->pgmap, or making
use of other data like page->zone_device_data.

The first step towards correcting that situation is to provide an API
to get access to a pgmap page that does not require the caller to know
the pfn, nor access any fields of an idle page. Ideally this API would
be able to support dynamic page creation instead of the current status
quo of pre-allocating and initializing pages.

On a prompt from Jason, introduce pgmap_request_folio() that operates
on an offset into a pgmap. It replaces the short-lived
pgmap_request_folios() that was continuing the layering violation of
assuming pages are available to be consulted before asking the pgmap to
make them available.

For now this only converts the callers to lookup the pgmap and generate
the pgmap offset, but it does not do the deeper cleanup of teaching
those call sites to generate those arguments without walking the page
metadata. For next steps it appears the DEVICE_PRIVATE implementations
could plumb the pgmap into the necessary callsites and switch to using
gen_pool_alloc() to track which offsets of a pgmap are allocated. For
DAX, dax_direct_access() could switch from returning pfns to returning
the associated @pgmap and @pgmap_offset. Those changes are saved for
follow-on work.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Christoph Hellwig
Cc: John Hubbard
Cc: Alistair Popple
Cc: Felix Kuehling
Cc: Alex Deucher
Cc: "Christian König"
Cc: "Pan, Xinhui"
Cc: David Airlie
Cc: Daniel Vetter
Cc: Ben Skeggs
Cc: Karol Herbst
Cc: Lyude Paul
Cc: "Jérôme Glisse"
Suggested-by: Jason Gunthorpe
Signed-off-by: Dan Williams
---
This builds on the dax reference counting reworks in mm-unstable.
 arch/powerpc/kvm/book3s_hv_uvmem.c       | 11 ++--
 drivers/dax/mapping.c                    | 10 +++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 +++--
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 13 +++-
 include/linux/memremap.h                 | 35 ---
 lib/test_hmm.c                           |  9 +++
 mm/memremap.c                            | 92 --
 7 files changed, 106 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 884ec112ad43..2ea59396f608 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -689,12 +689,14 @@ unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
  */
 static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 {
-	struct page *dpage = NULL;
+	struct dev_pagemap *pgmap = &kvmppc_uvmem_pgmap;
 	unsigned long bit, uvmem_pfn;
 	struct kvmppc_uvmem_page_pvt *pvt;
 	unsigned long pfn_last, pfn_first;
+	struct folio *folio;
+	struct page *dpage;
 
-	pfn_first = kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT;
+	pfn_first = pgmap->range.start >> PAGE_SHIFT;
 	pfn_last = pfn_first +
 		   (range_len(&kvmppc_uvmem_pgmap.range) >> PAGE_SHIFT);
@@ -716,9 +718,10 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 	pvt->gpa = gpa;
 	pvt->kvm = kvm;
 
-	dpage = pfn_to_page(uvmem_pfn);
+	folio = pgmap_request_folio(pgmap,
+				    pfn_to_pgmap_offset(pgmap, uvmem_pfn), 0);
+	dpage = &folio->page;
 	dpage->zone_device_data = pvt;
-	pgmap_request_folios(dpage->pgmap, page_folio(dpage), 1);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index ca06f2515644..b885c75e2dfb 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -376,8 +376,14 @@ static vm_fault_t dax_associate_entry(void *entry,
 	if (flags & DAX_COW) {
 		dax_mapping_set_cow(folio);
 	} else {
+		struct dev_pagemap *pgmap = folio_pgmap(folio);
+		unsigned long pfn = page_to_pfn(&folio->page);
+
 		WARN_ON_ONCE(folio->mapping);
-		if (!pgmap_req
Re: [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
Felix Kuehling wrote:
> On 2022-10-20 at 17:56, Dan Williams wrote:
> > A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> > pages. The pgmap is a reference counted object that serves a similar
> > role as a 'struct request_queue'. Live references are obtained for
> > each in flight request / page, and once a page's reference count
> > drops to zero the associated pin of the pgmap is dropped as well.
> > While a page is idle nothing should be accessing it because that is
> > effectively a use-after-free situation. Unfortunately, all current
> > ZONE_DEVICE implementations deploy a layering violation to manage
> > requests to activate pages owned by a pgmap. Specifically, they take
> > steps like walking the pfns that were previously assigned at
> > memremap_pages() time and using pfn_to_page() to recall metadata like
> > page->pgmap, or making use of other data like page->zone_device_data.
> >
> > The first step towards correcting that situation is to provide an API
> > to get access to a pgmap page that does not require the caller to
> > know the pfn, nor access any fields of an idle page. Ideally this API
> > would be able to support dynamic page creation instead of the current
> > status quo of pre-allocating and initializing pages.
>
> If I understand it correctly, the current code works because the struct
> pages are pre-allocated. Is the end-goal here to make the struct page
> allocation for ZONE_DEVICE pages dynamic?

Some DEVICE_PRIVATE users have already open-coded their own
coarse-grained dynamic ZONE_DEVICE pages by waiting to allocate chunks
on demand. The users that would benefit from a general dynamic
ZONE_DEVICE facility are cases like VMs backed by device-dax instances.
Unless the VM calls for bare metal services there is no need to map
pages for the device-dax instance in the hypervisor.

So, the end goal here is to just add some sanity to ZONE_DEVICE page
reference counting so that page access can be arbitrated rather than
just calling pfn_to_page() and assuming the page is already there.
Dynamic ZONE_DEVICE becomes something that is possible once that sanity
is in place.

> > On a prompt from Jason, introduce pgmap_request_folio() that operates
> > on an offset into a pgmap.
>
> This looks like it would make it fairly easy to request larger (higher
> order) folios for physically contiguous device memory allocations in
> the future.
>
> > It replaces the short-lived pgmap_request_folios() that was
> > continuing the layering violation of assuming pages are available to
> > be consulted before asking the pgmap to make them available.
> >
> > For now this only converts the callers to lookup the pgmap and
> > generate the pgmap offset, but it does not do the deeper cleanup of
> > teaching those call sites to generate those arguments without walking
> > the page metadata. For next steps it appears the DEVICE_PRIVATE
> > implementations could plumb the pgmap into the necessary callsites
> > and switch to using gen_pool_alloc() to track which offsets of a
> > pgmap are allocated.
>
> Wouldn't that duplicate whatever device memory allocator we already
> have in our driver? Couldn't I just take the memory allocation from
> our TTM allocator and make the necessary pgmap_request_folio() calls
> to allocate the corresponding pages from the pgmap?

I think you could, as long as the output from that allocator is a
pgmap_offset rather than a pfn.

> Or does the pgmap allocation need a finer granularity than the device
> memory allocation?
I would say the pgmap *allocation* happens at memremap_pages() time. pgmap_request_folio() is a request to put a pgmap page into service. So, yes, I think you can bring your own allocator for what offsets are in/out of service in pgmap space.
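Concretely, the "bring your own allocator" split could look like the
following for a driver: the driver tracks free pgmap offsets itself
(gen_pool being the option the changelog floats) and only calls into
the pgmap core to activate a page. A sketch, assuming the proposed
pgmap_request_folio() returns NULL on failure; the helper name is
illustrative:

#include <linux/genalloc.h>
#include <linux/memremap.h>

/*
 * Sketch: driver-owned offset allocator feeding pgmap_request_folio().
 * The pool would be seeded once with the pgmap's page offsets (biased
 * so that 0 can mean allocation failure).
 */
static struct folio *drv_alloc_device_folio(struct dev_pagemap *pgmap,
					    struct gen_pool *pool)
{
	unsigned long pgmap_offset = gen_pool_alloc(pool, 1);
	struct folio *folio;

	if (!pgmap_offset)
		return NULL;

	folio = pgmap_request_folio(pgmap, pgmap_offset, 0);
	if (!folio)
		gen_pool_free(pool, pgmap_offset, 1);
	return folio;
}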
Re: [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
Jason Gunthorpe wrote:
> On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> > A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> > pages. The pgmap is a reference counted object that serves a similar
> > role as a 'struct request_queue'. Live references are obtained for
> > each in flight request / page, and once a page's reference count
> > drops to zero the associated pin of the pgmap is dropped as well.
> > While a page is idle nothing should be accessing it because that is
> > effectively a use-after-free situation. Unfortunately, all current
> > ZONE_DEVICE implementations deploy a layering violation to manage
> > requests to activate pages owned by a pgmap. Specifically, they take
> > steps like walking the pfns that were previously assigned at
> > memremap_pages() time and using pfn_to_page() to recall metadata like
> > page->pgmap, or making use of other data like page->zone_device_data.
> >
> > The first step towards correcting that situation is to provide an API
> > to get access to a pgmap page that does not require the caller to
> > know the pfn, nor access any fields of an idle page. Ideally this API
> > would be able to support dynamic page creation instead of the current
> > status quo of pre-allocating and initializing pages.
> >
> > On a prompt from Jason, introduce pgmap_request_folio() that operates
> > on an offset into a pgmap. It replaces the short-lived
> > pgmap_request_folios() that was continuing the layering violation of
> > assuming pages are available to be consulted before asking the pgmap
> > to make them available.
> >
> > For now this only converts the callers to lookup the pgmap and
> > generate the pgmap offset, but it does not do the deeper cleanup of
> > teaching those call sites to generate those arguments without walking
> > the page metadata. For next steps it appears the DEVICE_PRIVATE
> > implementations could plumb the pgmap into the necessary callsites
> > and switch to using gen_pool_alloc() to track which offsets of a
> > pgmap are allocated. For DAX, dax_direct_access() could switch from
> > returning pfns to returning the associated @pgmap and @pgmap_offset.
> > Those changes are saved for follow-on work.
>
> I like it, though it would be nice to see drivers converted away from
> pfn_to_pgmap_offset()...

I think since there is no urgent need for this series to move forward
in v6.2 I can take the time to kill the need for pfn_to_pgmap_offset()
and circle back for this in v6.3. The urgency in my mind is not there
because:

1/ Physical memory-device-hotplug is not common; that does not arrive
   until CXL 2.0 hosts are shipping in volume. Note that's distinct from
   ACPI hotplug, which is platform-firmware coordinated.

2/ Beyond the initial growing pains with folios and DAX-pages there are
   no additional collisions on the v6.2 horizon.

3/ I have not seen any MEMORY_DEVICE_PRIVATE users attempt the
   pfn_to_pgmap_offset() conversion, so no patches are dependent on this
   moving forward in v6.2.

If someone sees an urgency reason I missed then let me know.
> > /**
> > - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> > - * @pgmap: host page map for the folio array
> > - * @folio: start of the folio list, all subsequent folios have same folio_size()
> > + * pgmap_request_folio - activate a folio of a given order in @pgmap
> > + * @pgmap: host page map of the folio to activate
> > + * @pgmap_offset: page-offset into the pgmap to request
> > + * @order: expected folio_order() of the folio
> >  *
> >  * Caller is responsible for @pgmap remaining live for the duration of
> > - * this call. Caller is also responsible for not racing requests for the
> > - * same folios.
> > + * this call. The order (size) of the folios in the pgmap are assumed
> > + * stable before this call.
> >  */
>
> I would probably add some discussion here that this enables
> refcounting on the folio and the pgmap_ops page free will be called
> once the folio is no longer being used.
>
> And explain that the pgmap user is responsible for tracking which
> pgmap_offsets are requested and which have been returned by free. It
> would be nice to say that this can only be called on free'd folios.

Ok.

> > -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > -		int nr_folios)
> > +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> > +		pgoff_t pgmap_offset, int order)
>
> unsigned int order?

Sure.

> > {
> > - s
mm: fix cache mode tracking in vm_insert_mixed() breaks AMDGPU [was: Re: Latest testing with drm-next-4.9-wip and latest LLVM/mesa stack - Regression in PowerPlay/DPM on CIK?]
On Sun, Oct 16, 2016 at 1:53 PM, Dave Airlie wrote:
> On 17 October 2016 at 04:41, Marek Olšák wrote:
>> On Fri, Oct 14, 2016 at 3:33 AM, Michel Dänzer wrote:
>>>
>>> [ Adding Dan Williams and dri-devel ]
>>>
>>> On 14/10/16 03:28 AM, Shawn Starr wrote:
>>>> Hello AMD folks,
>>>>
>>>> I have discovered a problem in Linus master that affects AMDGPU;
>>>> nobody would notice this in drm-next-4.9-wip since it's not in this
>>>> repo.
>>>
>>> [...]
>>>
>>>> 87744ab3832b83ba71b931f86f9cfdb000d07da5 is the first bad commit
>>>> commit 87744ab3832b83ba71b931f86f9cfdb000d07da5
>>>> Author: Dan Williams
>>>> Date: Fri Oct 7 17:00:18 2016 -0700
>>>>
>>>>     mm: fix cache mode tracking in vm_insert_mixed()
>>>>
>>>>     vm_insert_mixed() unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
>>>>     fails to check the pgprot_t it uses for the mapping against the one
>>>>     recorded in the memtype tracking tree. Add the missing call to
>>>>     track_pfn_insert() to preclude cases where incompatible aliased
>>>>     mappings are established for a given physical address range.
>>>>
>>>>     Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
>>>>     Signed-off-by: Dan Williams
>>>>     Cc: David Airlie
>>>>     Cc: Matthew Wilcox
>>>>     Cc: Ross Zwisler
>>>>     Signed-off-by: Andrew Morton
>>>>     Signed-off-by: Linus Torvalds
>>>>
>>>> :040000 040000 7517c0019fe49c1830b5a1d81f1dc099c5aab98a fd497a604a2af5995db2b8ed1e9c640bede6adf3 M mm
>>>>
>>>> Removal of this patch stops graphics stalls.
>>>
>>> Thanks for bisecting this Shawn.
>>>
>>>> A friend of mine mentions,
>>>>
>>>> "looks like a graphics thingy you depend on is requesting a mapping
>>>> with a not-allowed cache mode, and now you are (rightfully) getting
>>>> errors?"
>>>
>>> It would be nice to get some more specific pointers to what amdgpu (or
>>> maybe ttm, since that calls vm_insert_mixed in ttm_bo_vm_fault) might
>>> be doing wrong.
>
>	/*
>	 * We'd like to use VM_PFNMAP on shared mappings, where
>	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>	 * bad for performance. Until that has been sorted out, use
>	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>	 */
>	vma->vm_flags |= VM_MIXEDMAP;
>
> We have that comment in the ttm code, which to me implies that mixed is
> doing the right thing now, but that it is slow compared to the
> interface we should be using.

Aren't there only 2 possibilities for this regression?

1/ a memtype entry was never made so track_pfn_insert() returns an
   uncached mapping

2/ a conflicting memtype entry exists and undefined behavior due to
   mixed mapping types is avoided with the change.
mm: fix cache mode tracking in vm_insert_mixed() breaks AMDGPU [was: Re: Latest testing with drm-next-4.9-wip and latest LLVM/mesa stack - Regression in PowerPlay/DPM on CIK?]
On Mon, Oct 17, 2016 at 8:48 PM, Dave Airlie wrote:
[..]
>>> Aren't there only 2 possibilities for this regression?
>>>
>>> 1/ a memtype entry was never made so track_pfn_insert() returns an
>>>    uncached mapping
>>>
>>> 2/ a conflicting memtype entry exists and undefined behavior due to
>>>    mixed mapping types is avoided with the change.
>>
>> 3/ The CPU usage through this path goes up, and slows things down,
>>    though I suspect it's more an uncached mapping showing up
>>    when we don't expect it.
>
> It's looking like number 1: there is no mapping, so now we get uncached
> where we used to get write-through.
>
> difference in page prot 7f7bbc0e, pfn 200e71e4,
> 8037, 802f
>
> 0x2f is the vma pg prot, which has PWT set in it; 0x37 is the returned
> pgprot, which lacks that bit.
>
> not sure where to go from here, suggestions?

If the driver established an ioremap_wt() across the range, or just
called reserve_memtype() directly, that should restore WT mappings.
Although Daniel's suggestion to use the i915 mapping helpers sounds
like it avoids problem 3/ as well.
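For readers decoding those two values: the low PTE flag bits differ
only in PWT vs. PCD, which (together with the PAT bit) index the x86
PAT. An annotated breakdown, assuming Linux's default PAT programming
of WB/WC/UC-/UC in entries 0-3:

/*
 * 0x2f = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_PWT | _PAGE_ACCESSED
 * 0x37 = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_PCD | _PAGE_ACCESSED
 *
 * PWT=1/PCD=0 indexes PAT entry 1 (write-combining under Linux's
 * default setup) -- what the vma asked for. PWT=0/PCD=1 indexes PAT
 * entry 2 (UC-) -- what track_pfn_insert() falls back to when no
 * memtype reservation exists for the range.
 */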
Enabling peer to peer device transactions for PCIe devices
On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander wrote:
> This is certainly not the first time this has been brought up, but I'd
> like to try and get some consensus on the best way to move this
> forward. Allowing devices to talk directly improves performance and
> reduces latency by avoiding the use of staging buffers in system
> memory. Also in cases where both devices are behind a switch, it
> avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect,
> CUDA, HSA) that deal with this are pointer-based. Ideally we'd be able
> to take a CPU virtual address and be able to get to a physical address
> taking into account IOMMUs, etc. Having struct pages for the memory
> would allow it to work more generally and wouldn't require as much
> explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for
> testing. This is obviously not a complete solution.
> - Device memory will be registered with the Linux memory sub-system by
>   creating corresponding struct page structures for device memory
> - get_user_pages_fast() will return corresponding struct pages when a
>   CPU address points to the device memory
> - put_page() will deal with struct pages for device memory
> [..]
> 4. iopmem
>    iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage device. For example it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
Enabling peer to peer device transactions for PCIe devices
On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote: > Dan, > > I personally like "device-DAX" idea but my concerns are: > > - How well it will co-exists with the DRM infrastructure / implementations >in part dealing with CPU pointers? Inside the kernel a device-DAX range is "just memory" in the sense that you can perform pfn_to_page() on it and issue I/O, but the vma is not migratable. To be honest I do not know how well that co-exists with drm infrastructure. > - How well we will be able to handle case when we need to "move"/"evict" >memory/data to the new location so CPU pointer should point to the new > physical location/address > (and may be not in PCI device memory at all)? So, device-DAX deliberately avoids support for in-kernel migration or overcommit. Those cases are left to the core mm or drm. The device-dax interface is for cases where all that is needed is a direct-mapping to a statically-allocated physical-address range be it persistent memory or some other special reserved memory range.
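As a concrete illustration of the model described above, a userspace
consumer of a device-DAX instance is just open() plus mmap(); the
device path and extent size below are assumptions for the example, and
an instance rejects mmap offsets/lengths that violate its configured
alignment:

/* Minimal userspace sketch: direct-map a device-DAX instance.
 * /dev/dax0.0 and the 2MiB size are illustrative.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;
	int fd = open("/dev/dax0.0", O_RDWR);
	if (fd < 0)
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;		/* loads/stores go straight to device memory */
	munmap(p, len);
	close(fd);
	return 0;
}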
Enabling peer to peer device transactions for PCIe devices
On Tue, Nov 22, 2016 at 12:10 PM, Daniel Vetter wrote: > On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams > wrote: >> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch >> wrote: >>> I personally like "device-DAX" idea but my concerns are: >>> >>> - How well it will co-exists with the DRM infrastructure / implementations >>>in part dealing with CPU pointers? >> >> Inside the kernel a device-DAX range is "just memory" in the sense >> that you can perform pfn_to_page() on it and issue I/O, but the vma is >> not migratable. To be honest I do not know how well that co-exists >> with drm infrastructure. >> >>> - How well we will be able to handle case when we need to "move"/"evict" >>>memory/data to the new location so CPU pointer should point to the new >>> physical location/address >>> (and may be not in PCI device memory at all)? >> >> So, device-DAX deliberately avoids support for in-kernel migration or >> overcommit. Those cases are left to the core mm or drm. The device-dax >> interface is for cases where all that is needed is a direct-mapping to >> a statically-allocated physical-address range be it persistent memory >> or some other special reserved memory range. > > For some of the fancy use-cases (e.g. to be comparable to what HMM can > pull off) I think we want all the magic in core mm, i.e. migration and > overcommit. At least that seems to be the very strong drive in all > general-purpose gpu abstractions and implementations, where memory is > allocated with malloc, and then mapped/moved into vram/gpu address > space through some magic, but still visible on both the cpu and gpu > side in some form. Special device to allocate memory, and not being > able to migrate stuff around sound like misfeatures from that pov. Agreed. For general purpose P2P use cases where all you want is direct-I/O to a memory range that happens to be on a PCIe device then I think a special device fits the bill. For gpu P2P use cases that already have migration/overcommit expectations then it is not a good fit.
Enabling peer to peer device transactions for PCIe devices
On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>>
>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>>
>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>>>
>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>
>>>>> - How well it will co-exists with the DRM infrastructure /
>>>>>   implementations in part dealing with CPU pointers?
>>>>
>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma
>>>> is not migratable. To be honest I do not know how well that co-exists
>>>> with drm infrastructure.
>>>>
>>>>> - How well we will be able to handle case when we need to
>>>>>   "move"/"evict" memory/data to the new location so CPU pointer
>>>>>   should point to the new physical location/address
>>>>>   (and may be not in PCI device memory at all)?
>>>>
>>>> So, device-DAX deliberately avoids support for in-kernel migration
>>>> or overcommit. Those cases are left to the core mm or drm. The
>>>> device-dax interface is for cases where all that is needed is a
>>>> direct-mapping to a statically-allocated physical-address range be
>>>> it persistent memory or some other special reserved memory range.
>>>
>>> For some of the fancy use-cases (e.g. to be comparable to what HMM
>>> can pull off) I think we want all the magic in core mm, i.e.
>>> migration and overcommit. At least that seems to be the very strong
>>> drive in all general-purpose gpu abstractions and implementations,
>>> where memory is allocated with malloc, and then mapped/moved into
>>> vram/gpu address space through some magic,
>>
>> It is possible that there is another way around: memory is requested
>> to be allocated and should be kept in vram for performance reasons,
>> but due to a possible overcommit case we may need, at least
>> temporarily, to "move" such an allocation to system memory.
>
> With migration I meant migrating both ways of course. And with stuff
> like numactl we can also influence where exactly the malloc'ed memory
> is allocated originally, at least if we'd expose the vram range as a
> very special numa node that happens to be far away and not hold any
> cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior. The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose allocations
out of that range by default.

Btw, strict isolation is another design point of device-dax, but I
think in this case we're describing something between the two extremes
of full isolation and full compatibility with existing numactl apis.
Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should
>> be locked from being evicted. So if there's a get_user_pages call it
>> needs to be pinned until the put_page. The main issue being with the
>> RDMA case: handling an eviction when a chunk of memory has been
>> registered as an MR would be very tricky. The MR may be relied upon
>> by another host, and the kernel would have to inform user-space that
>> the MR was invalid; then user-space would have to tell the remote
>> application.
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand
> Paging)? See also the output of git grep -nHi on.demand.paging. See
> also https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

I think we should handle the graphics P2P concerns separately from the
general P2P-DMA case since the latter does not require the higher-order
memory management facilities. Using ZONE_DEVICE/DAX mappings to avoid
changes to every driver that wants to support P2P-DMA separately from
typical DMA still seems the path of least resistance.
Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 1:55 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>> > As I said, there is no possible special handling. Standard IB
>> > hardware does not support changing the DMA address once a MR is
>> > created. Forget about doing that.
>>
>> Yeah, that's essentially the point I was trying to make. Not to
>> mention all the other unrelated hardware that can't DMA to an address
>> that might disappear mid-transfer.
>
> Right, it is impossible to ask for generic page migration with ongoing
> DMA. That is simply not supported by any of the hardware at all.
>
>> > Only ODP hardware allows changing the DMA address on the fly, and it
>> > works at the page table level. We do not need special handling for
>> > RDMA.
>>
>> I am aware of ODP but, as noted by others, it doesn't provide a
>> general solution to the points above.
>
> How do you mean?
>
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.
>
> Page table mirroring *is* the general solution for this problem. The
> GPU driver controls the VMA and the DMA driver mirrors that VMA.
>
> Do you know of another option that doesn't just degenerate to page
> table mirroring?
>
> Remember, there are two facets to the RDMA ODP implementation, I feel
> there is some confusion here..
>
> The crucial part for this discussion is the ability to fence and block
> DMA for a specific range. This is the hardware capability that lets
> page migration happen: fence & block DMA, migrate page, update page
> table in HCA, unblock DMA.

Wait, ODP requires migratable pages, and ZONE_DEVICE pages are not
migratable. You can't replace a PCIe mapping with just any other
System RAM physical address, right? At least not without a filesystem
recording where things went, but at that point we're no longer talking
about the base P2P-DMA mapping mechanism and are instead talking about
something like pnfs-rdma to a DAX filesystem.
[PATCH v2 0/2] fix cache mode tracking for pmem + dax
While writing an improved changelog, as prompted by Andrew, for v1 of
"mm: fix cache mode of dax pmd mappings" [1], it struck me that
vmf_insert_pfn_pmd() is implemented correctly. Instead, it is the
memtype tree that is missing a memtype reservation for
devm_memremap_pages() ranges. vmf_insert_pfn_pmd() is correct to
validate the memtype before inserting a mapping, but this highlights
that vm_insert_mixed() is missing this validation.

I would still like to take patch 1 through the nvdimm.git tree, with
-mm acks, along with the device-dax fixes for v4.8-rc6. Patch 2 can go
the typical -mm route for v4.9 since it has potential to change
behavior in its DRI usages, needs soak time in -next, and there are no
known memtype conflict problems it would fix.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-September/006781.html

---

Dan Williams (2):
      mm: fix cache mode of dax pmd mappings
      mm: fix cache mode tracking in vm_insert_mixed()

 arch/x86/mm/pat.c | 17 ++---
 kernel/memremap.c |  9 +
 mm/memory.c       |  8 ++--
 3 files changed, 25 insertions(+), 9 deletions(-)
[PATCH v2 1/2] mm: fix cache mode of dax pmd mappings
track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
uncacheable, rendering them impractical for application usage. DAX-pte
mappings are cached and the goal of establishing DAX-pmd mappings is to
attain more performance, not dramatically less (3 orders of magnitude).

track_pfn_insert() relies on a previous call to reserve_memtype() to
establish the expected page_cache_mode for the range. While memremap()
arranges for reserve_memtype() to be called, devm_memremap_pages() does
not. So, teach track_pfn_insert() and untrack_pfn() how to handle
tracking without a vma, and arrange for devm_memremap_pages() to
establish the write-back-cache reservation in the memtype tree.

Cc:
Cc: Matthew Wilcox
Cc: Andrew Morton
Cc: Ross Zwisler
Cc: Nilesh Choudhury
Cc: Kirill A. Shutemov
Reported-by: Toshi Kani
Reported-by: Kai Zhang
Signed-off-by: Dan Williams
---
 arch/x86/mm/pat.c | 17 ++---
 kernel/memremap.c |  9 +
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index ecb1b69c1651..170cc4ff057b 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -927,9 +927,10 @@ int track_pfn_copy(struct vm_area_struct *vma)
 }
 
 /*
- * prot is passed in as a parameter for the new mapping. If the vma has a
- * linear pfn mapping for the entire range reserve the entire vma range with
- * single reserve_pfn_range call.
+ * prot is passed in as a parameter for the new mapping. If the vma has
+ * a linear pfn mapping for the entire range, or no vma is provided,
+ * reserve the entire pfn + size range with single reserve_pfn_range
+ * call.
 */
 int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 		    unsigned long pfn, unsigned long addr, unsigned long size)
@@ -938,11 +939,12 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 	enum page_cache_mode pcm;
 
 	/* reserve the whole chunk starting from paddr */
-	if (addr == vma->vm_start && size == (vma->vm_end - vma->vm_start)) {
+	if (!vma || (addr == vma->vm_start
+				&& size == (vma->vm_end - vma->vm_start))) {
 		int ret;
 
 		ret = reserve_pfn_range(paddr, size, prot, 0);
-		if (!ret)
+		if (ret == 0 && vma)
 			vma->vm_flags |= VM_PAT;
 		return ret;
 	}
@@ -997,7 +999,7 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	resource_size_t paddr;
 	unsigned long prot;
 
-	if (!(vma->vm_flags & VM_PAT))
+	if (vma && !(vma->vm_flags & VM_PAT))
 		return;
 
 	/* free the chunk starting from pfn or the whole chunk */
@@ -1011,7 +1013,8 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 		size = vma->vm_end - vma->vm_start;
 	}
 	free_pfn_range(paddr, size);
-	vma->vm_flags &= ~VM_PAT;
+	if (vma)
+		vma->vm_flags &= ~VM_PAT;
 }
 
 /*
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 251d16b4cb41..b501e390bb34 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -247,6 +247,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
 	arch_remove_memory(align_start, align_size);
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
 			"%s: failed to free all reserved pages\n", __func__);
@@ -282,6 +283,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap)
 {
 	resource_size_t key, align_start, align_size, align_end;
+	pgprot_t pgprot = PAGE_KERNEL;
 	struct dev_pagemap *pgmap;
 	struct page_map *page_map;
 	int error, nid, is_ram;
@@ -351,6 +353,11 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (nid < 0)
 		nid = numa_mem_id();
 
+	error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start), 0,
+			align_size);
+	if (error)
+		goto err_pfn_remap;
+
 	error = arch_add_memory(nid, align_start, align_size, true);
 	if (error)
 		goto err_add_memory;
@@ -371,6 +378,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	return __va(res->start);
 
  err_add_memory:
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+ err_pfn_remap:
 err_radix:
 	pgmap_radix_release(res);
 	devres_free(page_map);
[PATCH v2 2/2] mm: fix cache mode tracking in vm_insert_mixed()
vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
fails to check the pgprot_t it uses for the mapping against the one
recorded in the memtype tracking tree. Add the missing call to
track_pfn_insert() to preclude cases where incompatible aliased
mappings are established for a given physical address range.

Cc: David Airlie
Cc: dri-devel@lists.freedesktop.org
Cc: Matthew Wilcox
Cc: Andrew Morton
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 mm/memory.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 83be99d9d8a1..8841fed328f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1649,10 +1649,14 @@ EXPORT_SYMBOL(vm_insert_pfn_prot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn)
 {
+	pgprot_t pgprot = vma->vm_page_prot;
+
 	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
+	if (track_pfn_insert(vma, &pgprot, pfn))
+		return -EINVAL;
 
 	/*
 	 * If we don't have pte special, then we have to use the pfn_valid()
@@ -1670,9 +1674,9 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		 * result in pfn_t_has_page() == false.
 		 */
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
-		return insert_page(vma, addr, page, vma->vm_page_prot);
+		return insert_page(vma, addr, page, pgprot);
 	}
-	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+	return insert_pfn(vma, addr, pfn, pgprot);
 }
 EXPORT_SYMBOL(vm_insert_mixed);
[PATCH v1 05/10] ACPI: switch to use generic UUID API
On Wed, Feb 17, 2016 at 4:17 AM, Andy Shevchenko wrote:
> Instead of opencoding the existing library functions let's use them directly.
>
> The conversion fixes a potential bug in int340x_thermal as well since we have
> to use memcmp() on binary data.
>
> Signed-off-by: Andy Shevchenko
> ---
>  drivers/acpi/acpi_extlog.c                         |  8 +++---
>  drivers/acpi/bus.c                                 | 29 ++-
>  drivers/acpi/nfit.c                                | 34 +++
>  drivers/acpi/nfit.h                                |  3 +-
>  drivers/acpi/utils.c                               |  4 +--
>  drivers/char/tpm/tpm_crb.c                         |  9 +++---
>  drivers/char/tpm/tpm_ppi.c                         | 20 ++---
>  drivers/gpu/drm/i915/intel_acpi.c                  | 14 --
>  drivers/gpu/drm/nouveau/nouveau_acpi.c             | 20 ++---
>  drivers/gpu/drm/nouveau/nvkm/subdev/mxm/base.c     |  9 +++---
>  drivers/hid/i2c-hid/i2c-hid.c                      |  9 +++---
>  drivers/iommu/dmar.c                               | 11
>  drivers/pci/pci-acpi.c                             | 11
>  drivers/pci/pci-label.c                            |  4 +--
>  drivers/thermal/int340x_thermal/int3400_thermal.c  |  6 ++--
>  drivers/usb/host/xhci-pci.c                        |  9 +++---
>  include/acpi/acpi_bus.h                            | 10 ---
>  include/linux/acpi.h                               |  2 +-
>  include/linux/pci-acpi.h                           |  2 +-
>  sound/soc/intel/skylake/skl-nhlt.c                 |  7 +++--
>  20 files changed, 92 insertions(+), 129 deletions(-)
[..]
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index ad6d8c6..3beb99b 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -43,11 +43,11 @@ struct nfit_table_prev {
>  	struct list_head flushes;
>  };
>
> -static u8 nfit_uuid[NFIT_UUID_MAX][16];
> +static uuid_le nfit_uuid[NFIT_UUID_MAX];
>
> -const u8 *to_nfit_uuid(enum nfit_uuids id)
> +const uuid_le *to_nfit_uuid(enum nfit_uuids id)
>  {
> -	return nfit_uuid[id];
> +	return &nfit_uuid[id];
>  }
>  EXPORT_SYMBOL(to_nfit_uuid);
>
> @@ -83,7 +83,7 @@ static int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc,
>  	const char *cmd_name, *dimm_name;
>  	unsigned long dsm_mask;
>  	acpi_handle handle;
> -	const u8 *uuid;
> +	const uuid_le *uuid;
>  	u32 offset;
>  	int rc, i;
>
> @@ -225,7 +225,7 @@ static int nfit_spa_type(struct acpi_nfit_system_address *spa)
>  	int i;
>
>  	for (i = 0; i < NFIT_UUID_MAX; i++)
> -		if (memcmp(to_nfit_uuid(i), spa->range_guid, 16) == 0)
> +		if (!uuid_le_cmp(*to_nfit_uuid(i), *(uuid_le *)spa->range_guid))
>  			return i;
>  	return -1;
>  }
> @@ -829,7 +829,7 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
>  {
>  	struct acpi_device *adev, *adev_dimm;
>  	struct device *dev = acpi_desc->dev;
> -	const u8 *uuid = to_nfit_uuid(NFIT_DEV_DIMM);
> +	const uuid_le *uuid = to_nfit_uuid(NFIT_DEV_DIMM);
>  	int i;
>
>  	nfit_mem->dsm_mask = acpi_desc->dimm_dsm_force_en;
> @@ -909,7 +909,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
>  static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
>  {
>  	struct nvdimm_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
> -	const u8 *uuid = to_nfit_uuid(NFIT_DEV_BUS);
> +	const uuid_le *uuid = to_nfit_uuid(NFIT_DEV_BUS);
>  	struct acpi_device *adev;
>  	int i;
>
> @@ -2079,16 +2079,16 @@ static __init int nfit_init(void)
>  	BUILD_BUG_ON(sizeof(struct acpi_nfit_control_region) != 80);
>  	BUILD_BUG_ON(sizeof(struct acpi_nfit_data_region) != 40);
>
> -	acpi_str_to_uuid(UUID_VOLATILE_MEMORY, nfit_uuid[NFIT_SPA_VOLATILE]);
> -	acpi_str_to_uuid(UUID_PERSISTENT_MEMORY, nfit_uuid[NFIT_SPA_PM]);
> -	acpi_str_to_uuid(UUID_CONTROL_REGION, nfit_uuid[NFIT_SPA_DCR]);
> -	acpi_str_to_uuid(UUID_DATA_REGION, nfit_uuid[NFIT_SPA_BDW]);
> -	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_VDISK]);
> -	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_CD, nfit_uuid[NFIT_SPA_VCD]);
> -	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_PDISK]);
> -	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_CD, nfit_uuid[NFIT_SPA_PCD]);
> -	acpi_str_to_uuid(UUID_NFIT_BUS, nfit_uuid[NFIT_DEV_BUS]);
> -	acpi_str_to_uuid(UUID_NFIT_DIMM, nfit_uuid[NFIT_DEV_DIMM]);
> +	uuid_le_to_bin(UUID_VOLATILE_MEMORY, &nfit_uuid[NFIT_SPA_VOLATILE]);
> +	uuid_le_to_bin(UUID_PERSISTENT_MEMORY, &nfit_uuid[NFIT_SPA_PM]);
> +	uuid_le_to_bin(UUID_CONTROL_REGION, &nfit_uuid[NFIT_SPA_DCR]);
> +	uuid_le_to_bin(UUID_DATA_REGION, &nfit_uuid[NFIT_SPA_BDW]);
> +	uuid_le_to_bin(UUID_VOLATI