upper layers, e.g. the default pci_iomap() for MMIOs doesn't need to specify
anything on the cache mode. There's the pci_iomap_wc() variant, but it still
looks like only the internals have full control..
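E.g. (a sketch; pdev and the BAR numbers are made up):

	void __iomem *regs, *fb;

	/* BAR 0: default mapping, nothing to specify on cache mode */
	regs = pci_iomap(pdev, 0, 0);

	/* BAR 2: the write-combine variant, the only knob exposed upward */
	fb = pci_iomap_wc(pdev, 2, 0);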
--
Peter Xu
On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:21, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> > >
> > > > > What it does on PAT (only implementation so far ...) is looking up the
fd_ctx; /* 176 0 */
Not something that matters.. but just in case you didn't use the config file
you expected to use..
--
Peter Xu
> > looking up the memtype, so I would expect it is
> > guaranteed all pfns under the same vma are following the verified (and same)
> > memtype?
>
> The whole point of track_pfn_insert() is that it is used when we *don't* use
> reserve_pfn_range()->track_pfn_remap(), no?
>
> track_pfn_remap() would check the whole range that gets mapped, so
> track_pfn_insert() user must similarly check the whole range that gets
> mapped.
>
> Note that even track_pfn_insert() is already pretty clear on the intended
> usage: "called when a _new_ single pfn is established"
We need to define "new" then.. But I agree it's not crystal clear at
least. I think I just wasn't the first to assume it was reserved, see this
(especially, the "Expectation" part..):
commit 5180da410db6369d1f95c9014da1c9bc33fb043e
Author: Suresh Siddha
Date: Mon Oct 8 16:28:29 2012 -0700
x86, pat: separate the pfn attribute tracking for remap_pfn_range and
vm_insert_pfn
With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
attribute and uses it. Expectation is that the driver reserves the
memory attributes for the pfn before calling vm_insert_pfn().
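I.e. the expected driver pattern was roughly this (a sketch; bar_start,
bar_size and pfn are made up, and I'm using the modern
arch_io_reserve_memtype_wc()/vmf_insert_pfn() names):

	/* once, at driver init: reserve attributes for the whole BAR */
	err = arch_io_reserve_memtype_wc(bar_start, bar_size);

	/* later, per fault: insert pfns that fall in the reserved range */
	ret = vmf_insert_pfn(vma, vmf->address, pfn);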
--
Peter Xu
On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
> On 25.04.25 22:23, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> > > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > > wheth
On Fri, Apr 25, 2025 at 09:48:50PM +0200, David Hildenbrand wrote:
> On 25.04.25 21:31, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> > > ... by factoring it out from track_pfn_remap().
> > >
> > > For PMDs/PUDs
it slightly slower than a vma->pfnmap_track_ctx ref when
looking up the pfn while holding a vma ref, but I assume that's ok considering
that track/untrack should be a slow path for pfnmaps, and there shouldn't be
a huge number of pfnmaps.
I didn't think further, but if that'll work it'll definitely avoid the
additional fields on x86 vmas. I'm curious whether you explored that
direction, or maybe it's a known decision that the 8 bytes isn't a concern.
Thanks,
--
Peter Xu
(range));
> err_kasan:
> - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> + pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
Not a huge deal, but maybe we could merge this and the previous patch? It
might be easier to reference the impl when reading the call site changes.
> err_pfn_remap:
> pgmap_array_delete(range);
> return error;
> --
> 2.49.0
>
--
Peter Xu
563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct
> *vma, unsigned long addr,
> if (!pfn_modify_allowed(pfn, pgprot))
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> + pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
>
> return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
> false);
> @@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct
> vm_area_struct *vma,
> if (addr < vma->vm_start || addr >= vma->vm_end)
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
>
> if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
> return VM_FAULT_SIGBUS;
> --
> 2.49.0
>
--
Peter Xu
guess people are just waiting for hardware vendors to support device page
faults, like processors do - then we can synchronize with the device using the
device IOMMU page tables (by clearing them at the proper time to block DMA
writes).
Thanks,
--
Peter Xu
ing ram discard just like what vfio does already.
Thanks,
--
Peter Xu
ot after VM started there, so either not affected or maybe there's
chance it'll work.
IIUC it's then the same as when VFIO is attached and we try to blow some pages
away with anything like virtio-balloon - AFAIR qemu just explicitly doesn't
allow that to happen. See vfio_ram_block_discard_disable().
>
> > FOLL_GET). The good thing is then the fd can be the guest memory file
> > itself. With that, we can mmap() over the shmem/hugetlb in whatever vma
> > and whatever process. Truncation (and actually everything... e.g. page
> > migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> > just all start to work, afaiu.
> IIUC, we'd not be able to use the fd of the guest memory file because the
> dmabuf fds are expected to have constant size that reflects the size of the
> buffer that is being shared. I just don't think it'd be feasible given all the
> other restrictions:
> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes
Yeah I also don't know the dmabuf APIs well, but I think if the page must
be pinned for real world DMA then it's already another story to me.. what
I said on the [guest_mem_fd, offset_array] tuple idea could only (if still
possible..) work if the udmabuf access is only from the processor side,
never from the device.
Thanks,
--
Peter Xu
can be e.g. a list of
file offsets that point to the pages (rather than pinning the pages using
FOLL_GET). The good thing is then the fd can be the guest memory file
itself. With that, we can mmap() over the shmem/hugetlb in whatever vma
and whatever process. Truncation (and actually everything... e.g. page
migration, swapping, ... which will be disabled if we use PFNMAP pins) will
just all start to work, afaiu.
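To illustrate, a purely hypothetical sketch (nothing like this exists yet;
all names are made up):

	/* the [guest_mem_fd, offset_array] tuple idea */
	struct guest_mem_buf {
		struct file *memfd;	/* the guest memory file itself */
		unsigned long nr_pages;
		pgoff_t offsets[];	/* resolved to pages only at access time */
	};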
Thanks,
--
Peter Xu
> > hugetlb and shmem
> > pages is different. hugetlb requires the pages exist in the page cache
> > while shmem will create/add pages to the cache if necessary.
> Ok; got it.
Would it be nice to do the same for both (perhaps always try to allocate)?
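I.e. the asymmetry is roughly (sketch; the wrapper is made up, the two
lookup APIs are real):

	struct page *lookup_guest_page(struct address_space *mapping,
				       pgoff_t pgoff, bool is_hugetlb)
	{
		if (is_hugetlb)
			/* hugetlb: only find what's already in the cache */
			return find_get_page(mapping, pgoff);
		/* shmem: allocate and add to the cache if necessary */
		return shmem_read_mapping_page(mapping, pgoff);
	}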
Thanks,
--
Peter Xu
On Tue, Jun 27, 2023 at 12:52:34PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 26, 2023 at 03:04:21PM -0400, Peter Xu wrote:
> > On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> > >
st udmabuf should only work with shared memory; not sure
about iommufd, though.
--
Peter Xu
> Having said that, one option I could think of is to probably install a
> mmu_notifier
> associated with the relevant pages in udmabuf and once we get notified about
> any invalidation event concerning any of the pages, we'd fail any subsequent
> attempt to access these pages and propagate the error across the stack.
Sounds right, maybe it needs to go back to the old GUP solution, though, as
mmu notifiers are also mm-based, not fd-based. Or to be explicit, I think
it'll be pin_user_pages(FOLL_LONGTERM) with the new API. It'll also solve
the movable pages issue on pinning.
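I.e. something like (sketch; exact flags to be checked):

	/* FOLL_LONGTERM migrates movable/CMA pages before pinning them */
	pinned = pin_user_pages_fast(start, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);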
>
> However, it feels like udmabuf is not the right place to handle this issue
> because
> there are very limited options for taking proper corrective action if Qemu
> decides
> to punch a hole in Guest memory that takes out pages that are pinned.
I'm not familiar with the use case that much, but IMHO it's fine if the
driver relies on proper behavior of the userspace app to work.
IIUC the worst case here is the udmabuf holds some pages that are not the
pages of the guest mem anymore, but that only happens with a misbehaved
userspace, so it looks all fine as long as they can at least be properly
released when releasing the udmabuf. It'll be better if the udmabuf can
fail hard when detecting this, but IMHO even that can be optional depending
on the need, while any corrective action would be one step further still.
Thanks,
--
Peter Xu
On Fri, Jun 23, 2023 at 01:37:58PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 23, 2023 at 12:35:45PM -0400, Peter Xu wrote:
>
> > It seems the previous concern on using gup was majorly fork(), if this is
> > it:
> >
> > https://patchwork.freedesktop.org/pa
>
> >
> >
> > There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> > the vfio/vdpa case we make sure that we disallow most of these, because
> > otherwise there can be an accidental "disconnect" between the pages
> > mapped into the VM (guest view) and the pages mapped into the IOMMU
> > (device view), for example, after a reboot.
> Ok; I am not sure if I can figure out if there is any acceptable way to
> address
> these issues but given the current constraints associated with udmabuf, what
> do you suggest is the most reasonable way to deal with these problems you
> have identified?
>
> Thanks,
> Vivek
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
>
--
Peter Xu
On Thu, Jun 15, 2023 at 09:15:26PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 15, 2023 at 03:33:12PM -0400, Peter Xu wrote:
> > My question is whether page_zonenum() is ready for taking all kinds of tail
> > pages?
> >
> > Zone device tail pages all look fine, per memm
; is_pci_p2pdma_page -> is_zone_device_page -> page_zonenum
I'm worried it'll just read fake things, but maybe I just missed something?
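For reference, page_zonenum() is simply (simplified, debug asserts dropped):

	static inline enum zone_type page_zonenum(const struct page *page)
	{
		/* reads the zone bits straight from this page's flags; no
		 * compound_head(), so a tail page must have them set too */
		return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
	}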
Thanks,
--
Peter Xu
On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > > >
On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > On F
On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> &
pmd_trans_huge() should return
true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
and do the split, then the CLEAR notify. Hmm.. Did I miss something?
--
Peter Xu
Does it also mean that if there's a real THP it won't really work? As then
FOLL_SPLIT_PMD will start to trigger that CLEAR notify too, I think..
--
Peter Xu
On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> >
> > [...]
> >
> > > +static bool page_make_de
> +	npages = get_user_pages_remote(mm, start, npages,
> +				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +				       pages, NULL, NULL);
> +	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> +		if (!trylock_page(pages[i])) {
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +			continue;
> +		}
> +
> +		if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
> +			unlock_page(pages[i]);
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +		}
> +	}
> +
> +	return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +#endif
> +
> void __put_anon_vma(struct anon_vma *anon_vma)
> {
> 	struct anon_vma *root = anon_vma->root;
> --
> 2.20.1
>
--
Peter Xu
> 		VM_WARN_ON_ONCE(1);
> }
> +
> + /* We've captured and resolved the error. Reset, try again. */
Maybe better as:

	/*
	 * We've resolved all errors even if there were any; reset the error
	 * code and try again if necessary.
	 */

as it also covers the no-error path. But I guess not a big deal..
Reviewed-by: Peter Xu
Thanks,
> +	ret = 0;
> +
> 	if (addr != end)
> 		goto again;
> out:
> --
> 2.20.1
>
--
Peter Xu
> can also benefit from this filtering, so rename the
> 'migrate_pgmap_owner' field to 'owner' and create a new notifier
> initialisation function to initialise this field.
>
> Signed-off-by: Alistair Popple
> Suggested-by: Peter Xu
Reviewed-by: Peter Xu
--
Peter Xu
On Fri, Jun 04, 2021 at 11:07:42AM +1000, Alistair Popple wrote:
> On Friday, 4 June 2021 12:47:40 AM AEST Peter Xu wrote:
> > On Thu, Jun 03, 2021 at 09:39:32PM +1000, Alistair Popple wrote:
> > > Rec
code. Could you elaborate?
--
Peter Xu
> page should not account for use by the device
> (are GPU workloads - access once and discard?)
Hmm, besides the aging info, this reminded me: do we need to isolate the page
from lru too when marking device exclusive access?
Afaict the current patch didn't do that, so I think it's reclaimable. If we
still have the rmap then we'll get an mmu notify CLEAR when unmapping that
special pte, so the device driver should be able to drop the ownership. However
we dropped the rmap when marking exclusive. Now I don't know whether and how
it'll work if page reclaim runs with the page being exclusively owned, without
isolating the page..
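E.g. something like this before dropping the rmap (sketch only; the
placement in page_make_device_exclusive() is my guess):

	/* keep reclaim away while the device owns the page exclusively */
	if (isolate_lru_page(page))
		return false;	/* already isolated or under reclaim */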
--
Peter Xu
calling split_huge_page() rather than split_huge_pmd().
But shouldn't FOLL_SPLIT_PMD have filled in small pfns for each pte? See
__split_huge_pmd_locked():
	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		...
		} else {
			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
			...
		}
		...
		set_pte_at(mm, addr, pte, entry);
	}
Then iiuc the coming follow_page_pte() will directly fetch the small pages?
--
Peter Xu
		ret = -EBUSY;
	} else {
		spin_unlock(ptl);
		split_huge_pmd(vma, pmd, address);
		ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
	}
	return ret ? ERR_PTR(ret) :
		follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
So I thought all pages are small pages?
--
Peter Xu
On Thu, May 27, 2021 at 11:20:36AM +1000, Alistair Popple wrote:
> On Thursday, 27 May 2021 5:50:05 AM AEST Peter Xu wrote:
> > On Mon, May 24, 2021 at 11:27:21PM +1000, Alistair Popple wrote:
> > > Currently if copy_nonpresent_pte() returns a non-zero value it is
> > >
			return -ENOMEM;
		/* We've captured and resolved the error. Reset, try again. */
		ret = 0;
	}
We didn't reset "ret" in the entry.val case (maybe we should?). Then in the next
round of "goto again" if "ret" is unluckily untouched, it could reach the 2nd
if check, and I think it could cause an unexpected page_copy_prealloc().
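I.e. something like (a sketch, assuming the entry.val branch looks the way
I remember):

	if (entry.val) {
		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
			return -ENOMEM;
		entry.val = 0;
		ret = 0;	/* also reset here before "goto again" */
	}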
--
Peter Xu
callers are unaffected. Not a big deal, though:
Reviewed-by: Peter Xu
Thanks,
--
Peter Xu
--
>
> v9:
> * Split rename of migrate_pgmap_owner into a separate patch.
> * Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
> * Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
> somewhat on a suggestion from Peter Xu. I was never particular
On Wed, May 19, 2021 at 10:28:42AM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> > On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > > Logically during fork all these device exclusive pages should be
> > >
On Wed, May 19, 2021 at 11:11:55PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> > On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> >
On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> &
ng about these two solutions and why we choose the GUP way?
--
Peter Xu
userspace device driver should fork() during busy interaction
with the device underneath..
In all cases, please still consider keeping them in copy_nonpresent_pte() (and
if reworking, separate patches would be great).
Thanks,
--
Peter Xu
respond to the changes in
copy_nonpresent_pte()) in this patch to guarantee it.
I also hope we don't make copy_pte_range() even more complicated just to do the
lock_page() right, so we could fail the fork() if the lock is hard to take.
--
Peter Xu
MADV_DONTFORK in the user app), meanwhile
make sure mapcount(page)==1 before granting the page to the device, so that
this will guarantee this mm owns this page forever, I think? It'll simplify
fork() too as a side effect, since VM_DONTCOPY vmas go away at fork().
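I.e. (sketch of the check I mean):

	/* only grant exclusive access when this mm solely owns the page */
	if (page_mapcount(page) != 1)
		return false;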
--
Peter Xu
On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > Indeed it'll be odd for a COW page since for COW page then it means
> > > > after
> > > > parent/child writing t
On Tue, May 18, 2021 at 02:33:34PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:
>
> > I also have a pure and high level question regarding a process fork() when
> > there're device exclusive ptes: would the two proces
).
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
>
> Yes, this is why "address" is passed up to ttp_args.
>
> > The odd part is the remote GUP should have walked the page ta
On Tue, May 18, 2021 at 09:58:09PM +1000, Alistair Popple wrote:
> On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> > On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> >
undergoing migration.
> + */
> +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> +{
> +	return is_migration_entry(entry) || is_device_private_entry(entry);
> +}
--
Peter Xu
> + * locking requirements of exec(), migration skips
> + * temporary VMAs until after exec() completes.
> + */
> + if (!PageKsm(page) && PageAnon(page))
I think we can drop the PageAnon() check as it's just done above.
I feel like this chunk was copied over from try_to_unmap(), but is that
necessary? Is it possible that the caller of make_device_exclusive_range()
passes in a temp stack vma during exec()?
> + rwc.invalid_vma = invalid_migration_vma;
> +
> + rmap_walk(page, &rwc);
> +
> + return ttp.valid && !page_mapcount(page);
> +}
> +
> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of associated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive access
> + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow filtering
s/too/to/?
> + *
> + * Returns: number of pages successfully marked for exclusive access
Hmm, I see that try_to_protect() can fail even if GUP returned npages, so
perhaps "number of pages successfully GUPed, however a page is marked for
exclusive access only if its page pointer is non-NULL", or something like that?
> + *
> + * This function finds ptes mapping page(s) to the given address range, locks
> + * them and replaces mappings with special swap entries preventing userspace CPU
s/userspace//? Since the same applies to kernel access?
(I don't think I fully read all the code in this patch, but I'll stop here
for today..)
Thanks,
> + * access. On fault these entries are replaced with the original mapping after
> + * calling MMU notifiers.
> + *
> + * A driver using this to program access from a device must use a mmu notifier
> + * critical section to hold a device specific lock during programming. Once
> + * programming is complete it should drop the page lock and reference after
> + * which point CPU access to the page will revoke the exclusive access.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, struct page **pages,
> +				void *arg)
> +{
> +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> +	unsigned long i;
> +
> +	npages = get_user_pages_remote(mm, start, npages,
> +				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +				       pages, NULL, NULL);
> +	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> +		if (!trylock_page(pages[i])) {
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +			continue;
> +		}
> +
> +		if (!try_to_protect(pages[i], mm, start, arg)) {
> +			unlock_page(pages[i]);
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +		}
> +	}
> +
> +	return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +
> void __put_anon_vma(struct anon_vma *anon_vma)
> {
> 	struct anon_vma *root = anon_vma->root;
> --
> 2.20.1
>
>
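For my own understanding, the driver-side pattern described in the comment
above would roughly be (everything except make_device_exclusive_range() is
made up):

	mutex_lock(&drv->notifier_lock);	/* the device specific lock */
	npages = make_device_exclusive_range(mm, start, end, pages, drv);
	for (i = 0; i < npages; i++) {
		if (!pages[i])
			continue;
		/* ... program device page tables for pages[i] ... */
		unlock_page(pages[i]);
		put_page(pages[i]);
	}
	mutex_unlock(&drv->notifier_lock);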
--
Peter Xu
d0
> kthread+0x114/0x130
> ? kthread_create_worker_on_cpu+0x40/0x40
> ret_from_fork+0x1f/0x30
> CR2: 0064
>
> This commit fixes the problem by using the mm pointer passed to the
> function rather than the bogus one in current.
>
> Fixes: 008cfe44