David Hildenbrand <da...@redhat.com> writes:

> On 07.07.22 21:03, Alex Sierra wrote:
>> From: Alistair Popple <apop...@nvidia.com>
>>
>> Currently any attempts to pin a device coherent page will fail. This is
>> because device coherent pages need to be managed by a device driver, and
>> pinning them would prevent a driver from migrating them off the device.
>>
>> However this is no reason to fail pinning of these pages. They are
>> coherent and accessible from the CPU, so they can be migrated just like
>> ZONE_MOVABLE pages are when pinned. So instead of failing all attempts
>> to pin them, first try migrating them out of ZONE_DEVICE.
>>
>> Signed-off-by: Alistair Popple <apop...@nvidia.com>
>> Acked-by: Felix Kuehling <felix.kuehl...@amd.com>
>> [hch: rebased to the split device memory checks,
>>       moved migrate_device_page to migrate_device.c]
>> Signed-off-by: Christoph Hellwig <h...@lst.de>
>> ---
>>  mm/gup.c            | 47 +++++++++++++++++++++++++++++++++++-----
>>  mm/internal.h       |  1 +
>>  mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 96 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b65fe8bf5af4..9b6b9923d22d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>                      continue;
>>              prev_folio = folio;
>>
>> -            if (folio_is_longterm_pinnable(folio))
>> +            /*
>> +             * Device private pages will get faulted in during gup so it
>> +             * shouldn't be possible to see one here.
>> +             */
>> +            if (WARN_ON_ONCE(folio_is_device_private(folio))) {
>> +                    ret = -EFAULT;
>> +                    goto unpin_pages;
>> +            }
>
> I'd just drop that. Device private pages are never part of a present PTE. So if
> we could actually get a grab of one via GUP we would be in bigger trouble ...
> already before this patch.

Fair.

>> +
>> +            /*
>> +             * Device coherent pages are managed by a driver and should not
>> +             * be pinned indefinitely as it prevents the driver moving the
>> +             * page. So when trying to pin with FOLL_LONGTERM instead try
>> +             * to migrate the page out of device memory.
>> +             */
>> +            if (folio_is_device_coherent(folio)) {
>> +                    WARN_ON_ONCE(PageCompound(&folio->page));
>
> Maybe that belongs into migrate_device_page()?

Ok (noting Matthew's comment there as well).

>> +
>> +                    /*
>> +                     * Migration will fail if the page is pinned, so convert
>
> [...]
>
>>  /*
>>   * mm/gup.c
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index cf9668376c5a..5decd26dd551 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>      }
>>  }
>>  EXPORT_SYMBOL(migrate_vma_finalize);
>> +
>> +/*
>> + * Migrate a device coherent page back to normal memory.  The caller should have
>> + * a reference on page which will be copied to the new page if migration is
>> + * successful or dropped on failure.
>> + */
>> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
>
> Function name should most probably indicate that we're dealing with coherent 
> pages here?

Ok.

>> +{
>> +    unsigned long src_pfn, dst_pfn = 0;
>> +    struct migrate_vma args;
>> +    struct page *dpage;
>> +
>> +    lock_page(page);
>> +    src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>> +    args.src = &src_pfn;
>> +    args.dst = &dst_pfn;
>> +    args.cpages = 1;
>> +    args.npages = 1;
>> +    args.vma = NULL;
>> +    migrate_vma_setup(&args);
>> +    if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>> +            return NULL;
>
> Wow, these refcount and page locking/unlocking rules with this migrate_* api are
> confusing now. And the usage here of sometimes returning and sometimes falling
> through doesn't make it particularly easier to understand here.
>
> I'm not 100% happy about reusing migrate_vma_setup() if there *is no VMA*.
> That's just absolutely confusing, because usually migrate_vma_setup() itself
> would do the collection step and ref+lock pages. :/
>
> In general, I can see why/how we're reusing the migrate_vma_* API here, but there
> is absolutely no VMA ... not sure what to improve besides providing a second API
> that does a simple single-page migration. But that can be changed later ...

Yeah, as noted in your other response, I think it should be ok to just
call migrate_vma_unmap() directly from migrate_device_page(), so I assume
that would adequately deal with this.

>> +
>> +    dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
>> +
>
> alloc_page()
>
>> +    /*
>> +     * get/pin the new page now so we don't have to retry gup after
>> +     * migrating. We already have a reference so this should never fail.
>> +     */
>> +    if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
>> +            __free_pages(dpage, 0);
>
> __free_page()
>
>> +            dpage = NULL;
>> +    }
>
> Hm, this means that we are not pinning via the PTE at hand, but via something
> we expect migration to put into the PTE. I'm not really happy about this.
>
> Ideally, we'd make the pinning decision only on the actual GUP path, not in 
> here.
> Just like in the migrate_pages() case, where we end up dropping all refs/pins
> and looking up again via GUP from the PTE.
>
> For example, I wonder if something nasty could happen if the PTE got mapped
> R/O in the meantime and you're pinning R/W here ...
>
> TBH, all this special casing on gup_flags here is nasty. Please, let's just do
> it like migrate_pages() and do another GUP walk. Absolutely no need to 
> optimize.

The only reason to pass gup_flags is to check FOLL_PIN vs. FOLL_GET so
that we can take the right kind of reference on the destination page. I
did the optimisation because we already have the destination page with a
reference, and GUP/PUP does not make any guarantees about the current PTE
state anyway.
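
For what it's worth, all that meant in practice was taking the matching
kind of reference on the destination page, roughly like the following
(this just spells out what the patch does via try_grab_page(); a
simplified sketch, not new behaviour):

        /*
         * Take on the destination page the same kind of reference GUP
         * would have taken on the source page.
         */
        if (gup_flags & FOLL_PIN) {
                /* FOLL_PIN: page reference plus pin accounting */
                if (WARN_ON_ONCE(!try_grab_page(dpage, FOLL_PIN)))
                        dpage = NULL;
        } else {
                /* FOLL_GET: a plain page reference */
                get_page(dpage);
        }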

However, I noticed there might be a race here: during migration we
replace present PTEs with migration entries. On fork these get copied
via copy_nonpresent_pte() and made read-only, but we don't check whether
the page a migration entry points to is pinned. For an ordinary PTE,
copy_present_pte() would copy the page for a COW mapping, but that won't
happen if the page happens to be undergoing migration (even though the
migration will ultimately fail due to the pin).
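
To make that a bit more concrete, here is a rough sketch of the two
fork-time copy paths as I understand them (heavily simplified, not the
actual mm/memory.c code; copy_page_for_child() is just a placeholder
name):

        /* Illustrative sketch of fork-time PTE copying, not real kernel code. */
        if (pte_present(pte)) {
                /*
                 * copy_present_pte() path: for a COW mapping a pinned page
                 * is copied immediately, so the pin keeps referring to the
                 * parent's page.
                 */
                if (is_cow_mapping(vm_flags) && page_maybe_dma_pinned(page))
                        return copy_page_for_child(...);        /* placeholder */
                /* otherwise parent and child PTEs are just write-protected */
        } else if (is_migration_entry(pte_to_swp_entry(pte))) {
                /*
                 * copy_nonpresent_pte() path: the migration entry is copied
                 * and made read-only with no equivalent pinned-page check,
                 * so a page that is being pinned while under migration can
                 * end up COW-shared with the child.
                 */
        }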

Anyway, I don't think this patch currently makes that any worse, but if
we fix the above it will, because there is a brief period during which
the page we're pinning won't look like a pinned page.

So I will go with the suggestion to do another GUP walk.

> [...]
>
>
>
> I'd go with something like the following on top (which does not touch on the
> general semantic issue with migrate_vma_*). Note that I most probably messed
> up some refcount/lock handling and that it may well be broken.
> Just to give you an idea of what I think could be cleaner.

Thanks! At a glance it looks roughly right but I will check and respin
it to incorporate the comments.

> diff --git a/mm/gup.c b/mm/gup.c
> index 9b6b9923d22d..17041b3e605e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>       unsigned long isolation_error_count = 0, i;
>       struct folio *prev_folio = NULL;
>       LIST_HEAD(movable_page_list);
> -     bool drain_allow = true;
> +     bool drain_allow = true, any_device_coherent = false;
>       int ret = 0;
>
>       for (i = 0; i < nr_pages; i++) {
> @@ -1891,15 +1891,6 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>                       continue;
>               prev_folio = folio;
>
> -             /*
> -              * Device private pages will get faulted in during gup so it
> -              * shouldn't be possible to see one here.
> -              */
> -             if (WARN_ON_ONCE(folio_is_device_private(folio))) {
> -                     ret = -EFAULT;
> -                     goto unpin_pages;
> -             }
> -
>               /*
>                * Device coherent pages are managed by a driver and should not
>                * be pinned indefinitely as it prevents the driver moving the
> @@ -1907,7 +1898,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>                * to migrate the page out of device memory.
>                */
>               if (folio_is_device_coherent(folio)) {
> -                     WARN_ON_ONCE(PageCompound(&folio->page));
> +                     /*
> +                      * We always want a new GUP lookup with device coherent
> +                      * pages.
> +                      */
> +                     any_device_coherent = true;
> +                     pages[i] = 0;
>
>                       /*
>                        * Migration will fail if the page is pinned, so convert
> @@ -1918,11 +1914,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>                               unpin_user_page(&folio->page);
>                       }
>
> -                     pages[i] = migrate_device_page(&folio->page, gup_flags);
> -                     if (!pages[i]) {
> -                             ret = -EBUSY;
> +                     ret = migrate_device_coherent_page(&folio->page);
> +                     if (ret)
>                               goto unpin_pages;
> -                     }
> +                     /* The reference to our folio is stale now. */
> +                     prev_folio = NULL;
> +                     folio = NULL;
>                       continue;
>               }
>
> @@ -1953,7 +1950,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>                                   folio_nr_pages(folio));
>       }
>
> -     if (!list_empty(&movable_page_list) || isolation_error_count)
> +     if (!list_empty(&movable_page_list) || isolation_error_count ||
> +         any_device_coherent)
>               goto unpin_pages;
>
>       /*
> @@ -1963,14 +1961,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>       return nr_pages;
>
>  unpin_pages:
> -     for (i = 0; i < nr_pages; i++) {
> -             if (!pages[i])
> -                     continue;
> +     /* We have to be careful if we stumbled over device coherent pages. */
> +     if (unlikely(any_device_coherent || !(gup_flags & FOLL_PIN))) {
> +             for (i = 0; i < nr_pages; i++) {
> +                     if (!pages[i])
> +                             continue;
>
> -             if (gup_flags & FOLL_PIN)
> -                     unpin_user_page(pages[i]);
> -             else
> -                     put_page(pages[i]);
> +                     if (gup_flags & FOLL_PIN)
> +                             unpin_user_page(pages[i]);
> +                     else
> +                             put_page(pages[i]);
> +             }
> +     } else {
> +             unpin_user_pages(pages, nr_pages);
>       }
>
>       if (!list_empty(&movable_page_list)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index eeab4ee7a4a3..899dab512c5a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -853,7 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>                     unsigned long addr, int page_nid, int *flags);
>
>  void free_zone_device_page(struct page *page);
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags);
> +int migrate_device_coherent_page(struct page *page);
>
>  /*
>   * mm/gup.c
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 5decd26dd551..dfb78ea3d326 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -797,53 +797,40 @@ EXPORT_SYMBOL(migrate_vma_finalize);
>
>  /*
>   * Migrate a device coherent page back to normal memory.  The caller should have
> - * a reference on page which will be copied to the new page if migration is
> - * successful or dropped on failure.
> + * a reference on page, which will be dropped on return.
>   */
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
> +int migrate_device_coherent_page(struct page *page)
>  {
>       unsigned long src_pfn, dst_pfn = 0;
> -     struct migrate_vma args;
> +     struct migrate_vma args = {
> +             .src = &src_pfn,
> +             .dst = &dst_pfn,
> +             .cpages = 1,
> +             .npages = 1,
> +             .vma = NULL,
> +     };
>       struct page *dpage;
>
> +     VM_WARN_ON_ONCE(PageCompound(page));
> +
>       lock_page(page);
>       src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> -     args.src = &src_pfn;
> -     args.dst = &dst_pfn;
> -     args.cpages = 1;
> -     args.npages = 1;
> -     args.vma = NULL;
> -     migrate_vma_setup(&args);
> -     if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> -             return NULL;
> -
> -     dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> -
> -     /*
> -      * get/pin the new page now so we don't have to retry gup after
> -      * migrating. We already have a reference so this should never fail.
> -      */
> -     if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> -             __free_pages(dpage, 0);
> -             dpage = NULL;
> -     }
>
> -     if (dpage) {
> -             lock_page(dpage);
> -             dst_pfn = migrate_pfn(page_to_pfn(dpage));
> +     migrate_vma_setup(&args);
> +     if (src_pfn & MIGRATE_PFN_MIGRATE) {
> +             dpage = alloc_page(GFP_USER | __GFP_NOWARN);
> +             if (dpage) {
> +                     dst_pfn = migrate_pfn(page_to_pfn(dpage));
> +                     lock_page(dpage);
> +             }
>       }
>
>       migrate_vma_pages(&args);
>       if (src_pfn & MIGRATE_PFN_MIGRATE)
>               copy_highpage(dpage, page);
>       migrate_vma_finalize(&args);
> -     if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> -             if (gup_flags & FOLL_PIN)
> -                     unpin_user_page(dpage);
> -             else
> -                     put_page(dpage);
> -             dpage = NULL;
> -     }
>
> -     return dpage;
> +     if (src_pfn & MIGRATE_PFN_MIGRATE)
> +             return 0;
> +     return -EBUSY;
>  }
> --
> 2.35.3
