On Mon, Oct 9, 2023 at 5:57 PM Lokesh Gidra <lokeshgi...@google.com> wrote:
>
> On Mon, Oct 9, 2023 at 9:29 AM Lokesh Gidra <lokeshgi...@google.com> wrote:
> >
> > On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand <da...@redhat.com> wrote:
> > >
> > > On 09.10.23 18:21, Suren Baghdasaryan wrote:
> > > > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand <da...@redhat.com> 
> > > > wrote:
> > > >>
> > > >> On 09.10.23 08:42, Suren Baghdasaryan wrote:
> > > >>> From: Andrea Arcangeli <aarca...@redhat.com>
> > > >>>
> > > >>> Implement the uABI of UFFDIO_MOVE ioctl.
> > > >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> > > >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages 
> > > >>> are
> > > >>> available (in userspace) for recycling, as is usually the case in heap
> > > >>> compaction algorithms, then we can avoid the page allocation and 
> > > >>> memcpy
> > > >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the
> > > >>> userspace, we avoid the need to release (via madvise) the pages back 
> > > >>> to
> > > >>> the kernel [2].
> > > >>> We see over 40% reduction (on a Google pixel 6 device) in the 
> > > >>> compacting
> > > >>> thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This 
> > > >>> was
> > > >>> measured using a benchmark that emulates a heap compaction 
> > > >>> implementation
> > > >>> using userfaultfd (to allow concurrent accesses by application 
> > > >>> threads).
> > > >>> More details of the usecase are explained in [2].
> > > >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> > > >>> touching them within the same vma. Today, it can only be done by 
> > > >>> mremap,
> > > >>> however it forces splitting the vma.
> > > >>>
> > > >>> [1] 
> > > >>> https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarca...@redhat.com/
> > > >>> [2] 
> > > >>> https://lore.kernel.org/linux-mm/ca+eeso4uo84ssnbharh4hvlnhauq5nzknkxqxrcyjninvjp...@mail.gmail.com/
> > > >>>
> > > >>> Update for the ioctl_userfaultfd(2)  manpage:
> > > >>>
> > > >>>      UFFDIO_MOVE
> > > >>>          (Since Linux xxx)  Move a continuous memory chunk into the
> > > >>>          userfault registered range and optionally wake up the blocked
> > > >>>          thread. The source and destination addresses and the number 
> > > >>> of
> > > >>>          bytes to move are specified by the src, dst, and len fields 
> > > >>> of
> > > >>>          the uffdio_move structure pointed to by argp:
> > > >>>
> > > >>>              struct uffdio_move {
> > > >>>                  __u64 dst;    /* Destination of move */
> > > >>>                  __u64 src;    /* Source of move */
> > > >>>                  __u64 len;    /* Number of bytes to move */
> > > >>>                  __u64 mode;   /* Flags controlling behavior of move 
> > > >>> */
> > > >>>                  __s64 move;   /* Number of bytes moved, or negated 
> > > >>> error */
> > > >>>              };
> > > >>>
> > > >>>          The following value may be bitwise ORed in mode to change the
> > > >>>          behavior of the UFFDIO_MOVE operation:
> > > >>>
> > > >>>          UFFDIO_MOVE_MODE_DONTWAKE
> > > >>>                 Do not wake up the thread that waits for page-fault
> > > >>>                 resolution
> > > >>>
> > > >>>          UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> > > >>>                 Allow holes in the source virtual range that is being 
> > > >>> moved.
> > > >>>                 When not specified, the holes will result in ENOENT 
> > > >>> error.
> > > >>>                 When specified, the holes will be accounted as 
> > > >>> successfully
> > > >>>                 moved memory. This is mostly useful to move hugepage 
> > > >>> aligned
> > > >>>                 virtual regions without knowing if there are 
> > > >>> transparent
> > > >>>                 hugepages in the regions or not, but preventing the 
> > > >>> risk of
> > > >>>                 having to split the hugepage during the operation.
> > > >>>
> > > >>>          The move field is used by the kernel to return the number of
> > > >>>          bytes that was actually moved, or an error (a negated errno-
> > > >>>          style value).  If the value returned in move doesn't match 
> > > >>> the
> > > >>>          value that was specified in len, the operation fails with the
> > > >>>          error EAGAIN.  The move field is output-only; it is not read 
> > > >>> by
> > > >>>          the UFFDIO_MOVE operation.
> > > >>>
> > > >>>          The operation may fail for various reasons. Usually, 
> > > >>> remapping of
> > > >>>          pages that are not exclusive to the given process fail; once 
> > > >>> KSM
> > > >>>          might deduplicate pages or fork() COW-shares pages during 
> > > >>> fork()
> > > >>>          with child processes, they are no longer exclusive. Further, 
> > > >>> the
> > > >>>          kernel might only perform lightweight checks for detecting 
> > > >>> whether
> > > >>>          the pages are exclusive, and return -EBUSY in case that 
> > > >>> check fails.
> > > >>>          To make the operation more likely to succeed, KSM should be
> > > >>>          disabled, fork() should be avoided or MADV_DONTFORK should be
> > > >>>          configured for the source VMA before fork().
> > > >>>
> > > >>>          This ioctl(2) operation returns 0 on success.  In this case, 
> > > >>> the
> > > >>>          entire area was moved.  On error, -1 is returned and errno is
> > > >>>          set to indicate the error.  Possible errors include:
> > > >>>
> > > >>>          EAGAIN The number of bytes moved (i.e., the value returned in
> > > >>>                 the move field) does not equal the value that was
> > > >>>                 specified in the len field.
> > > >>>
> > > >>>          EINVAL Either dst or len was not a multiple of the system 
> > > >>> page
> > > >>>                 size, or the range specified by src and len or dst 
> > > >>> and len
> > > >>>                 was invalid.
> > > >>>
> > > >>>          EINVAL An invalid bit was specified in the mode field.
> > > >>>
> > > >>>          ENOENT
> > > >>>                 The source virtual memory range has unmapped holes and
> > > >>>                 UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
> > > >>>
> > > >>>          EEXIST
> > > >>>                 The destination virtual memory range is fully or 
> > > >>> partially
> > > >>>                 mapped.
> > > >>>
> > > >>>          EBUSY
> > > >>>                 The pages in the source virtual memory range are not
> > > >>>                 exclusive to the process. The kernel might only 
> > > >>> perform
> > > >>>                 lightweight checks for detecting whether the pages are
> > > >>>                 exclusive. To make the operation more likely to 
> > > >>> succeed,
> > > >>>                 KSM should be disabled, fork() should be avoided or
> > > >>>                 MADV_DONTFORK should be configured for the source 
> > > >>> virtual
> > > >>>                 memory area before fork().
> > > >>>
> > > >>>          ENOMEM Allocating memory needed for the operation failed.
> > > >>>
> > > >>>          ESRCH
> > > >>>                 The faulting process has exited at the time of a
> > > >>>                 UFFDIO_MOVE operation.
> > > >>>
> > > >>
> > > >> A general comment simply because I realized that just now: does 
> > > >> anything
> > > >> speak against limiting the operations now to a single MM?
> > > >>
> > > >> The use cases I heard so far don't need it. If ever required, we could
> > > >> consider extending it.
> > > >>
> > > >> Let's reduce complexity and KIS unless really required.
> > > >
> > > > Let me check if there are use cases that require moves between MMs.
> > > > Andrea seems to have put considerable effort to make it work between
> > > > MMs and it would be a pity to lose that. I can send a follow-up patch
> > > > to recover that functionality and even if it does not get merged, it
> > > > can be used in the future as a reference. But first let me check if we
> > > > can drop it.
> >
> > For the compaction use case that we have it's fine to limit it to
> > single MM. However, for general use I think Peter will have a better
> > idea.
> > >
> > > Yes, that sounds reasonable. Unless the big important use cases requires
> > > moving pages between processes, let's leave that as future work for now.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
>
> While going through mremap's move_page_tables code, which is pretty
> similar to what we do here, I noticed that cache is flushed as well,
> whereas we are not doing that here. Is that OK? I'm not a MM expert by
> any means, so it's a question rather than a comment :)

Good question. I'll have to look closer into it. Unfortunately I'll be
travelling starting tomorrow and be back next week. Will try my best
to answer questions in a timely manner but depends on my connection
and availability.
Thanks!

Reply via email to