Hello,

On Thu, Oct 13, 2016 at 09:30:49AM +0100, Dr. David Alan Gilbert wrote:
> I think it should, or at least I think all other kernel things end up being
> caught by userfaultfd during postcopy.
Yes indeed, it will work. vhost blocks in its own task context inside the kernel, and the vmsave/postcopy live snapshotting thread will get woken up, will copy the page off to a private snapshot buffer, and will then mark the memory writable again and wake up the vhost thread at the same time.

The other showstopper limitation of mprotect is that you'd run out of vmas, so mprotect would eventually fail on large virtual machines. That problem doesn't exist with userfaultfd WP. Unfortunately there seems to be some problem with userfaultfd WP and KVM, but it is reported to work for regular userland memory. I haven't gotten around to solving that yet, but it's work in progress...

I plan to finish making userfaultfd WP fully accurate with special user bits in the pagetables, so there are no false positives. The problem is that when we swap out a WP page, we would be forced to mark it readonly during swapin if we didn't store the WP information in the swap entry, so I'm now saving the WP information in the swap entry. This will also prevent false positive WP userfaults after fork() runs (fork marks the pagetables readonly, so without a user bit marking which pagetables are write protected we wouldn't know whether we have to fault or not).

The other advantage is that you can snapshot at 4k granularity by selectively splitting THPs. The granularity of the snapshot process is decided by userland: you decide whether to copy 4k or 2m and, depending on that, you will unwrprotect 4k or 2m; the kernel will split the THP if it is one and you unprotect only 4k of it. With userfaultfd it's always userland (not the kernel) deciding the granularity of the fault. Live snapshotting, async vmsave and redis snapshotting all need the same thing, and they're doing the same thing with uffd WP. And you most certainly want to do faults at 4k granularity and let khugepaged rebuild the split THP later on, or you'd run into the same corner case redis ran into with THP, because THP COWs at 2M granularity by default. It's faster and takes less memory to copy and unwrprotect only 4k.

Another positive aspect of uffd WP is that you decide the maximum amount of "buffer" memory you are ok to use. If you set that "buffer" to the size of the guest, it'll behave like fork(), so you will risk 100% higher memory utilization in the worst case. With fork() you are forced to have 100% of the VM size free for the snapshotting/vmsaving to succeed; with userfaultfd you can decide and configure it. Once the limit is hit, vmsave will simply behave synchronously: you will have to wait for the write() to disk to complete to free up one buffer page before you can copy new data off the guest into the buffer and then wake up the tasks stuck in the kernel page fault waiting for a wakeup from uffd.

I would suggest not implementing mprotect+SIGSEGV, because maintaining both APIs would be messy, but mostly because mprotect cannot really work for all cases and would risk failing at any time with -ENOMEM. Postcopy live migration had similar issues, and this is why it wasn't possible to achieve it reliably without userfaultfd. In addition, userfaultfd is much faster too: no signals, no userland interprocess communication through a pipe/unix socket, no return to userland for the task that hits the fault, a schedule-in-kernel to block (which is cheaper and won't force mm tlbflushes), direct in-kernel communication between the task that hits the fault and the async vmsave thread (which can wait on epoll or anything), etc.
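To make the above concrete, here is a minimal userland sketch of the snapshot thread loop (register WP, catch the WP fault, copy the 4k off to a buffer, unwrprotect and wake). It's written against the uffd WP ioctls of the work-in-progress linux/userfaultfd.h bits (UFFDIO_REGISTER_MODE_WP, UFFDIO_WRITEPROTECT, UFFD_PAGEFAULT_FLAG_WP), so treat the exact constants as an assumption until the WP mode is finalized; save_page() is just a placeholder for whatever pushes the buffer page to disk, and error handling is mostly omitted.

/* Sketch only: the uffd WP ioctl layout is still subject to change. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define SNAP_PAGE_SIZE 4096UL

static int wp_snapshot(void *guest_ram, uint64_t len,
                       int (*save_page)(uint64_t addr, const void *copy))
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)guest_ram, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    struct uffdio_writeprotect wp = {
        .range = reg.range,
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* wrprotect the whole range */
    };
    struct uffd_msg msg;
    char copy[SNAP_PAGE_SIZE];

    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) ||
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
        return -1;

    /* snapshot thread loop: the faulting task sleeps in the kernel meanwhile */
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        uint64_t addr;

        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        /* copy the old 4k off to the private snapshot buffer first */
        addr = msg.arg.pagefault.address & ~(SNAP_PAGE_SIZE - 1);
        memcpy(copy, (void *)(uintptr_t)addr, SNAP_PAGE_SIZE);
        save_page(addr, copy);

        /* unwrprotect just this 4k; no DONTWAKE, so the blocked task is woken */
        wp.range.start = addr;
        wp.range.len   = SNAP_PAGE_SIZE;
        wp.mode        = 0;
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
            return -1;
    }
    return 0;
}

The point the sketch shows is that the copy happens before the unwrprotect+wakeup, and that both the fault granularity (4k here) and the amount of buffer memory are entirely a userland decision.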
I'll look into fixing userfaultfd for the already implemented postcopy live snapshotting ASAP; I've got a bugreport pending, but until the WP full accuracy work is completed (to obsolete soft-dirty for good) the current status of userfaultfd WP isn't complete anyway.

And about future uffd features for other usages: once it works reliably and with full accuracy, one more possible feature would be to add a userfaultfd async queue model too, where tasks won't need to block and uffd msgs will be allocated and delivered to userland asynchronously. That would obsolete soft-dirty and reduce the computational complexity as well, because you wouldn't need to scan a worst case of 4TiB/4KiB pagetable entries to know which pages have been modified during checkpoint/restore. Instead it'd provide the info with the same efficiency that PML provides in HW on Intel for guests (uffd async WP would work for the host and without HW support).

Of course at the next pass you'd then also need to wrprotect only those regions that have been modified and not the whole range, and only userland knows which ranges need to be wrprotected again. A vectored API would be quite nice for such selective wrprotection, to reduce the number of uffd ioctls to issue too; at the moment it's not vectored (so the re-protection pass looks like the loop sketched below), but adding a vectored API will be only a simple interface issue once the inner workings of uffd WP are solved.

On a side note about vectored APIs, we'd need a vectored madvise too, to optimize qemu in postcopy live snapshotting when it does the MADV_DONTNEED to zap the last redirtied pages where we need to trigger userfaults on them; that was already proposed upstream even though it wasn't merged yet. Its target was jemalloc, but we need it for postcopy live migration as well.
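As a sketch of that re-protection pass (same API assumption and headers as the snippet above), with the current non-vectored interface every modified range costs one UFFDIO_WRITEPROTECT ioctl; a vectored call would collapse the loop below into a single syscall. struct dirty_range is just hypothetical userland bookkeeping.

struct dirty_range {
    uint64_t start;
    uint64_t len;
};

static int rearm_wp(int uffd, const struct dirty_range *dirty, size_t nr)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        struct uffdio_writeprotect wp = {
            .range = { .start = dirty[i].start, .len = dirty[i].len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        /* one ioctl per modified range: this is what a vectored API would avoid */
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
            return -1;
    }
    return 0;
}

Thanks,
Andrea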