Hello,

On Thu, Oct 13, 2016 at 09:30:49AM +0100, Dr. David Alan Gilbert wrote:
> I think it should, or at least I think all other kernel things end up being
> caught by userfaultfd during postcopy.
Yes indeed, it will work. vhost blocks in its own task context inside the kernel, and the vmsave/postcopy live snapshotting thread will get woken up, will copy the page off to a private snapshot buffer, and will then mark the memory writable again and wake up the vhost thread at the same time.

The other showstopper limitation of mprotect is that you'd run out of vmas, so mprotect would eventually fail on large virtual machines. That problem doesn't exist with userfaultfd WP. Unfortunately there seems to be some problem with userfaultfd WP and KVM, but it is reported to work for regular userland memory. I haven't gotten around to solving that yet, but it's work in progress...

I plan to finish making userfaultfd WP fully accurate with special user bits in the pagetables, so there are no false positives. The problem is that when we swap out a WP page, we would be forced to mark it readonly during swapin if we didn't store the WP information in the swap entry, so I'm now saving the WP information in the swap entry. This will also prevent false positive WP userfaults after fork() runs (fork marks the pagetables readonly, so without a user bit marking which pagetables are write protected we wouldn't know whether we have to fault or not).

The other advantage is that you can snapshot at 4k granularity by selectively splitting THPs. The granularity of the snapshot process is decided by userland: you decide whether to copy 4k or 2m and, depending on that, you will unwrprotect 4k or 2m; the kernel will split the THP if it is one and you unprotect only 4k of it. With userfaultfd it's always userland (not the kernel) deciding the granularity of the fault. Live snapshotting, async vmsave and redis snapshotting all need the same thing, and they're doing the same thing with uffd WP. And you most certainly want to do faults at 4k granularity and let khugepaged rebuild the split THP later on, or you'd run into the same corner case redis ran into with THP, because THP COWs at 2M granularity by default. It's faster and takes less memory to copy and unwrprotect only 4k.

Another positive aspect of uffd WP is that you decide the maximum amount of "buffer" memory you are ok to use. If you set that "buffer" to the size of the guest, it'll behave like fork(), so you will risk 100% higher memory utilization in the worst case. With fork() you are forced to have 100% of the VM size free for the snapshotting/vmsaving to succeed; with userfaultfd you can decide and configure it. Once the limit is hit, vmsave will simply behave synchronously: you will have to wait for the write() to disk to complete to free up one buffer page before you can copy new data off the guest into the buffer and then wake up the tasks stuck in the kernel page fault waiting for a wakeup from uffd.

I would suggest not implementing mprotect+SIGSEGV, because maintaining both APIs would be messy, but mostly because mprotect cannot really work for all cases and would risk failing at any time with -ENOMEM. Postcopy live migration had similar issues, and this is why it wasn't possible to achieve it reliably without userfaultfd. In addition, userfaultfd is much faster too: no signals, no userland interprocess communication through a pipe/unix socket, no return to userland for the task that hits the fault, a schedule-in-kernel to block (which is cheaper and won't force mm tlbflushes), direct in-kernel communication between the task that hits the fault and the async vmsave thread (which can wait on epoll or anything), etc.
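To make the above concrete, here is a minimal userland sketch of the snapshot thread loop (register WP, catch the WP fault, copy the 4k off to a buffer, unwrprotect and wake). It's written against the uffd WP ioctls of the work-in-progress linux/userfaultfd.h bits (UFFDIO_REGISTER_MODE_WP, UFFDIO_WRITEPROTECT, UFFD_PAGEFAULT_FLAG_WP), so treat the exact constants as an assumption until the WP mode is finalized; save_page() is just a placeholder for whatever pushes the buffer page to disk, and error handling is mostly omitted.

/* Sketch only: the uffd WP ioctl layout is still subject to change. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define SNAP_PAGE_SIZE 4096UL

static int wp_snapshot(void *guest_ram, uint64_t len,
                       int (*save_page)(uint64_t addr, const void *copy))
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)guest_ram, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    struct uffdio_writeprotect wp = {
        .range = reg.range,
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* wrprotect the whole range */
    };
    struct uffd_msg msg;
    char copy[SNAP_PAGE_SIZE];

    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) ||
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
        return -1;

    /* snapshot thread loop: the faulting task sleeps in the kernel meanwhile */
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        uint64_t addr;

        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        /* copy the old 4k off to the private snapshot buffer first */
        addr = msg.arg.pagefault.address & ~(SNAP_PAGE_SIZE - 1);
        memcpy(copy, (void *)(uintptr_t)addr, SNAP_PAGE_SIZE);
        save_page(addr, copy);

        /* unwrprotect just this 4k; no DONTWAKE, so the blocked task is woken */
        wp.range.start = addr;
        wp.range.len   = SNAP_PAGE_SIZE;
        wp.mode        = 0;
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
            return -1;
    }
    return 0;
}

The point the sketch shows is that the copy happens before the unwrprotect+wakeup, and that both the fault granularity (4k here) and the amount of buffer memory are entirely a userland decision.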
I'll look into fixing userfaultfd for the already implemented postcopy live snapshotting ASAP; I've got a bugreport pending, but until the WP full accuracy work is completed (to obsolete soft-dirty for good) the current status of userfaultfd WP isn't complete anyway.

And about future uffd features for other usages: once it works reliably and with full accuracy, one more possible feature would be to add a userfaultfd async queue model too, where tasks won't need to block and uffd msgs will be allocated and delivered to userland asynchronously. That would obsolete soft-dirty and reduce the computational complexity as well, because you wouldn't need to scan a worst case of 4TiB/4KiB pagetable entries to know which pages have been modified during checkpoint/restore. Instead it'd provide the info with the same efficiency that PML provides in HW on Intel for guests (uffd async WP would work for the host and without HW support).

Of course at the next pass you'd then also need to wrprotect only those regions that have been modified and not the whole range, and only userland knows which ranges need to be wrprotected again. A vectored API would be quite nice for such selective wrprotection, to reduce the number of uffd ioctls to issue too; at the moment it's not vectored (so the re-protection pass looks like the loop sketched below), but adding a vectored API will be only a simple interface issue once the inner workings of uffd WP are solved.

On a side note about vectored APIs, we'd need a vectored madvise too, to optimize qemu in postcopy live snapshotting when it does the MADV_DONTNEED to zap the last redirtied pages where we need to trigger userfaults on them; that was already proposed upstream even though it wasn't merged yet. Its target was jemalloc, but we need it for postcopy live migration as well.
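As a sketch of that re-protection pass (same API assumption and headers as the snippet above), with the current non-vectored interface every modified range costs one UFFDIO_WRITEPROTECT ioctl; a vectored call would collapse the loop below into a single syscall. struct dirty_range is just hypothetical userland bookkeeping.

struct dirty_range {
    uint64_t start;
    uint64_t len;
};

static int rearm_wp(int uffd, const struct dirty_range *dirty, size_t nr)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        struct uffdio_writeprotect wp = {
            .range = { .start = dirty[i].start, .len = dirty[i].len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        /* one ioctl per modified range: this is what a vectored API would avoid */
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
            return -1;
    }
    return 0;
}

Thanks,
Andrea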