Hello,

On Wed, Apr 26, 2017 at 08:04:43PM +0100, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntrae...@de.ibm.com) wrote:
> > On 04/26/2017 08:37 PM, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
> > >
> > > When an all-zero page is received during the precopy
> > > phase of a postcopy-enabled migration we must force
> > > allocation otherwise accesses to the page will still
> > > get blocked by userfault.
> > >
> > > Symptom:
> > > a) If the page is accessed by a device during device-load
> > > then we get a deadlock as the source finishes sending
> > > all its pages but the destination device-load is still
> > > paused and so doesn't clean up.
> > >
> > > b) If the page is accessed later, then the thread will stay
> > > paused until the end of migration rather than carrying on
> > > running, until we release userfault at the end.
> > >
> > > Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
> > > Reported-by: Christian Borntraeger <borntrae...@de.ibm.com>
> >
> > CC stable? after all the guest hangs on both sides
> >
> > Has survived 40 migrations (usually failed at the 2nd)
> > Tested-by: Christian Borntraeger <borntrae...@de.ibm.com>
>
> Great...but.....
> Andrea (added to the mail) says this shouldn't be necessary.
> The read we were doing in the is_zero_range() should have been sufficient
> to get the page mapped and that zero page should have survived.
>
> So - I guess that's back a step, we need to figure out why the
> page disappears for you.

Yes, reading during precopy is enough to fill the hole and prevent
missing-mode userfaults from triggering. One way or another, the pagetable
must end up mapped by a zeropage, a hugezeropage, a regular page allocated
during a previous precopy pass, or a pre-zeroed subpage that is part of a
THP. Even if the hugezeropage is later split by a MADV_DONTNEED when
postcopy starts, the pieces will become 4k zeropages.

After a read succeeds, nothing (except MADV_DONTNEED or other syscalls
that qemu would have to invoke explicitly between is_zero_range and
UFFDIO_REGISTER) should be able to bring the pagetable back to the
"pte_none/pmd_none" state that would later trigger missing userfaults
during postcopy.

Thanks,
Andrea
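
To illustrate the sequence described above, here is a minimal userspace
sketch (not qemu code). It assumes Linux, and it assumes that mincore() on a
private anonymous mapping reports whether a pagetable entry is present, which
is used here only as an observation aid. range_is_zero() is a hypothetical
stand-in for is_zero_range(), and demo_read_then_dontneed() is a made-up
helper name. No userfaultfd is registered; the sketch only shows a read
filling the hole and an explicit MADV_DONTNEED emptying it again.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for is_zero_range(): reads every byte of the range. */
static bool range_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Returns 1 if mincore() reports the page as resident, 0 if not, -1 on error.
 * Assumption: for private anonymous memory on Linux this tracks whether the
 * pagetable entry is populated (including by the zeropage).
 */
static int page_is_mapped(void *addr, size_t page_size)
{
    unsigned char vec = 0;
    if (mincore(addr, page_size, &vec) != 0) {
        return -1;
    }
    return vec & 1;
}

/*
 * Records the mincore() view at three points:
 *   states[0]: fresh mapping, never touched (the "hole")
 *   states[1]: after a read-only scan of the page
 *   states[2]: after an explicit MADV_DONTNEED
 * Returns 0 if the page read as zero both times and madvise() succeeded.
 */
static int demo_read_then_dontneed(int states[3])
{
    assert(states != NULL);

    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) {
        return -1;
    }
    size_t page_size = (size_t)ps;

    unsigned char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return -1;
    }

    states[0] = page_is_mapped(page, page_size);

    /* The read fault is what fills the hole; no write is needed. */
    bool zero = range_is_zero(page, page_size);
    states[1] = page_is_mapped(page, page_size);

    /* The explicit syscall that can empty the pagetable entry again. */
    int rc = madvise(page, page_size, MADV_DONTNEED);
    states[2] = page_is_mapped(page, page_size);

    /* Touching the page again faults it back in; it still reads as zero. */
    bool still_zero = range_is_zero(page, page_size);

    munmap(page, page_size);
    return (zero && rc == 0 && still_zero) ? 0 : -1;
}
```

Calling demo_read_then_dontneed() with a three-element int array and printing
the entries shows how the mapping state moves across the read and the
MADV_DONTNEED. The exact values depend on the kernel and on whether THP is in
use, so treat them as an observation rather than a guarantee.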