On Thu, Apr 17, 2025 at 01:05:37PM -0300, Fabiano Rosas wrote:
> It's not that page faults happen during multifd. The page was already
> sent during precopy, but multifd-recv didn't write to it, it just marked
> the receivedmap. When postcopy starts, the page gets accessed and
> faults. Since postcopy is on, the migration wants to request the page
> from the source, but it's present in the receivedmap, so it doesn't
> ask. No page ever comes and the code hangs waiting for the page fault to
> be serviced (or potentially faults continuously? I'm not sure on the
> details).

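To make the sequence described above concrete, here is a toy model of the
destination side. It is only a sketch and not QEMU code: the class, its
methods and the write_zero_pages flag are all hypothetical names invented
for illustration, and it assumes the hang comes from a page being marked
in the receivedmap without ever being written.

```python
class ToyDestination:
    """Toy model of the migration destination; hypothetical, not QEMU code."""

    def __init__(self, num_pages, write_zero_pages=False):
        self.received = [False] * num_pages   # stands in for the receivedmap
        self.populated = [False] * num_pages  # page was actually written
        self.write_zero_pages = write_zero_pages
        self.requests = []                    # pages requested from the source

    def precopy_recv_zero_page(self, page):
        # Receive path for a zero page during precopy: always mark it as
        # received, but only write the page if the optimization is off.
        self.received[page] = True
        if self.write_zero_pages:
            self.populated[page] = True

    def postcopy_fault(self, page):
        """Return True if the access can be serviced, False if it would hang."""
        if self.populated[page]:
            return True   # page is there, nothing to wait for
        if self.received[page]:
            return False  # marked received, so no request is sent: hang
        # Normal postcopy path: ask the source, which then sends the page.
        self.requests.append(page)
        self.received[page] = True
        self.populated[page] = True
        return True


if __name__ == "__main__":
    dest = ToyDestination(4)
    dest.precopy_recv_zero_page(2)
    print("fault on page 2 serviced:", dest.postcopy_fault(2))
    print("fault on page 3 serviced:", dest.postcopy_fault(3))
```

In this model the access to page 2 is never serviced, because the page is
marked received but was never written, while page 3 takes the normal
request path. Constructing the object with write_zero_pages=True, i.e.
not skipping the write, makes the same access succeed.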
I think your previous analysis is correct on the zero pages. I am not
100% sure if that's the issue but very likely.

I tend to also agree with you that we could skip zero page optimization
in multifd code when postcopy is enabled (maybe plus some comment right
above..).

Thanks,

-- 
Peter Xu