On Tue, 16 Jun 2015 17:41:03 +0300 Andrey Korolyov <and...@xdel.ru> wrote:
> > Answering back to myself - I made a wrong statement before, the
> > physical mappings *are* different in the different cases, of course!
> > Therefore, the issue looks much simpler and I'd have a patch within a
> > couple of days if nobody fixes this earlier.
> >
> 
> ... and another (possibly last) update. This is not a memory
> misplacement but a quirky race - if no internal workload is applied to
> the virtual machine during migration, no harm is done - after its
> completion the VM passes all imaginable tests as well. If no device
> I/O is involved (perf bench numa for 1000s), the guest does not fall
> over with its guts out as in the disk-test case, it just crashes a
> process instead:
> 
> [ 784.613032] thread 0/0[2750]: segfault at 0 ip (null) sp
> 00007fda4ea6a940 error 14 in perf_3.16[400000+144000]
> 
> I think we are facing a very interesting memory access race during
> live migration, but there is no visible reason for it to be bound
> only to the runtime-plugged memory case. All the cases where either
> userspace or a kernel driver is involved show null as the instruction
> pointer in the trace; maybe this can be a hint for someone.

I've checked the logs; so far I don't see anything suspicious there except
the "acpi PNP0C80:00: Already enumerated" lines. Raising the log level
might show more info. Could you also:
 * upload the full logs
 * enable ACPI debug info so that the dimm device's _CRS shows up
 * provide the QEMU CLI that was used to produce such a log wrt migration,
   i.e. the exact CLI args on source and destination along with the
   intermediate mem hotplug commands used - or, even better, check whether
   it reproduces with migration of cold-plugged dimms, for simplification
 * provide the steps to reproduce (and the guest kernel versions)
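
For the cold-plugged case I mean something roughly along these lines
(just a sketch - the sizes, ids, host name and port below are
placeholders, not your actual configuration, so please substitute your
real CLI):

  # source host
  qemu-system-x86_64 -enable-kvm -m 2G,slots=4,maxmem=8G \
      -object memory-backend-ram,id=mem1,size=1G \
      -device pc-dimm,id=dimm1,memdev=mem1 \
      ... rest of your usual CLI ...

  # destination host: same CLI plus
      -incoming tcp:0:4444

  # then on the source monitor
  (qemu) migrate tcp:DEST_HOST:4444

For the hot-plug variant the dimm would instead be added at runtime from
the monitor before starting migration, e.g.
"object_add memory-backend-ram,id=mem1,size=1G" followed by
"device_add pc-dimm,id=dimm1,memdev=mem1".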