On Tue, Aug 18, 2015 at 5:51 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> "Fixed" with a cherry-pick of 7a72f7a140bfd3a5dae73088947010bfdbcf6a40
> and its predecessor 7103f60de8bed21a0ad5d15d2ad5b7a333dda201. Of course
> this is not a real fix, as the race precondition is only shifted or
> made to disappear, not addressed directly. Though there are not too
> many hotplug users around, I hope this information will be useful for
> those who hit the same issue in the next year or so, until 3.18+ is
> stable enough for the hypervisor kernel role. Any suggestions on
> further debugging or re-exposing the race are of course very welcome.
>
> CCing kvm@ as it looks like a hypervisor subsystem issue. The entire
> discussion can be found at
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg03117.html
So no, the issue is still there, though the appearance rate is lower. What may be interesting is that non-SMP guests are affected as well; before that I suspected the vCPUs were being resumed in a racy manner, triggering the memory corruption. Also, the chance of hitting the problem grows at least faster than linearly with the number of plugged DIMMs: at 8G total it is now almost impossible to catch the issue (which is better than the state of things at the beginning of this thread), while at 16G total it reproduces at a fairly high rate under active memory operations.

Migrating the suspended VM results in the same corruption, so a core analysis could very likely reveal the root of the issue. The problem is that I have zero clues about what exactly could be wrong there and how it could depend on the machine size, if we leave race conditions out of view.
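For reference, the general shape of the setup I am talking about is a guest started with hotplug slots reserved and DIMMs plugged at runtime while memory-intensive work runs inside; a minimal sketch is below. The sizes, slot count and ids are placeholders of my own, not the exact values from the earlier thread:

  # start the guest with room reserved for hotpluggable memory
  qemu-system-x86_64 -enable-kvm -m 4G,slots=8,maxmem=16G ...

  # then plug DIMMs one by one from the monitor
  (qemu) object_add memory-backend-ram,id=mem1,size=2G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

The corruption shows up (or not) depending on how much memory ends up plugged this way, as described above.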