On Thu, Mar 26, 2015 at 11:29:43AM +0100, Juan Quintela wrote:
> Wen Congyang <we...@cn.fujitsu.com> wrote:
> > On 03/25/2015 05:50 PM, Juan Quintela wrote:
> >> zhanghailiang <zhang.zhanghaili...@huawei.com> wrote:
> >>> Hi all,
> >>>
> >>> We found that, sometimes, the content of VM's memory is
> >>> inconsistent between Source side and Destination side
> >>> when we check it just after finishing migration but before VM continue to
> >>> Run.
> >>>
> >>> We use a patch like bellow to find this issue, you can find it from affix,
> >>> and Steps to reprduce:
> >>>
> >>> (1) Compile QEMU:
> >>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
> >>>
> >>> (2) Command and output:
> >>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0-device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio
> >>
> >> Could you try to reproduce:
> >> - without vhost
> >> - without virtio-net
> >> - cache=unsafe is going to give you trouble, but trouble should only
> >> happen after migration of pages have finished.
> >
> > If I use ide disk, it doesn't happen.
> > Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
> > it is because I migrate the guest when it is booting. The virtio net
> > device is not used in this case.
>
> Kevin, Stefan, Michael, any great idea?
>
> Thanks, Juan.

If this is during boot from disk, we can more or less rule out virtio-net/vhost-net.

> >
> > Thanks
> > Wen Congyang
> >
> >>
> >> What kind of load were you having when reproducing this issue?
> >> Just to confirm, you have been able to reproduce this without COLO
> >> patches, right?
> >>
> >>> (qemu) migrate tcp:192.168.3.8:3004
> >>> before saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>> md_host : after saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>>
> >>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio -incoming tcp:0:3004
> >>> (qemu) QEMU_VM_SECTION_END, after loading ram
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after loading all vmstate
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after cpu_synchronize_all_post_init
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>>
> >>> This happens occasionally, and it is more easy to reproduce when
> >>> issue migration command during VM's startup time.
> >>
> >> OK, a couple of things. Memory don't have to be exactly identical.
> >> Virtio devices in particular do funny things on "post-load". There
> >> aren't warantees for that as far as I know, we should end with an
> >> equivalent device state in memory.
> >>
> >>> We have done further test and found that some pages has been
> >>> dirtied but its corresponding migration_bitmap is not set.
> >>> We can't figure out which modules of QEMU has missed setting bitmap
> >>> when dirty page of VM,
> >>> it is very difficult for us to trace all the actions of dirtying VM's
> >>> pages.
> >>
> >> This seems to point to a bug in one of the devices.
> >>
> >>> Actually, the first time we found this problem was in the COLO FT
> >>> development, and it triggered some strange issues in
> >>> VM which all pointed to the issue of inconsistent of VM's
> >>> memory. (We have try to save all memory of VM to slave side every
> >>> time
> >>> when do checkpoint in COLO FT, and everything will be OK.)
> >>>
> >>> Is it OK for some pages that not transferred to destination when do
> >>> migration ? Or is it a bug?
> >>
> >> Pages transferred should be the same, after device state transmission is
> >> when things could change.
> >>
> >>> This issue has blocked our COLO development... :(
> >>>
> >>> Any help will be greatly appreciated!
> >>
> >> Later, Juan.
> >>