Hey,
I work for TransIP and we host a VPS platform based on QEMU/KVM. We are
currently running QEMU 2.4.0. A few days ago we noticed that live
migrations for some of our VMs would fail. Further investigation showed
that the failures were specific to Windows Server 2012 and were caused
by the standard Red Hat virtio driver being replaced via Windows Update
with a driver called "Midfin eFabric" (this driver doesn't really seem
to be meant for virtio; we have a case open with Microsoft about that).
Once we knew how to reproduce the issue we tested QEMU 2.6.0 as well,
and it is also affected (later we found out that migrating from 2.4.0
to 2.6.0 does work, probably by pure luck).
We started investigating the problem in QEMU 2.4.0 and found that it is
caused by virtio_net_device_load requiring certain feature flags to be
set: curr_guest_offloads is only written and read if the
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS flag is set, but virtio_load only sets
the feature flags after the call to virtio_net_device_load. Moving the
code that sets the feature flags before the call to
virtio_net_device_load fixes this, but it introduces another problem:
virtio can have 64-bit feature flags, while the standard virtio save
payload only has room for 32 bits. That was solved by storing the upper
bits in a subsection of the vmstate_save_state data, which
unfortunately is loaded (and thus located in the binary stream) after
the virtio_net_device_load data.
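
To make the ordering concrete, here is a minimal standalone sketch of
the stream layout as we understand it (the Stream type, the helpers
and the values are made up for illustration; this is not the actual
QEMU code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define F_CTRL_GUEST_OFFLOADS (1ull << 34) /* a bit above the low 32 */

typedef struct { uint8_t buf[64]; size_t pos; } Stream;

static void put_u64(Stream *s, uint64_t v)
{
    memcpy(s->buf + s->pos, &v, 8);
    s->pos += 8;
}

static uint64_t get_u64(Stream *s)
{
    uint64_t v;
    memcpy(&v, s->buf + s->pos, 8);
    s->pos += 8;
    return v;
}

int main(void)
{
    uint64_t features = F_CTRL_GUEST_OFFLOADS;
    Stream s = {0};

    /* Save side: the gated device field is written before the
     * subsection that carries the high 32 feature bits. */
    put_u64(&s, (uint32_t)features);        /* low 32 feature bits */
    if (features & F_CTRL_GUEST_OFFLOADS)
        put_u64(&s, 0x1234);                /* curr_guest_offloads */
    put_u64(&s, features >> 32);            /* "subsection": high bits */

    /* Load side, as in 2.4: the gate is evaluated before the high
     * feature bits have been read, so it is false, the field is
     * skipped, and the rest of the stream is desynchronized. */
    s.pos = 0;
    uint64_t loaded = get_u64(&s);          /* low bits only */
    if (loaded & F_CTRL_GUEST_OFFLOADS)     /* false here */
        get_u64(&s);
    loaded |= get_u64(&s) << 32;            /* actually reads the
                                               offloads value */
    printf("high feature bits: got 0x%llx, expected 0x%llx\n",
           (unsigned long long)(loaded >> 32),
           (unsigned long long)(features >> 32));
    return 0;
}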
There was an attempt to fix this in QEMU 2.6.0, but it seems to have
made things worse. The write code (virtio_net_save, virtio_save and
virtio_net_save_device) still puts the curr_guest_offloads value before
the vmstate_save_state data, while the read code expects it after the
vmstate_save_state data and tries to read it there. Should we simply
change the virtio_net_save code as well, so that it follows the same
order as virtio_net_load? Or would that potentially break more things?
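
Concretely, the reordering we have in mind would behave like this
variation on the sketch above (same made-up Stream helpers; only the
save/load order changes, so the gate is always evaluated with all 64
feature bits known):

    uint64_t features = F_CTRL_GUEST_OFFLOADS;
    Stream s = {0};

    /* Save side, reordered: the subsection with the high feature
     * bits comes first, the gated field afterwards. */
    put_u64(&s, (uint32_t)features);
    put_u64(&s, features >> 32);
    if (features & F_CTRL_GUEST_OFFLOADS)
        put_u64(&s, 0x1234);                /* curr_guest_offloads */

    /* Load side: the gate is now evaluated with the full value, and
     * both ends agree on the layout again. */
    s.pos = 0;
    uint64_t loaded = get_u64(&s);
    loaded |= get_u64(&s) << 32;
    if (loaded & F_CTRL_GUEST_OFFLOADS)
        printf("curr_guest_offloads = 0x%llx\n",
               (unsigned long long)get_u64(&s));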
Regards,
Robin Geuze
TransIP BV