On Fri, Nov 24, 2017 at 10:57:11AM +0800, Jason Wang wrote:
> On Nov 23, 2017 at 18:59, Stefan Hajnoczi wrote:
> > On Thu, Nov 23, 2017 at 11:37:46AM +0800, Jason Wang wrote:
> > > Guest state should not be touched if the VM is stopped. Unfortunately we
> > > didn't check the running state and tried to drain the tx queue
> > > unconditionally in virtio_net_set_status(). A crash was then noticed on
> > > the migration destination when the user typed quit after the virtqueue
> > > state was loaded but before the region cache was initialized. In this
> > > case, virtio_net_drop_tx_queue_data() tries to access the uninitialized
> > > region cache.
> > >
> > > Fix this by only dropping tx queue data when the VM is running.
> >
> > hw/virtio/virtio.c:virtio_load() does the following:
> >
> >     for (i = 0; i < num; i++) {
> >         if (vdev->vq[i].vring.desc) {
> >             uint16_t nheads;
> >
> >             /*
> >              * VIRTIO-1 devices migrate desc, used, and avail ring addresses so
> >              * only the region cache needs to be set up.  Legacy devices need
> >              * to calculate used and avail ring addresses based on the desc
> >              * address.
> >              */
> >             if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> >                 virtio_init_region_cache(vdev, i);
> >             } else {
> >                 virtio_queue_update_rings(vdev, i);
> >             }
> >
> > So the region caches should be initialized after virtqueue state is
> > loaded.
> >
> > It's unclear to me which code path triggers this issue. Can you add a
> > backtrace or an explanation?
> >
> > Thanks,
> > Stefan
>
> The migration coroutine yielded before the region cache was initialized.
> The backtrace looks like:
>
> [...]
> #16 0x0000555555b1c199 in vmstate_load_state (f=0x555556f7c010,
>     vmsd=0x5555562b8160 <vmstate_virtio>, opaque=0x555557d68610,
>     version_id=1) at migration/vmstate.c:160
> #17 0x0000555555865cc3 in virtio_load (vdev=0x555557d68610,
>     f=0x555556f7c010, version_id=11) at
>     /home/devel/git/qemu/hw/virtio/virtio.c:2110
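For anyone following along, the fix boils down to gating the drain on the
run state. A self-contained toy model of that guard is below; the names are
illustrative and this is not the actual virtio-net hunk, which carries more
surrounding context:

    /*
     * Toy model of the bug and the guard (not the real QEMU code).
     * It shows why the tx-queue drain must be gated on vm_running.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        bool vm_running;         /* models the VM run state */
        bool caches_initialized; /* models the vring region cache state */
    } ToyVirtIODevice;

    static void drop_tx_queue_data(ToyVirtIODevice *vdev)
    {
        /* In QEMU this walks the virtqueue through the region cache;
         * doing so before virtio_load() has initialized it is the crash. */
        if (!vdev->caches_initialized) {
            fprintf(stderr, "would crash: region cache not initialized\n");
            return;
        }
        printf("tx queue drained\n");
    }

    static void set_status(ToyVirtIODevice *vdev)
    {
        /* The fix: never touch guest state while the VM is stopped. */
        if (!vdev->vm_running) {
            return;
        }
        drop_tx_queue_data(vdev);
    }

    int main(void)
    {
        /* Incoming migration: vring state loaded, caches not yet set up,
         * user types "quit" -> status change while stopped. Drain skipped. */
        ToyVirtIODevice vdev = { .vm_running = false,
                                 .caches_initialized = false };
        set_status(&vdev);

        /* Normal running guest: drain proceeds safely. */
        vdev.vm_running = true;
        vdev.caches_initialized = true;
        set_status(&vdev);
        return 0;
    }

With the guard in place the status change during the stopped/loading window
becomes a no-op, and the drain only runs once the region caches exist.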
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>

Thanks for the backtrace! Your patch is fine, but I have a larger concern:
the backtrace shows that the virtio code is re-entered during savevm load.
That is probably a bad thing, because set_status() and the other APIs are
not intended to run while we are halfway through savevm load; the virtqueue
is only partially set up at that point :(.

I wonder if a more general cleanup is necessary to avoid problems like this
in the future...

Stefan
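P.S. One shape such a cleanup could take is a device-wide "loading" flag
that turns re-entrant calls into no-ops until the load completes. This is a
sketch only; the loading field is hypothetical, not existing QEMU code:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        bool loading; /* hypothetical: set for the duration of load */
    } ToyVirtIODevice;

    static void toy_set_status(ToyVirtIODevice *vdev, unsigned status)
    {
        if (vdev->loading) {
            /* Device is only partially restored; refuse to act on it. */
            fprintf(stderr, "set_status(0x%x) ignored during load\n", status);
            return;
        }
        printf("status set to 0x%x\n", status);
    }

    static void toy_virtio_load(ToyVirtIODevice *vdev)
    {
        vdev->loading = true;
        /* ...restore vring state; the migration coroutine may yield here,
         * allowing callbacks like set_status() to fire re-entrantly... */
        toy_set_status(vdev, 0x0f); /* ignored: load still in progress */
        vdev->loading = false;
    }

    int main(void)
    {
        ToyVirtIODevice vdev = { .loading = false };
        toy_virtio_load(&vdev);
        toy_set_status(&vdev, 0x0f); /* acted on: load finished */
        return 0;
    }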