From: Prasad Pandit <p...@fedoraproject.org> Hello,
* virsh(1) offers multiple options to initiate Postcopy migration: 1) virsh migrate --postcopy --postcopy-after-precopy 2) virsh migrate --postcopy + virsh migrate-postcopy 3) virsh migrate --postcopy --timeout <N> --timeout-postcopy When Postcopy migration is invoked via method (2) or (3) above, the guest on the destination host seems to hang or get stuck sometimes. * During Postcopy migration, multiple threads are spawned on the destination host to start the guest and setup devices. One such thread starts vhost device via vhost_dev_start() function and another called fault_thread handles page faults in user space using kernel's userfaultfd(2) system. When fault_thread exits upon completion of Postcopy migration, it sends a 'postcopy_end' message to the vhost-user device. But sometimes 'postcopy_end' message is sent while vhost device is being setup via vhost_dev_start(). Thread-1 Thread-2 vhost_dev_start postcopy_ram_incoming_cleanup vhost_device_iotlb_miss postcopy_notify vhost_backend_update_device_iotlb vhost_user_postcopy_notifier vhost_user_send_device_iotlb_msg vhost_user_postcopy_end process_message_reply process_message_reply vhost_user_read vhost_user_read vhost_user_read_header vhost_user_read_header "Fail to update device iotlb" "Failed to receive reply to postcopy_end" This creates confusion when vhost device receives 'postcopy_end' message while it is still trying to update IOTLB entries. This seems to leave the guest in a stranded/hung state because fault_thread has exited saying Postcopy migration has ended, but vhost-device is probably still expecting updates. QEMU logs following errors on the destination host === ... qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5. qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb qemu-kvm: vhost_user_postcopy_end: 700871,700900: Failed to receive reply to postcopy_end qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5. qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x8 instead of 0x5. qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x16 instead of 0x5. qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5. qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb === * Couple of patches here help to fix/handle these errors. Thank you. --- Prasad Pandit (2): vhost-user: add a write-read lock vhost: fail device start if iotlb update fails hw/virtio/vhost-user.c | 423 +++++++++++++++++++-------------- hw/virtio/vhost.c | 6 +- include/hw/virtio/vhost-user.h | 3 + 3 files changed, 259 insertions(+), 173 deletions(-) -- 2.45.2