On Wed, Aug 28, 2024 at 03:39:12PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit <p...@fedoraproject.org>
>
> Hello,
>
> * virsh(1) offers multiple options to initiate Postcopy migration:
>
>     1) virsh migrate --postcopy --postcopy-after-precopy
>     2) virsh migrate --postcopy + virsh migrate-postcopy
>     3) virsh migrate --postcopy --timeout <N> --timeout-postcopy
>
> When Postcopy migration is invoked via method (2) or (3) above,
> the migrated guest on the destination host hangs sometimes.
>
> * During Postcopy migration, multiple threads are spawned on the destination
> host to start the guest and setup devices. One such thread starts vhost
> device via vhost_dev_start() function and another called fault_thread handles
> page faults in user space using kernel's userfaultfd(2) system.
>
> * When fault_thread exits upon completion of Postcopy migration, it sends a
> 'postcopy_end' message to the vhost-user device. But sometimes 'postcopy_end'
> message is sent while vhost device is being setup via vhost_dev_start().
>
>     Thread-1                            Thread-2
>
>     vhost_dev_start                     postcopy_ram_incoming_cleanup
>     vhost_device_iotlb_miss             postcopy_notify
>     vhost_backend_update_device_iotlb   vhost_user_postcopy_notifier
>     vhost_user_send_device_iotlb_msg    vhost_user_postcopy_end
>     process_message_reply               process_message_reply
>     vhost_user_read                     vhost_user_read
>     vhost_user_read_header              vhost_user_read_header
>     "Fail to update device iotlb"       "Failed to receive reply to
>                                          postcopy_end"
>
> This creates confusion when vhost device receives 'postcopy_end' message while
> it is still trying to update IOTLB entries.
>
> This seems to leave the guest in a stranded/hung state because fault_thread
> has exited saying Postcopy migration has ended, but vhost-device is probably
> still expecting updates. QEMU logs following errors on the destination host
> ===
> ...
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header.
>  Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_postcopy_end: 700871,700900: Failed to receive reply to
>  postcopy_end
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header.
>  Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header.
>  Flags 0x8 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header.
>  Flags 0x16 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header.
>  Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> ===

So are we going to see a version with BQL?

> * Couple of patches here help to fix/handle these errors.
>
> Thank you.
> ---
> Prasad Pandit (2):
>   vhost: fail device start if iotlb update fails
>   vhost-user: add a request-reply lock
>
>  hw/virtio/vhost-user.c         | 74 ++++++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c              |  6 ++-
>  include/hw/virtio/vhost-user.h |  3 ++
>  3 files changed, 82 insertions(+), 1 deletion(-)
>
> --
> 2.46.0
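
To make the race in the quoted trace concrete: both threads write a request and then read a reply on the same channel, so one thread can end up consuming the reply meant for the other. Below is a minimal sketch of serialising the request/reply pair with a dedicated mutex. It is not the code from the series. struct channel, request_with_reply() and the echoing stand-in backend are all made up for illustration, and it uses plain pthreads rather than QEMU's own locking helpers.

```c
/* Sketch only: every name below is hypothetical, none of it is QEMU API. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One shared channel with a single reply slot, like one vhost-user socket. */
struct channel {
    pthread_mutex_t req_reply_lock; /* held across request + reply */
    uint32_t reply_slot;            /* what the stand-in backend answers */
};

static void channel_init(struct channel *ch)
{
    int rc = pthread_mutex_init(&ch->req_reply_lock, NULL);

    assert(rc == 0);
    ch->reply_slot = 0;
}

/* Stand-in backend: it answers a request by echoing the request id. */
static void send_request(struct channel *ch, uint32_t id)
{
    ch->reply_slot = id;
}

static uint32_t read_reply(struct channel *ch)
{
    return ch->reply_slot;
}

/*
 * Request and reply form one critical section, so a second thread
 * cannot slip its own request in between the two steps.
 * Returns 0 if the reply matches the request, -1 otherwise.
 */
static int request_with_reply(struct channel *ch, uint32_t id)
{
    uint32_t got;

    pthread_mutex_lock(&ch->req_reply_lock);
    send_request(ch, id);
    got = read_reply(ch);
    pthread_mutex_unlock(&ch->req_reply_lock);

    return got == id ? 0 : -1;
}

struct worker {
    struct channel *ch;
    uint32_t id;
    int rounds;
    int mismatches;
};

static void *worker_fn(void *opaque)
{
    struct worker *w = opaque;

    for (int i = 0; i < w->rounds; i++) {
        if (request_with_reply(w->ch, w->id) < 0) {
            w->mismatches++;
        }
    }
    return NULL;
}

/* Run two threads against one channel; return the mismatched-reply count. */
static int run_two_threads(int rounds)
{
    struct channel ch;
    struct worker a = { &ch, 1, rounds, 0 };
    struct worker b = { &ch, 2, rounds, 0 };
    pthread_t ta, tb;

    channel_init(&ch);
    pthread_create(&ta, NULL, worker_fn, &a);
    pthread_create(&tb, NULL, worker_fn, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    pthread_mutex_destroy(&ch.req_reply_lock);

    return a.mismatches + b.mismatches;
}
```

Calling run_two_threads() from a main() and building with -pthread exercises the locked path. Whether the pair is protected by a dedicated mutex like this or by a wider lock is a design choice; the sketch only shows the dedicated-mutex variant.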