As David expected, when I trace process_incoming_migration_co (which prints the "readable" error) I see the error pop up in "qemu_loadvm_state". It appears as "Thread 4 "CPU 0/KVM" received signal SIGUSR1" and similar, which is just the guest breaking down.
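For reference, the "load of migration failed: Input/output error" lines in the logs look like the negative errno coming back from qemu_loadvm_state being turned into text on the incoming side; treat the exact call shape as my assumption. A tiny standalone illustration (all names are made up, only the message format matches the target log):

/* Toy illustration, not QEMU source: a failing load returns a negative
 * errno and the incoming side prints it with strerror(), which is how
 * -EIO shows up as "Input/output error" in the target libvirt log.
 * fake_loadvm_state() is a made-up stand-in. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int fake_loadvm_state(void)
{
    return -EIO;   /* stand-in for qemu_loadvm_state() failing */
}

int main(void)
{
    int ret = fake_loadvm_state();

    if (ret < 0) {
        /* same message format as the log line on the target */
        fprintf(stderr, "load of migration failed: %s\n", strerror(-ret));
        return 1;
    }
    return 0;
}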
Diving "into" qemu_loadvm_state reveals that it gets until "cpu_synchronize_all_pre_loadvm". In qemu_loadvm_state none of the initial checks fail. Then the "ret = vmstate_load_state(f, &vmstate_configuration, &savevm_state, 0);" seems to work fine was well. It seems reproducible in "cpu_synchronize_all_pre_loadvm" where the crash happens. I can catch the incoming qemu easily with: $ while ! pid=$(pidof qemu-system-x86_64); do /bin/true; done; gdb --pid ${pid} # Then in gdb break on "cpu_synchronize_all_pre_loadvm" # And when I step over it I the next thing I see is the "beginning of the end" for the process Thread 4 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1. [Switching to Thread 0x7f418136e700 (LWP 3887)] __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 The guest only has one vcpu, so CPU_FOREACH(cpu) is not much of a loop. Looking down that path I tracked it to along: cpu_synchronize_all_pre_loadvm -> cpu_synchronize_pre_loadvm -> kvm_cpu_synchronize_pre_loadvm -> do_run_on_cpu Here it queues the function "do_kvm_cpu_synchronize_pre_loadvm" onto the vcpu. That is done via queue_work_on_cpu(cpu, &wi); which in turn uses eventually "qemu_cpu_kick_thread(cpu);" That seems to trigger the first SIGUSR1 Following that I get the breakpoint that I set at "do_kvm_cpu_synchronize_pre_loadvm". The actual function only sets "cpu->vcpu_dirty = true;" and works. On the way out from there, there is a "qemu_kvm_wait_io_event" which leads to the next SIGUSR1. Going on I see another "do_run_on_cpu" with "vapic_do_enable_tpr_reporting" as the function. I set a breakpoint on that as well but took a full CPUstate before going on: p *cpu $4 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0 <g_free>, properties = 0x5ffe736e40, ref = 1, parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event = false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = { lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0, instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1, nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false, has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = true, created = true, stop = false, stopped = true, unplug = false, crash_occurred = false, exit_request = false, interrupt_request = 4, singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, initialized = true}, queued_work_first = 0x5fffefc990, queued_work_last = 0x5fffefc990, cpu_ases = 0x5ffe803c10, num_ases = 1, as = 0x5ffe7f9690, memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0, tb_jmp_cache = {0x0 <repeats 4096 times>}, gdb_regs = 0x0, gdb_num_regs = 57, gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev = 0x5ffc1783f0 <cpus>}, breakpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4430}, watchpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4440}, watchpoint_hit = 0x0, opaque = 0x0, mem_io_pc = 0, mem_io_vaddr = 0, kvm_fd = 19, kvm_state = 0x5ffe7357a0, kvm_run = 0x7f62374bc000, trace_dstate_delayed = {0}, trace_dstate = {0}, cpu_index = 0, halted = 1, can_do_io = 1, exception_index = -1, vcpu_dirty = true, throttle_thread_scheduled = false, icount_decr = {u32 = 0, u16 = 
Continuing, I hit the "vapic_do_enable_tpr_reporting" breakpoint with qemu still running. Things go on; the next candidate for "do_run_on_cpu" is "kvm_apic_put", with another SIGUSR1 to get it kicked, it seems. The "kvm_apic_put" breakpoint is reached after that kick. Next for "do_run_on_cpu" is "do_kvm_cpu_synchronize_post_init", and that triggered the fourth SIGUSR1. Before, I only saw four; hopefully that is still the same with this many breakpoints set. Checked the cpu state again:

1880 static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
1881 {
1882     kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
1883     cpu->vcpu_dirty = false;
1884 }
1885
(gdb) p cpu
$5 = (CPUState *) 0x5ffe7dc230
(gdb) p *cpu
$6 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0 <g_free>, properties = 0x5ffe736e40, ref = 1, parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event = false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = {lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0, instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1, nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false, has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = false, created = true, stop = false, stopped = true, unplug = false, crash_occurred = false, exit_request = false, interrupt_request = 4, singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, initialized = true}, queued_work_first = 0x0, queued_work_last = 0x0, cpu_ases = 0x5ffe803c10, num_ases = 1, as = 0x5ffe7f9690, memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0, tb_jmp_cache = {0x0 <repeats 4096 times>}, gdb_regs = 0x0, gdb_num_regs = 57, gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev = 0x5ffc1783f0 <cpus>}, breakpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4430}, watchpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4440}, watchpoint_hit = 0x0, opaque = 0x0, mem_io_pc = 0, mem_io_vaddr = 0, kvm_fd = 19, kvm_state = 0x5ffe7357a0, kvm_run = 0x7f62374bc000, trace_dstate_delayed = {0}, trace_dstate = {0}, cpu_index = 0, halted = 1, can_do_io = 1, exception_index = -1, vcpu_dirty = true, throttle_thread_scheduled = false, icount_decr = {u32 = 0, u16 = {low = 0, high = 0}}, hax_vcpu = 0x0, pending_tlb_flush = 7}

And from here, stepping into kvm_arch_put_registers:

kvm_arch_put_registers (cpu=cpu@entry=0x5ffe7dc230, level=level@entry=3) at ./target/i386/kvm.c:2591

That still is the same vcpu as all along; x86_cpu is unfortunately optimized out, as I had no full debug build with -O0. I see it setting up regs in kvm_arch_put_registers without error (all ret=0) and returning to do_kvm_cpu_synchronize_post_init, which eventually sets "cpu->vcpu_dirty = false;".
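As an aside, kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE) is, at the bottom, a series of ioctls on the vcpu fd (kvm_fd = 19 in the dumps above), along the lines of KVM_SET_REGS/KVM_SET_SREGS. A minimal standalone illustration of that ioctl interface, not QEMU code and with almost all state left untouched:

/* Minimal standalone illustration of the KVM vcpu-fd ioctl interface that
 * kvm_arch_put_registers() sits on top of.  Not QEMU code; it only reads
 * the registers of a freshly created (empty) vcpu and writes them back.
 * Build: cc kvm_regs_demo.c ; needs access to /dev/kvm. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct kvm_regs regs;
    int kvm, vm, vcpu;

    kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return 1;
    }
    vm = ioctl(kvm, KVM_CREATE_VM, 0);          /* new, empty VM */
    vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);       /* vcpu fd, like cpu->kvm_fd */
    if (vm < 0 || vcpu < 0) {
        perror("KVM_CREATE_VM/VCPU");
        return 1;
    }

    ioctl(vcpu, KVM_GET_REGS, &regs);           /* pull current register state */
    regs.rip = 0x1000;                          /* touch something, purely as a demo */
    if (ioctl(vcpu, KVM_SET_REGS, &regs) < 0) { /* push it back into the kernel */
        perror("KVM_SET_REGS");
        return 1;
    }
    printf("vcpu fd %d: registers written back, rip=0x%llx\n", vcpu, regs.rip);

    close(vcpu);
    close(vm);
    close(kvm);
    return 0;
}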
Since all of this seemed fine, I stepped through the "way out", and there came another "qemu_kvm_wait_io_event(cpu);". Not considering this critical, I stepped over it with "n" and qemu was gone with all its threads:

qemu_kvm_cpu_thread_fn (arg=0x5ffe7dc230) at ./cpus.c:1134
1134        } while (!cpu->unplug || cpu_can_run(cpu));
(gdb) n
1127        if (cpu_can_run(cpu)) {
(gdb) n
1133        qemu_kvm_wait_io_event(cpu);
(gdb) n
[Thread 0x7f6227857700 (LWP 8498) exited]

--
https://bugs.launchpad.net/bugs/1711602

Title:
  --copy-storage-all failing with qemu 2.10

Status in QEMU:
  New
Status in libvirt package in Ubuntu:
  Confirmed
Status in qemu package in Ubuntu:
  Confirmed

Bug description:
  We fixed an issue around disk locking already in regard to qemu-nbd [1], but there still seem to be issues.

  $ virsh migrate --live --copy-storage-all kvmguest-artful-normal qemu+ssh://10.22.69.196/system
  error: internal error: qemu unexpectedly closed the monitor: 2017-08-18T12:10:29.800397Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
  2017-08-18T12:10:48.545776Z qemu-system-x86_64: load of migration failed: Input/output error

  Source libvirt log for the guest:
  2017-08-18 12:09:08.251+0000: initiating migration
  2017-08-18T12:09:08.809023Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer
  2017-08-18T12:09:08.809481Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer

  Target libvirt log for the guest:
  2017-08-18T12:09:08.730911Z qemu-system-x86_64: load of migration failed: Input/output error
  2017-08-18 12:09:09.010+0000: shutting down, reason=crashed

  Given the timing it seems that the actual copy now works (it is busy for ~10 seconds in my environment, which would be the copy). We also no longer see the old errors we saw before; instead it fails afterwards on the actual take-over. Dmesg has no related denials, as apparmor is often in the mix. Need to check the libvirt logs of source [2] and target [3] in detail.

  [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg02200.html
  [2]: http://paste.ubuntu.com/25339356/
  [3]: http://paste.ubuntu.com/25339358/