As David expected, when I trace process_incoming_migration_co (which prints the "readable" error) I see the error pop up in "qemu_loadvm_state". It appears as "Thread 4 "CPU 0/KVM" received signal SIGUSR1" and similar, which is just the guest breaking down.
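For reference, the "load of migration failed: Input/output error" lines in the logs look like the negative errno coming back from qemu_loadvm_state being turned into text on the incoming side; treat the exact call shape as my assumption. A tiny standalone illustration (all names are made up, only the message format matches the target log):

/* Toy illustration, not QEMU source: a failing load returns a negative
 * errno and the incoming side prints it with strerror(), which is how
 * -EIO shows up as "Input/output error" in the target libvirt log.
 * fake_loadvm_state() is a made-up stand-in. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int fake_loadvm_state(void)
{
    return -EIO;   /* stand-in for qemu_loadvm_state() failing */
}

int main(void)
{
    int ret = fake_loadvm_state();

    if (ret < 0) {
        /* same message format as the log line on the target */
        fprintf(stderr, "load of migration failed: %s\n", strerror(-ret));
        return 1;
    }
    return 0;
}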
Diving "into" qemu_loadvm_state reveals that it gets until "cpu_synchronize_all_pre_loadvm". In qemu_loadvm_state none of the initial checks fail. Then the "ret = vmstate_load_state(f, &vmstate_configuration, &savevm_state, 0);" seems to work fine was well. It seems reproducible in "cpu_synchronize_all_pre_loadvm" where the crash happens. I can catch the incoming qemu easily with: $ while ! pid=$(pidof qemu-system-x86_64); do /bin/true; done; gdb --pid ${pid} # Then in gdb break on "cpu_synchronize_all_pre_loadvm" # And when I step over it I the next thing I see is the "beginning of the end" for the process Thread 4 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1. [Switching to Thread 0x7f418136e700 (LWP 3887)] __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 The guest only has one vcpu, so CPU_FOREACH(cpu) is not much of a loop. Looking down that path I tracked it to along: cpu_synchronize_all_pre_loadvm -> cpu_synchronize_pre_loadvm -> kvm_cpu_synchronize_pre_loadvm -> do_run_on_cpu Here it queues the function "do_kvm_cpu_synchronize_pre_loadvm" onto the vcpu. That is done via queue_work_on_cpu(cpu, &wi); which in turn uses eventually "qemu_cpu_kick_thread(cpu);" That seems to trigger the first SIGUSR1 Following that I get the breakpoint that I set at "do_kvm_cpu_synchronize_pre_loadvm". The actual function only sets "cpu->vcpu_dirty = true;" and works. On the way out from there, there is a "qemu_kvm_wait_io_event" which leads to the next SIGUSR1. Going on I see another "do_run_on_cpu" with "vapic_do_enable_tpr_reporting" as the function. I set a breakpoint on that as well but took a full CPUstate before going on: p *cpu $4 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0 <g_free>, properties = 0x5ffe736e40, ref = 1, parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event = false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = { lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0, instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1, nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false, has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = true, created = true, stop = false, stopped = true, unplug = false, crash_occurred = false, exit_request = false, interrupt_request = 4, singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, initialized = true}, queued_work_first = 0x5fffefc990, queued_work_last = 0x5fffefc990, cpu_ases = 0x5ffe803c10, num_ases = 1, as = 0x5ffe7f9690, memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0, tb_jmp_cache = {0x0 <repeats 4096 times>}, gdb_regs = 0x0, gdb_num_regs = 57, gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev = 0x5ffc1783f0 <cpus>}, breakpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4430}, watchpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4440}, watchpoint_hit = 0x0, opaque = 0x0, mem_io_pc = 0, mem_io_vaddr = 0, kvm_fd = 19, kvm_state = 0x5ffe7357a0, kvm_run = 0x7f62374bc000, trace_dstate_delayed = {0}, trace_dstate = {0}, cpu_index = 0, halted = 1, can_do_io = 1, exception_index = -1, vcpu_dirty = true, throttle_thread_scheduled = false, icount_decr = {u32 = 0, u16 = 
Continuing, I hit the "vapic_do_enable_tpr_reporting" breakpoint with qemu still running. Things go on; the next candidate for "do_run_on_cpu" is "kvm_apic_put", with another SIGUSR1 to get it kicked, it seems. The "kvm_apic_put" breakpoint is reached after that kick. Next for "do_run_on_cpu" is "do_kvm_cpu_synchronize_post_init", and that triggered the fourth SIGUSR1. Before, I only saw four; hopefully that is still the same with this many breakpoints set. Checked the cpu state again:

1880 static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
1881 {
1882     kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
1883     cpu->vcpu_dirty = false;
1884 }
1885
(gdb) p cpu
$5 = (CPUState *) 0x5ffe7dc230
(gdb) p *cpu
$6 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0 <g_free>, properties = 0x5ffe736e40, ref = 1, parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event = false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = {lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0, instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1, nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false, has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = false, created = true, stop = false, stopped = true, unplug = false, crash_occurred = false, exit_request = false, interrupt_request = 4, singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, initialized = true}, queued_work_first = 0x0, queued_work_last = 0x0, cpu_ases = 0x5ffe803c10, num_ases = 1, as = 0x5ffe7f9690, memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0, tb_jmp_cache = {0x0 <repeats 4096 times>}, gdb_regs = 0x0, gdb_num_regs = 57, gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev = 0x5ffc1783f0 <cpus>}, breakpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4430}, watchpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4440}, watchpoint_hit = 0x0, opaque = 0x0, mem_io_pc = 0, mem_io_vaddr = 0, kvm_fd = 19, kvm_state = 0x5ffe7357a0, kvm_run = 0x7f62374bc000, trace_dstate_delayed = {0}, trace_dstate = {0}, cpu_index = 0, halted = 1, can_do_io = 1, exception_index = -1, vcpu_dirty = true, throttle_thread_scheduled = false, icount_decr = {u32 = 0, u16 = {low = 0, high = 0}}, hax_vcpu = 0x0, pending_tlb_flush = 7}

And from here, stepping into kvm_arch_put_registers:

kvm_arch_put_registers (cpu=cpu@entry=0x5ffe7dc230, level=level@entry=3) at ./target/i386/kvm.c:2591

That still is the same vcpu as all along; x86_cpu is unfortunately optimized out, as I had no full debug build with -O0. I see it setting up regs in kvm_arch_put_registers without error (all ret=0) and returning to do_kvm_cpu_synchronize_post_init, which eventually sets "cpu->vcpu_dirty = false;".
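As an aside, kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE) is, at the bottom, a series of ioctls on the vcpu fd (kvm_fd = 19 in the dumps above), along the lines of KVM_SET_REGS/KVM_SET_SREGS. A minimal standalone illustration of that ioctl interface, not QEMU code and with almost all state left untouched:

/* Minimal standalone illustration of the KVM vcpu-fd ioctl interface that
 * kvm_arch_put_registers() sits on top of.  Not QEMU code; it only reads
 * the registers of a freshly created (empty) vcpu and writes them back.
 * Build: cc kvm_regs_demo.c ; needs access to /dev/kvm. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct kvm_regs regs;
    int kvm, vm, vcpu;

    kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return 1;
    }
    vm = ioctl(kvm, KVM_CREATE_VM, 0);          /* new, empty VM */
    vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);       /* vcpu fd, like cpu->kvm_fd */
    if (vm < 0 || vcpu < 0) {
        perror("KVM_CREATE_VM/VCPU");
        return 1;
    }

    ioctl(vcpu, KVM_GET_REGS, &regs);           /* pull current register state */
    regs.rip = 0x1000;                          /* touch something, purely as a demo */
    if (ioctl(vcpu, KVM_SET_REGS, &regs) < 0) { /* push it back into the kernel */
        perror("KVM_SET_REGS");
        return 1;
    }
    printf("vcpu fd %d: registers written back, rip=0x%llx\n", vcpu, regs.rip);

    close(vcpu);
    close(vm);
    close(kvm);
    return 0;
}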
Since all of this seemed fine, I stepped through the "way out", and there came another "qemu_kvm_wait_io_event(cpu);". Not considering this critical, I stepped over it with "n" and qemu was gone with all its threads:

qemu_kvm_cpu_thread_fn (arg=0x5ffe7dc230) at ./cpus.c:1134
1134        } while (!cpu->unplug || cpu_can_run(cpu));
(gdb) n
1127        if (cpu_can_run(cpu)) {
(gdb) n
1133        qemu_kvm_wait_io_event(cpu);
(gdb) n
[Thread 0x7f6227857700 (LWP 8498) exited]

--
https://bugs.launchpad.net/bugs/1711602

Title:
  --copy-storage-all failing with qemu 2.10

Status in QEMU:
  New
Status in libvirt package in Ubuntu:
  Confirmed
Status in qemu package in Ubuntu:
  Confirmed

Bug description:
  We fixed an issue around disk locking already in regard to qemu-nbd [1], but there still seem to be issues.

  $ virsh migrate --live --copy-storage-all kvmguest-artful-normal qemu+ssh://10.22.69.196/system
  error: internal error: qemu unexpectedly closed the monitor: 2017-08-18T12:10:29.800397Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
  2017-08-18T12:10:48.545776Z qemu-system-x86_64: load of migration failed: Input/output error

  Source libvirt log for the guest:
  2017-08-18 12:09:08.251+0000: initiating migration
  2017-08-18T12:09:08.809023Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer
  2017-08-18T12:09:08.809481Z qemu-system-x86_64: Unable to read from socket: Connection reset by peer

  Target libvirt log for the guest:
  2017-08-18T12:09:08.730911Z qemu-system-x86_64: load of migration failed: Input/output error
  2017-08-18 12:09:09.010+0000: shutting down, reason=crashed

  Given the timing it seems that the actual copy now works (it is busy for ~10 seconds in my environment, which would be the copy). We also no longer see the old errors we saw before; instead it fails afterwards on the actual take-over. Dmesg has no related denials, as apparmor is often in the mix. Need to check the libvirt logs of source [2] and target [3] in detail.

  [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg02200.html
  [2]: http://paste.ubuntu.com/25339356/
  [3]: http://paste.ubuntu.com/25339358/