On Thu, 2021-03-04 at 11:15 +0100, Stefano Garzarella wrote: > On Wed, Mar 03, 2021 at 02:07:24PM +0200, Maxim Levitsky wrote: > > On Tue, 2021-03-02 at 15:52 +0100, Stefano Garzarella wrote: > > > On Mon, Mar 01, 2021 at 02:56:40PM +0200, Maxim Levitsky wrote: > > > > On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote: > > > > > If kvm_arch_remove_sw_breakpoint finds that a software breakpoint > > > > > does not > > > > > have an INT3 instruction, it fails. This can happen if one sets a > > > > > software breakpoint in a kernel module and then reloads it. gdb then > > > > > thinks the breakpoint cannot be deleted and there is no way to add it > > > > > back. > > > > > > > > > > Suggested-by: Maxim Levitsky <mlevi...@redhat.com> > > > > > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > > > > > --- > > > > > target/i386/kvm/kvm.c | 9 +++++++-- > > > > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > > > > > index 0b5755e42b..c8d61daf68 100644 > > > > > --- a/target/i386/kvm/kvm.c > > > > > +++ b/target/i386/kvm/kvm.c > > > > > @@ -4352,8 +4352,13 @@ int kvm_arch_remove_sw_breakpoint(CPUState > > > > > *cs, struct kvm_sw_breakpoint *bp) > > > > > { > > > > > uint8_t int3; > > > > > > > > > > - if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0) || int3 != 0xcc > > > > > || > > > > > - cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, > > > > > 1, 1)) { > > > > > + if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0)) { > > > > > + return -EINVAL; > > > > > + } > > > > > + if (int3 != 0xcc) { > > > > > + return 0; > > > > > + } > > > > > + if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, > > > > > 1, 1)) { > > > > > return -EINVAL; > > > > > } > > > > > return 0; > > > > > > > > There still remains a philosopical question if > > > > kvm_arch_remove_sw_breakpoint > > > > should always return 0, since for the usual case of kernel debugging > > > > where > > > > a breakpoint is in unloaded module, the above will probably still fail > > > > as paging for this module is removed as well by the kernel. > > > > It is still better though so: > > > > > > > > Reviewed-by: Maxim Levitsky <mlevi...@redhat.com> > > > > > > > > Note that I managed to make lx-symbols to work in a very stable way > > > > with attached WIP patch I hacked on this Sunday. > > > > I will send a cleaned up version of it to upstream when I have time. > > > > > > > > Since I make gdb unload the symbols, it works even without this patch. > > > > > > > > Added Stefano Garzarella to CC as I know that he tried to make this > > > > work as well. > > > > https://lkml.org/lkml/2020/10/5/514 > > > > > > Thanks Maxim for CCing me! > > > > > > Just a report when I tried these patches, but I'm not sure they are > > > related. > > > > > > I found that gdb 10 has some problem with QEMU: > > > > > > $ gdb --version > > > GNU gdb (GDB) Fedora 10.1-2.fc33 > > > > > > (gdb) lx-symbols > > > loading vmlinux > > > scanning for modules in linux/build > > > ../../gdb/dwarf2/frame.c:1085: internal-error: Unknown CFA rule. > > > > > > With gdb 9 'lx-symbols' works well, but I still have some issue when I > > > put a breakpoint to a symbol not yet loaded (vsock_core_register in this > > > example), then I load the module (vsock_loopback in this example) in the > > > guest. > > > > > > Whit your patch gdb stuck after loading the symbols of the first new > > > module: > > > (gdb) b vsock_core_register > > > Function "vsock_core_register" not defined. > > > Make breakpoint pending on future shared library load? (y or [n]) y > > > Breakpoint 1 (vsock_core_register) pending. > > > (gdb) c > > > Continuing. > > > loading @0xffffffffc00a1000: linux/build/net/vmw_vsock/vsock.ko > > > > > > Without your patch, gdb loops infinitely reloading the new module: > > > refreshing all symbols to reload module 'vsock' > > > loading vmlinux > > > loading @0xffffffffc00a1000: linux/build/net/vmw_vsock/vsock.ko > > > loading @0xffffffffc00ad000: linux/build/drivers/net/tun.ko > > > loading @0xffffffffc007e000: linux/build/net/bridge/bridge.ko > > > loading @0xffffffffc0077000: linux/build/net/802/stp.ko > > > loading @0xffffffffc0007000: linux/build/net/llc/llc.ko > > > loading @0xffffffffc0013000: linux/build/net/sunrpc/sunrpc.ko > > > loading @0xffffffffc000d000: > > > linux/build/net/ipv4/netfilter/ip_tables.ko > > > loading @0xffffffffc0000000: linux/build/net/netfilter/x_tables.ko > > > refreshing all symbols to reload module 'vsock' > > > loading vmlinux > > > loading @0xffffffffc00a1000: linux/build/net/vmw_vsock/vsock.ko > > > loading @0xffffffffc00ad000: linux/build/drivers/net/tun.ko > > > ... > > > > > > I'll try to get a better look at what's going on. > > > > Let me then explain what I found: > > > > First of all initial execution of lx-symbols works and always worked for me > > (I use gdb 9 though from fedora 32 (9.1-7.fc32)) > > > > > > Then a special breakpoint is placed on do_init_module > > (it is hidden from the user) > > > > Once it is hit two things can happen: > > > > 1. if a not yet seen module is loaded, > > (module that wasn't loaded last time all symbols were reloaded) > > its symbols are loaded to gdb with 'add-symbol-file' command. > > > > 2. if module that was already loaded to gdb, is loaded (see above), > > then 'big hammer' is used: > > > > a. all existing breakpoints (including the hidden one) > > are disabled since reloading everything > > indeed messes up the gdb state > > > > b. the executable's symbols (the kernel) are reloaded, > > which makes gdb unload all symbols, and then all symbols are loaded > > again (in the same way as lx-symbols works), > > including the symbols of currently loading module. > > > > c. all breakpoints are enabled again > > > > > > Now the issue that you originally reported on LKML is because the (1) > > apparently also messes up the software breakpoints state in gdb, > > and that makes gdb to not to temporary remove the breakpoint > > in do_init_module once the execution is resumed, and then > > the guest starts executing garbage (bytes after 'int3' instruction). > > > > The second issue is that (2), aka the big hammer isn't really needed. > > GDB does have (maybe this is recent command?) a 'remove-symbol-file' > > command. > > > > So it is possible to do remove-symbol-file/add-symbol-file' > > on known module reload instead of reloading everything. > > > > But this has a small issue. The issue is that such known module > > was already reloaded, so all int3 instructions that gdb placed into > > it are already gone, so sofware breakpoints placed to it won't work > > This is what the patch that Paulo sent fixes. > > > > However it is even better to create another hidden breakpoint on module > > removal > > path (I used free_module for that) and unload the symbols there. > > This allows the gdb to cleanly remove all software breakpoints > > in that module, show them again as pending, and even re-install > > them again once the module is loaded again. > > > > So those are the two changes I made: > > > > 1. I added a breakpoint on module removal which also does > > a. disable all breakpoints > > b. unload the module's symbols > > c. enable all breakpoints > > > > 2. On module load I also do > > a. disable all breakpoints > > b. load the module's symbols > > c. enable all breakpoints > > > > > > There is still an issue that both 'load' and 'unload' breakpoint > > can hit more that once. > > This happens because these are software breakpoints and > > load/unload code in the kernel is executed with interrupts enabled. > > > > So what is happening is that once the hidden breakpoint is hit, gdb script > > attached to it is done, and VM is resumed, gdb does more or less this: > > > > a. remove the breakpoint > > b. do a single step > > c. re-install the breakpoint. > > > > However the single step more often than not, lands us into an interrupt > > handler, and so once the handler is finished we end up re-executing the > > instruction on which breakpoint was set. > > On a single vCPU it works more or less (with several tries) on my machine, > > but with many vCPUs, I can end up in live lock like state > > when the above prevents the VM from progressing. > > > > I think we can fix this on kvm side by not injecting interrupts > > when single step is done. > > Thank you so much for this detailed description, very much appreciated! > > > In fact the below patch works for me, > > Not only it fixes the live lock but makes these hidden breakpoints > > hit only once in my testing. > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index eec62c0dafc36..8b7a4e27bcf66 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8501,6 +8501,12 @@ static void inject_pending_event(struct kvm_vcpu > > *vcpu, bool *req_immediate_exit > > goto busy; > > } > > > > + /* > > + * Don't inject interrupts while single stepping to make guest > > debug easier > > + */ > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > > + can_inject = false; > > + > > > > With this patch lx-symbols works almost perfectly for me (on AMD). > > I'll try this patch!
Note that the patch to disable interrupt injection while single stepping is wrong, due to some whatever mistake I made while rebasing things. I attached an updated version. With it (and the patch to lx-symbols itself which should be applied to the guest kernel), I was able to run lx-symbols very well on both intel and AMD. Note that I compile the guest kernel in the guest and install, but then I copy it to the host, and I run the gdb from there. This is my debug.sh script: vm adm hmp "gdbserver tcp::2345" gdb ~/Kernel/vm-fedora/src/vmlinux \ -ex "target remote :2345" \ -ex "cd /home/mlevitsk/Kernel/vm-fedora/src" \ -ex "lx-symbols" \ -ex "cont" Note that the debug must start after guest kernel was loaded. Best regards, Maxim Levitsky > > > The only remaing issue (that might be not easy to fix) > > is that I still can't place breakpoints in __init module code. > > That code is physically removed from the kernel once the module is done > > loading, > > and it seems that its debug info isn't correct to even make a hardware > > breakpoint work on it. > > (gdb shows very small addresses for these symbols). > > > > As for why this doesn't work for you I have 3 theories: > > > > 1. The whole 'reload symbols on breakpoint' is forbidden > > by gdb in the manual, that is one of the reasons that I had > > to disable software breakpoints prior to doing this. > > There might be other things that break in different gdb versions. > > I don't see a way to make it work without doing this. > > Maybe this can also be the reason why gdb 10 doesn't work for me. > I should investigate further. > > > 2. Maybe you test that on Intel and something is broken there > > on KVM level (I tested only AMD). > > Yes, I'm on Intel. > > > 3. Maybe you debug a nested guest? I haven't tested if guest debug > > works fine in this configuration. > > Nope, I'm not debugging nested guest in that case. > > Thanks for your help, > Stefano >
From c0fe19355f5c99bd8d1e6c5345fb32f259be8131 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky <mlevi...@redhat.com> Date: Sun, 28 Feb 2021 23:52:08 +0200 Subject: [PATCH 1/2] gdb: rework gdb debug script Signed-off-by: Maxim Levitsky <mlevi...@redhat.com> --- scripts/gdb/linux/symbols.py | 96 ++++++++++++++++++++++++++---------- 1 file changed, 71 insertions(+), 25 deletions(-) diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py index 1be9763cf8bb2..13f21afc2059d 100644 --- a/scripts/gdb/linux/symbols.py +++ b/scripts/gdb/linux/symbols.py @@ -17,6 +17,24 @@ import re from linux import modules, utils +def save_state(): + breakpoints = [] + if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None: + for bp in gdb.breakpoints(): + breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled}) + bp.enabled = False + + show_pagination = gdb.execute("show pagination", to_string=True) + pagination = show_pagination.endswith("on.\n") + gdb.execute("set pagination off") + + return {"breakpoints":breakpoints, "show_pagination": show_pagination} + +def load_state(state): + for breakpoint in state["breakpoints"]: + breakpoint['breakpoint'].enabled = breakpoint['enabled'] + gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else "off")) + if hasattr(gdb, 'Breakpoint'): class LoadModuleBreakpoint(gdb.Breakpoint): @@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'): module_name = module['name'].string() cmd = self.gdb_command + # module already loaded, false alarm + if module_name in cmd.loaded_modules: + return False + # enforce update if object file is not found cmd.module_files_updated = False # Disable pagination while reporting symbol (re-)loading. # The console input is blocked in this context so that we would # get stuck waiting for the user to acknowledge paged output. - show_pagination = gdb.execute("show pagination", to_string=True) - pagination = show_pagination.endswith("on.\n") - gdb.execute("set pagination off") + state = save_state() + cmd.load_module_symbols(module) + load_state(state) + return False - if module_name in cmd.loaded_modules: - gdb.write("refreshing all symbols to reload module " - "'{0}'\n".format(module_name)) - cmd.load_all_symbols() - else: - cmd.load_module_symbols(module) + class UnLoadModuleBreakpoint(gdb.Breakpoint): + def __init__(self, spec, gdb_command): + super(UnLoadModuleBreakpoint, self).__init__(spec, internal=True) + self.silent = True + self.gdb_command = gdb_command - # restore pagination state - gdb.execute("set pagination %s" % ("on" if pagination else "off")) + def stop(self): + module = gdb.parse_and_eval("mod") + module_name = module['name'].string() + cmd = self.gdb_command + if not module_name in cmd.loaded_modules: + return False + + state = save_state() + cmd.unload_module_symbols(module) + load_state(state) return False @@ -64,8 +94,9 @@ lx-symbols command.""" module_paths = [] module_files = [] module_files_updated = False - loaded_modules = [] + loaded_modules = {} breakpoint = None + ubreakpoint = None def __init__(self): super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES, @@ -129,21 +160,32 @@ lx-symbols command.""" filename=module_file, addr=module_addr, sections=self._section_arguments(module)) + gdb.execute(cmdline, to_string=True) - if module_name not in self.loaded_modules: - self.loaded_modules.append(module_name) + self.loaded_modules[module_name] = {"module_file": module_file, + "module_addr": module_addr} else: gdb.write("no module object found for '{0}'\n".format(module_name)) + def unload_module_symbols(self, module): + module_name = module['name'].string() + + module_file = self.loaded_modules[module_name]["module_file"] + module_addr = self.loaded_modules[module_name]["module_addr"] + + gdb.write("unloading @{addr}: {filename}\n".format( + addr=module_addr, filename=module_file)) + cmdline = "remove-symbol-file {filename}".format( + filename=module_file) + + gdb.execute(cmdline, to_string=True) + del self.loaded_modules[module_name] + + def load_all_symbols(self): gdb.write("loading vmlinux\n") - # Dropping symbols will disable all breakpoints. So save their states - # and restore them afterward. - saved_states = [] - if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None: - for bp in gdb.breakpoints(): - saved_states.append({'breakpoint': bp, 'enabled': bp.enabled}) + state = save_state() # drop all current symbols and reload vmlinux orig_vmlinux = 'vmlinux' @@ -153,15 +195,14 @@ lx-symbols command.""" gdb.execute("symbol-file", to_string=True) gdb.execute("symbol-file {0}".format(orig_vmlinux)) - self.loaded_modules = [] + self.loaded_modules = {} module_list = modules.module_list() if not module_list: gdb.write("no modules found\n") else: [self.load_module_symbols(module) for module in module_list] - for saved_state in saved_states: - saved_state['breakpoint'].enabled = saved_state['enabled'] + load_state(state) def invoke(self, arg, from_tty): self.module_paths = [os.path.expanduser(p) for p in arg.split()] @@ -177,8 +218,13 @@ lx-symbols command.""" if self.breakpoint is not None: self.breakpoint.delete() self.breakpoint = None - self.breakpoint = LoadModuleBreakpoint( - "kernel/module.c:do_init_module", self) + self.breakpoint = LoadModuleBreakpoint("kernel/module.c:do_init_module", self) + + if self.ubreakpoint is not None: + self.ubreakpoint.delete() + self.ubreakpoint = None + self.ubreakpoint = UnLoadModuleBreakpoint("kernel/module.c:free_module", self) + else: gdb.write("Note: symbol update on module loading not supported " "with this gdb version\n") -- 2.26.2
From 8619f7c836156b5704c32aca7b7808ef344989c3 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky <mlevi...@redhat.com> Date: Wed, 3 Mar 2021 13:05:34 +0200 Subject: [PATCH 2/2] KVM: x86: Guest debug: don't inject interrupts while single stepping Signed-off-by: Maxim Levitsky <mlevi...@redhat.com> --- arch/x86/kvm/x86.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b140fce829286..ab3a598520ff4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8542,6 +8542,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit goto busy; } + /* + * Don't inject interrupts while single stepping to make guest debug easier + */ + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) + return; + /* * Finally, inject interrupt events. If an event cannot be injected * due to architectural conditions (e.g. IF=0) a window-open exit -- 2.26.2