Hi Pavel,

Thanks for all your feedback.
I’ve been following the discussion closely and find your approach quite interesting.

As Jim explained, I’m also trying to have a conditional breakpoint that is able to stop a specific thread (name or id) when the condition expression evaluates to true.

I feel like stacking up options with your approach would imply doing more context switches, but it’s definitely a better fallback mechanism than the current one. I’ll try to make a prototype to see the performance difference between both approaches.

> On Aug 15, 2019, at 10:10 AM, Pavel Labath <pa...@labath.sk> wrote:
>
> Hello Ismail, and welcome to LLDB. You have a very interesting (and not entirely trivial) project, and I wish you the best of luck in your work. I think this will be a very useful addition to lldb.
>
> It sounds like you have researched the problem very well, and the overall direction looks good to me. However, I do have some ideas and suggestions about possible tweaks/improvements that I would like to hear your thoughts on. Please find my comments inline.
>
> On 14/08/2019 22:52, Ismail Bennani via lldb-dev wrote:
>> Hi everyone,
>>
>> I’m Ismail, a compiler engineer intern at Apple. As a part of my internship, I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
>>
>> Currently, the expressions that power conditional breakpoints are lowered to LLVM IR and LLDB knows how to interpret a subset of it. If that fails, the debugger JIT-compiles the expression (compiled once, and re-run on each breakpoint hit). In both cases LLDB must collect all program state used in the condition and pass it to the expression.
>>
>> The goal of my internship project is to make conditional breakpoints faster by:
>>
>> 1. Compiling the expression ahead-of-time, when setting the breakpoint, and injecting it into the inferior memory only once.
>> 2. Re-routing the inferior execution flow to run the expression and check whether it needs to stop, in-process.
>>
>> This saves the cost of having to do the context switch between the debugger and the inferior program (about 10 times) to compile and evaluate the condition.
>>
>> This feature is described on the [LLDB Project page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
>>
>> The goal would be to have it working for most languages and architectures supported by LLDB; however, my original implementation will be for C-based languages targeting x86_64. It will be extended to AArch64 afterwards.
>>
>> Note that the way my prototype is implemented makes it fully extensible to other languages and architectures.
>>
>> ## High Level Design
>>
>> Every time a breakpoint that holds a condition is hit, multiple context switches are needed in order to compile and evaluate the condition.
>>
>> First, the breakpoint is hit and control is given to the debugger. That's where LLDB wraps the condition expression into a UserExpression that will get compiled and injected into the program memory. Another round-trip between the inferior and LLDB is needed to run the compiled expression and extract the expression results that will tell LLDB to stop or not.
>>
>> To get rid of those context switches, we will evaluate the condition inside the program, and only stop when the condition is true. LLDB will achieve this by inserting a jump from the breakpoint address to a code section that will be allocated into the program memory. It will save the thread state, run the condition expression, restore the thread state and then execute the copied instruction(s) before jumping back to the regular program flow.
>>
>> Then we only trap and return control to LLDB when the condition is true.
>>
>> ## Implementation Details
>>
>> To be able to evaluate a breakpoint condition without interacting with the debugger, LLDB changes the inferior program execution flow by overwriting the instruction at which the breakpoint was set with a branching instruction.
>>
>> The original instruction(s) are copied to a memory stub allocated in the inferior program memory called the __Fast Conditional Breakpoint Trampoline__ or __FCBT__. The FCBT will allow us to re-route the program execution flow to check the condition in-process while preserving the original program behavior. This part is critical to set up Fast Conditional Breakpoints.
>>
>> ```
>>       Inferior Binary                                    Trampoline
>> |            .            |                      +-------------------------+
>> |            .            |                      |                         |
>> |            .            |           +--------->+  Save RegisterContext   |
>> |            .            |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         |           |          |                         |
>> |       Instruction       |           |          | Build Arguments Struct  |
>> |                         |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         +-----------+          |                         |
>> |  Branch to Trampoline   |                      | Call Condition Checker  |
>> |                         +<----------+          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         |           |          |                         |
>> |       Instruction       |           |          | Restore RegisterContext |
>> |                         |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |            .            |           |          |                         |
>> |            .            |           +----------+ Run Copied Instructions |
>> |            .            |                      |                         |
>> |            .            |                      +-------------------------+
>> ```
>>
>> Once the execution reaches the Trampoline, several steps need to be taken.
>>
>> LLDB relies on its UserExpressions to JIT these more complex conditional expressions. However, since the execution will be handled by the debugged program, LLDB will generate some code ahead-of-time in the Trampoline that will allow the inferior to initialize the expression's argument structure.
>>
>> Generating the condition checker, as well as the code to initialize the argument structure on each breakpoint hit, is handled by the __BreakpointInjectedSite__ class, which builds the conditional expression for all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates variables in the `$__lldb_arg` structure.
>>
>> BreakpointInjectedSites are created in the __Process__ if the user enables the `-I | --inject-condition` flag when setting or modifying a breakpoint.
>>
>> Because the __FCBT__ is architecture specific, BreakpointInjectedSites will only be available when a target has added support for it, in the matching Architecture Plugin.
>>
>> Several parts of lldb have to be modified to implement this feature:
>>
>> - **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the related classes (Breakpoint, BreakpointLocation, BreakpointSite, BreakpointOptions)
>> - **Plugins**: Added ObjectFileTrampoline for the unwinding; added x86_64 ABI support (FCBT setup & safety checks)
>> - **Symbol**: Changed `FuncUnwinders` and `UnwindPlan` to support FCBT
>> - **Target**: Added BreakpointInjectedSite creation to `Process` to insert the jump to the FCBT; added the Trampoline module creation to `ABI` for the unwinding
>>
>> ### Breakpoint Option
>>
>> Since Fast Conditional Breakpoints are still under development, they will not be on by default, but rather we will provide a flag to "breakpoint set" and "breakpoint modify" to enable the feature. Note that the end-goal is to have them as a default and only fall back to regular conditional breakpoints on unsupported architectures.
>>
>> They can be enabled using the `-I | --inject-condition` option. This option can also be enabled through the Python Scripting Bridge public API, using the `InjectCondition(bool enable)` method on an __SBBreakpoint__ or __SBBreakpointLocation__ object.
>>
>> This feature is intended to be used with condition expressions (`-c <expr> | --condition <expr>`), but also with other condition types such as:
>>
>> - Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
>> - Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
>> - Thread Queue Name
>>
>> ### Trampoline
>>
>> To be able to inject the condition, we need to re-route the debugged program's execution flow. This part is handled in the __Trampoline__, a memory stub allocated in the inferior that will contain the condition check while preserving the program's original behavior.
>>
>> The trampoline is architecture specific and built by lldb. To have the condition evaluation work out-of-place, several steps need to be completed:
>>
>> 1. Save all the registers by pushing them to the stack
>> 2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction
>> 3. Check the condition by calling the injected UserExpression and execute a trap if the condition is true.
>> 4. Restore register context
>> 5. Rewrite the operands of the copied original instructions and run them
>>
>> All the values needed for the steps can be computed ahead of time, when the breakpoint is set (e.g. size of the allocation, jump address, relocation ...).
>>
>> Since the x86_64 ISA has variable instruction size, LLDB moves enough instructions into the trampoline to be able to overwrite them with a jump to the trampoline. Also, the allocation region for the trampoline might be too far away for a single jump, so we might need to have several branch islands before reaching the trampoline (WIP).
>>
>> ### BreakpointInjectedSite
>>
>> To handle the Fast Conditional Breakpoint setup, LLDB uses __BreakpointInjectedSites__, which is a sub-class of the BreakpointSite class. BreakpointInjectedSites use different `UserExpression`s to resolve variables and inject the condition checker.
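
On the Trampoline section's point about variable-length x86_64 instructions and trampolines that may be out of range, here is a minimal sketch of the two checks involved. It assumes the branch is a 5-byte `jmp rel32` (the RFC does not say which encoding the prototype emits), and `InstructionsToCopy` and `ReachableWithRel32` are hypothetical names for illustration, not LLDB functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Size of an x86_64 `jmp rel32`: opcode 0xE9 plus a 32-bit displacement.
// Assumption: this is the encoding used to branch to the trampoline.
constexpr std::size_t kJumpSize = 5;

// Given the byte lengths of the instructions starting at the breakpoint
// address, return how many whole instructions must be moved to the
// trampoline so that the jump fits. Returns 0 if there are not enough bytes.
std::size_t InstructionsToCopy(const std::vector<std::size_t> &lengths) {
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    bytes += lengths[i];
    if (bytes >= kJumpSize)
      return i + 1;
  }
  return 0;
}

// True if a single `jmp rel32` placed at `from` can reach `to`.
// The displacement is measured from the end of the jump instruction.
bool ReachableWithRel32(std::uint64_t from, std::uint64_t to) {
  const std::int64_t disp = static_cast<std::int64_t>(to - (from + kJumpSize));
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}
```

For instance, with instruction lengths of 1, 3 and 4 bytes at the breakpoint address, three instructions have to be moved, because the first two only cover 4 bytes. When `ReachableWithRel32` is false, that is the case where a branch island would be needed.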
>>
>> #### Condition Checker
>>
>> Because a BreakpointSite can have multiple BreakpointLocations with different conditions, LLDB first needs to iterate over each owner of the BreakpointSite and gather all the conditions. If one of the BreakpointLocations doesn't have a condition or the condition is not set to be injected, the BreakpointInjectedSite will behave as a regular BreakpointSite.
>>
>> Once all the conditions are fetched, LLDB will create a __UserExpression__ with the injected trap instruction.
>>
>> When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap address to a BreakpointSite, to identify where to stop. To allow LLDB to catch the injected trap at runtime, it will disassemble the compiled expression and scan for the trap address. The injected trap address is then added to LLDB's __BreakpointSiteList__.
>>
>> When generated, this is what the condition checker looks like:
>>
>> ```cpp
>> void $__lldb_expr(void *$__lldb_arg)
>> {
>>     /*lldb_BODY_START*/
>>     if (condition) {
>>         __builtin_debugtrap();
>>     };
>>     /*lldb_BODY_END*/
>> }
>> ```
>>
>> #### Argument Builder
>>
>> The conditional expression will often refer to local variables, and the references to these variables need to be tied to the instances of them in the current frame.
>>
>> Usually the expression evaluator invokes the __Materializer__, which fetches the variables' values and fills the `$__lldb_arg` structure. But since we don't want to switch contexts, LLDB has to resolve the used variables by generating code that will initialize the `$__lldb_arg` pointer before running the condition checker.
>>
>> That's where the __Argument Builder__ comes in.
>>
>> The argument builder uses a `UtilityFunction` to generate the `$__lldb_create_args_struct` function. It is called by the Trampoline before the condition checker, in order to resolve variables used in the condition expression.
>>
>> `$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:
>>
>> 1. It takes advantage of the fact that LLDB saved all the registers to the stack and maps them into a `register_context` structure.
>>
>> ```cpp
>> typedef struct {
>>     // General Purpose Registers
>> } register_context;
>> ```
>>
>> 2. Using information from the variable resolver, it allocates a memory stub that will contain the used variable addresses.
>> 3. Then, it will use the register values and the collected metadata to compute the used variable address and write that into the newly allocated structure.
>> 4. Finally the allocated structure is returned to the trampoline, which will pass it as an argument to the injected condition checker.
>
> I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the $__lldb_arg building code.

Allocating the $__lldb_arg struct on the stack is on my to-do list. This will change in the coming revisions.

>
>> Since `$__lldb_create_args_struct` uses the same JIT Engine as the UserExpression, LLDB will parse, build and insert it in the program memory.
>>
>> #### Variable Resolver
>>
>> When creating a Fast Conditional Breakpoint, the __debug info__ tells us where the used variables are located. Using this information and the saved register context, we can generate code that will resolve the variables at runtime (__Step 3 of the Argument Builder__).
>>
>> LLDB will first get the `DeclMap` from the condition UserExpression and pull a list of the used variables. While iterating on that list, LLDB extracts each variable's __DWARF Expression__.
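
To make steps 1 to 3 of the argument builder concrete, here is a minimal sketch for the simplest case, where each variable lives at a fixed offset from the frame base (a `DW_OP_fbreg` location). It assumes the frame base is `rbp`, and that the caller provides the output buffer, as in the stack-allocation variant discussed above. The fields of `register_context` and `ArgumentMetadata` and the `BuildArgs` name are made up for illustration and may not match the patches.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: only the one saved register this sketch needs.
struct register_context {
  std::uint64_t rbp; // assumed frame base on x86_64
};

// Hypothetical per-variable metadata, computed when the breakpoint is set.
struct ArgumentMetadata {
  std::int64_t fbreg_offset; // operand of DW_OP_fbreg (often negative)
  std::size_t slot;          // index of the variable in the args buffer
};

// Write the address of every variable used by the condition into `args`.
// `args` must have room for the largest `slot` value plus one.
void BuildArgs(const register_context &regs, const ArgumentMetadata *vars,
               std::size_t count, std::uintptr_t *args) {
  for (std::size_t i = 0; i < count; ++i) {
    // address = frame base + signed offset (unsigned wrap-around is intended)
    const std::uint64_t addr =
        regs.rbp + static_cast<std::uint64_t>(vars[i].fbreg_offset);
    args[vars[i].slot] = static_cast<std::uintptr_t>(addr);
  }
}
```

With `rbp = 0x7000` and offsets of -8 and -16, the two slots end up holding `0x6ff8` and `0x6ff0`. Everything except the register value is known at breakpoint-set time, which is why this code can be generated ahead of time.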
>>
>> DWARF expressions explain how to reconstruct a variable's value using DWARF operations.
>>
>> LLDB needs the register context because local variables are often at an offset from the __Stack Base Pointer register__ or stored in one or more registers. This is why I've only focused on `DW_OP_fbreg` expressions, since I could get the offset of the variable and add it to the base pointer register to get its address. The variable address and other metadata, such as its size, its identifier and the DWARF Expression, are saved to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder` to create the `$__lldb_arg` structure.
>>
>> Since all the registers are already mapped to a structure, I should be able to support more __DWARF Operations__ in the future.
>>
>> After collecting some metrics on the __Clang__ binary, built at __-O0__, the debug info shows that the 4 most used DWARF Operations account for __99%__ of all occurrences:
>>
>> | DWARF Operation | Occurrences |
>> |-----------------|---------------------------|
>> | `DW_OP_fbreg`   | 2 114 612 |
>> | `DW_OP_reg`     | 820 548 |
>> | `DW_OP_constu`  | 267 450 |
>> | `DW_OP_addr`    | 17 370 |
>> | __Top 4__       | __3 219 980 Occurrences__ |
>> | __Total__       | __3 236 859 Occurrences__ |
>>
>> Those 4 operations are the ones that I'll support for now. To support more complex expressions, we would need to JIT-compile a DWARF expression interpreter.
>>
>> ### Unwinders
>>
>> When the program hits the injected trap instruction, the execution stops inside the injected UserExpression.
>>
>> ```cpp
>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>   * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
>>     frame #1: 0x0000000100105028
>> ```
>>
>> This part of the program should be transparent to the user.
>> To allow LLDB to elide the condition checker and the FCBT frame, the Unwinder needs to be able to identify all of the frames, up to the user's source code frame.
>>
>> The injected UserExpression already has a valid stack frame, but it doesn't have any information about its caller, the Trampoline. In order to unwind to the user's code, LLDB needs symbolic information for the trampoline.
>>
>> This information is tied to LLDB modules, created using an ObjectFile representation, the __ObjectFileTrampoline__ in our case.
>>
>> It will contain several pieces of information such as the module's name and description, but most importantly the module __Symbol Table__ that will have the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a __Text Section__ that will tell the unwinder the trampoline bounds.
>>
>> Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and creates an __Unwind Plan__ pointing to the BreakpointLocation return address. This is done taking into consideration that the trampoline will alter the memory layout by spilling registers to the stack.
>>
>> Finally, the newly created module is appended to the target image list, which allows LLDB to move between the injected code and the user code seamlessly.
>>
>> This is what the backtrace looks like after hitting the injected trap:
>>
>> ```cpp
>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>     frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
>>     frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
>>   * frame #2: 0x0000000100000f5b main`main at main.c:7:23
>> ```
>>
>> For now, LLDB selects the user frame but the goal would be to mask all the frames introduced by the Fast Conditional Breakpoint.
>>
>> A `debug-injected-condition` setting will allow stopping at the FCBT and showing all the elided frames.
>
> Regarding unwinding, I am wondering whether we really need to do anything really special. It sounds to me that if we try a little bit harder then we could make the trampoline code look very much like a signal handler, and have it be treated as such. Then the only special thing we would need to do is to hide the topmost trampoline code somewhere higher up in the presentation layer.
>
> I am imagining the trampoline code could look something like this (excuse my bad assembly, I haven't written that in a while):
>
>   pushq %rax
>   pushq %rbx
>   ...
>   leaq $SIZE_OF_REGISTER_CONTEXT(%rsp), %r10 # void *registers
>   movq %rsp, %r11 # void *args
>   subq $SIZE_OF_ARGS, %rsp
>   movq %r10, %rdi
>   movq %r11, %rsi
>   callq __build_args # __build_args(const void *registers, void *args)
>   movq %r11, %rdi
>   callq __lldb_expr # __lldb_expr(void *args)
>   test %al, %al
>   jz .Ldone
> trap_opcode:
>   int3
> .Ldone:
>   addq $SIZE_OF_ARGS, %rsp
>   pop everything, execute displaced instructions and jump back
>
> I think this trampoline is pretty similar to what you're proposing, but there are a couple of subtle differences:
> - the args structure is allocated on the stack - I already spoke about that
> - the testing of the condition happens inside the trampoline
>
> I think this second item has several advantages. Firstly, this means that when we hit the breakpoint, we only have one extra frame on the stack. So even if we don't do any extra work in the debugger to hide this stuff, we don't clutter the stack too much.
>
> Secondly, this means we can avoid the "disassemble and scan for trap opcode" step, which is kind of a hack -- after all, we generated these instructions, so we should _know_ where the trap opcode is. This way, you can emit a special symbol (trap_opcode label in the example above), that lldb can then search for, and know its location exactly.
>

I think testing the condition inside the trampoline might be very limiting:

- The variable resolution would need to be rethought to allow the condition check to happen in the trampoline.
- To be able to support different condition types (expression / thread name / thread id …), the $__lldb_expr is a better option IMO. In the future, we might also inject logging code that would only be run according to the condition.
- This feature requires at least one more frame (for your approach), that would still need to be hidden from the user. I don’t think hiding 2 frames is more work than hiding 1.

> And lastly, and this is the most important advantage IMO, is that we are in full control of the kind of unwind info we generate for the trampoline. We can generate the proper eh_frame info for this trampoline which would correctly describe the locations of the registers of the previous frame, so that lldb would automatically be able to find them and display them properly when you do for instance "register read" with the parent frame selected. Hopefully, all this would take is a couple of well-placed .cfi assembler instructions.
>
> Here, I'm imagining we could use the MC layer in llvm to do this thing, either by feeding it a raw assembler string, or by using its C++ API, whichever is easier. Then we could feed this to the normal jit together with the compiled c++ expression and it would link it all together and load it into memory.
>
>> ### Instruction Shifter (WIP)
>>
>> Because some instructions might use operands that are at an offset relative to the program counter, copying the instructions to a new location might change their meaning: LLDB needs to patch each instruction with the right offset. This is done using the `LLVM::MCInst` tool in order to detect the instructions that need to be rewritten.
>>
>> ## Risk Mitigation
>>
>> The optimization relies heavily on code injection, most of which is architecture specific.
>> Because of this, overwriting the instructions can fail depending on the breakpoint location, e.g.:
>>
>> - If the overwritten instructions contain indirection (branch instructions).
>> - If the overwritten instructions are a branch target.
>> - If there are not enough instructions to insert the branch instruction (x86_64)
>>
>> If the setup process fails to insert the Fast Conditional Breakpoint, it will fall back to the legacy behavior, and warn the user about what went wrong.
>
> Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).
>
> This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.
>
> In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.
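
To check my understanding of the stub-side fallback, here is a minimal sketch of the bookkeeping it would need: a table from breakpoint address to trampoline address that the stub consults when a trap is hit. `RedirectTable` and its methods are hypothetical names, not an existing stub interface, and the sketch assumes `pc` has already been adjusted back to the breakpoint address, since on x86 the reported program counter is one past the `int3`.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stub-side table: breakpoint address -> trampoline address.
class RedirectTable {
public:
  // Register a "special" breakpoint address and the trampoline it maps to.
  void Add(std::uint64_t bp_addr, std::uint64_t trampoline_addr) {
    m_redirects[bp_addr] = trampoline_addr;
  }

  // Called when a trap is hit at `pc`. If the address is registered,
  // rewrite `pc` to the trampoline and return true, meaning the stub can
  // resume the thread without notifying the debugger. Otherwise return
  // false and leave `pc` alone so the stop is reported as usual.
  bool HandleTrap(std::uint64_t &pc) const {
    const auto it = m_redirects.find(pc);
    if (it == m_redirects.end())
      return false;
    pc = it->second;
    return true;
  }

private:
  std::unordered_map<std::uint64_t, std::uint64_t> m_redirects;
};
```

With this in place only the single trap opcode is written into the inferior code, and the trampoline's own injected trap would remain the only stop reported to LLDB.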
>
>> One way to mitigate those limitations would be to use code instrumentation to detect if it's safe to set a Fast Conditional Breakpoint at a certain location, and hint the user to move the FCB before or after the location where it was set originally.
>>
>> ## Prototype Code
>>
>> I submitted my patches ([1](reviews.llvm.org/D66248), [2](reviews.llvm.org/D66249), [3](reviews.llvm.org/D66250)) on Phabricator with the prototype.
>>
>> ## Feedback
>>
>> Before moving forward I'd like to get the community's input. What do you think about this approach? Any feedback would be greatly appreciated!
>>
>> Thanks,
>
> As my last suggestion, I would like to ask you to consider testing as you're writing this code. This is a pretty complex machinery you're building, and it would be nice if it was possible to test pieces of it in isolation instead of just the large end-to-end kinds of tests. For example, in the "instruction shifter" machinery, it would be nice to be able to take a single instruction, execute it both in place and in a "shifted" location, and assert that the resulting register contents are identical.

Will do.

>
> regards,
> pavel

Thanks,

Ismail.