> On Aug 19, 2019, at 2:30 PM, Frédéric Riss <fr...@apple.com> wrote:
>
>> On Aug 16, 2019, at 11:13 AM, Ismail Bennani via lldb-dev <lldb-dev@lists.llvm.org> wrote:
>>
>> Hi Pavel,
>>
>> Thanks for all your feedback.
>>
>> I’ve been following the discussion closely and find your approach quite interesting.
>>
>> As Jim explained, I’m also trying to have a conditional breakpoint that is able to stop a specific thread (name or id) when the condition expression evaluates to true.
>>
>> I feel like stacking up options with your approach would imply doing more context switches. But it’s definitely a better fallback mechanism than the current one. I’ll try to make a prototype to see the performance difference for both approaches.
>>
>>> On Aug 15, 2019, at 10:10 AM, Pavel Labath <pa...@labath.sk> wrote:
>>>
>>> Hello Ismail, and welcome to LLDB. You have a very interesting (and not entirely trivial) project, and I wish you the best of luck in your work. I think this will be a very useful addition to lldb.
>>>
>>> It sounds like you have researched the problem very well, and the overall direction looks good to me. However, I do have some ideas and suggestions about possible tweaks/improvements that I would like to hear your thoughts on. Please find my comments inline.
>>>
>>> On 14/08/2019 22:52, Ismail Bennani via lldb-dev wrote:
>>>> Hi everyone,
>>>>
>>>> I’m Ismail, a compiler engineer intern at Apple. As a part of my internship, I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
>>>>
>>>> Currently, the expressions that power conditional breakpoints are lowered to LLVM IR and LLDB knows how to interpret a subset of it. If that fails, the debugger JIT-compiles the expression (compiled once, and re-run on each breakpoint hit). In both cases LLDB must collect all program state used in the condition and pass it to the expression.
>>>>
>>>> The goal of my internship project is to make conditional breakpoints faster by:
>>>>
>>>> 1. Compiling the expression ahead-of-time, when setting the breakpoint, and injecting it into the inferior memory only once.
>>>> 2. Re-routing the inferior execution flow to run the expression and check whether it needs to stop, in-process.
>>>>
>>>> This saves the cost of having to do the context switch between debugger and the inferior program (about 10 times) to compile and evaluate the condition.
>>>>
>>>> This feature is described on the [LLDB Project page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
>>>>
>>>> The goal would be to have it working for most languages and architectures supported by LLDB; however, my original implementation will be for C-based languages targeting x86_64. It will be extended to AArch64 afterwards.
>>>>
>>>> Note that the way my prototype is implemented makes it fully extensible to other languages and architectures.
>>>>
>>>> ## High Level Design
>>>>
>>>> Every time a breakpoint that holds a condition is hit, multiple context switches are needed in order to compile and evaluate the condition.
>>>>
>>>> First, the breakpoint is hit and control is given to the debugger. That's where LLDB wraps the condition expression into a UserExpression that will get compiled and injected into the program memory. Another round-trip between the inferior and LLDB is needed to run the compiled expression and extract the expression results that will tell LLDB to stop or not.
>>>>
>>>> To get rid of those context switches, we will evaluate the condition inside the program, and only stop when the condition is true. LLDB will achieve this by inserting a jump from the breakpoint address to a code section that will be allocated into the program memory.
>>>> It will save the thread state, run the condition expression, restore the thread state and then execute the copied instruction(s) before jumping back to the regular program flow. Then we only trap and return control to LLDB when the condition is true.
>>>>
>>>> ## Implementation Details
>>>>
>>>> To be able to evaluate a breakpoint condition without interacting with the debugger, LLDB changes the inferior program execution flow by overwriting the instruction at which the breakpoint was set with a branching instruction.
>>>>
>>>> The original instruction(s) are copied to a memory stub allocated in the inferior program memory called the __Fast Conditional Breakpoint Trampoline__ or __FCBT__. The FCBT will allow us to re-route the program execution flow to check the condition in-process while preserving the original program behavior. This part is critical to set up Fast Conditional Breakpoints.
>>>>
>>>> ```
>>>>       Inferior Binary                                        Trampoline
>>>>
>>>> |            .            |                          +-------------------------+
>>>> |            .            |                          |                         |
>>>> |            .            |           +------------->+   Save RegisterContext  |
>>>> |            .            |           |              |                         |
>>>> +-------------------------+           |              +-------------------------+
>>>> |                         |           |              |                         |
>>>> |       Instruction       |           |              |  Build Arguments Struct |
>>>> |                         |           |              |                         |
>>>> +-------------------------+           |              +-------------------------+
>>>> |                         +-----------+              |                         |
>>>> |   Branch to Trampoline  |                          |  Call Condition Checker |
>>>> |                         +<----------+              |                         |
>>>> +-------------------------+           |              +-------------------------+
>>>> |                         |           |              |                         |
>>>> |       Instruction       |           |              | Restore RegisterContext |
>>>> |                         |           |              |                         |
>>>> +-------------------------+           |              +-------------------------+
>>>> |            .            |           |              |                         |
>>>> |            .            |           +--------------+ Run Copied Instructions |
>>>> |            .            |                          |                         |
>>>> |            .            |                          +-------------------------+
>>>> ```
>>>>
>>>> Once the execution reaches the Trampoline, several steps need to be taken.
>>>> LLDB relies on its UserExpressions to JIT these more complex conditional expressions. However, since the execution will be handled by the debugged program, LLDB will generate some code ahead-of-time in the Trampoline that will allow the inferior to initialize the expression's argument structure.
>>>>
>>>> Generating the condition checker as well as the code to initialize the argument structure of each breakpoint hit is handled by the __BreakpointInjectedSite__ class, which builds the conditional expression for all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates variables in the `$__lldb_arg` structure.
>>>>
>>>> BreakpointInjectedSites are created in the __Process__ if the user enables the `-I | --inject-condition` flag when setting or modifying a breakpoint.
>>>>
>>>> Because the __FCBT__ is architecture specific, BreakpointInjectedSites will only be available when a target has added support for it, in the matching Architecture Plugin.
>>>>
>>>> Several parts of lldb have to be modified to implement this feature:
>>>>
>>>> - **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the related classes (Breakpoint, BreakpointLocation, BreakpointSite, BreakpointOptions)
>>>> - **Plugins**: Added ObjectFileTrampoline for the unwinding; added x86_64 ABI support (FCBT setup & safety checks)
>>>> - **Symbol**: Changed `FuncUnwinders` and `UnwindPlan` to support FCBT
>>>> - **Target**: Added BreakpointInjectedSite creation to `Process` to insert the jump to the FCBT; added the Trampoline module creation to `ABI` for the unwinding
>>>>
>>>> ### Breakpoint Option
>>>>
>>>> Since Fast Conditional Breakpoints are still under development, they will not be on by default, but rather we will provide a flag to "breakpoint set" and "breakpoint modify" to enable the feature.
>>>> Note that the end-goal is to have them as a default and only fall back to regular conditional breakpoints on unsupported architectures.
>>>>
>>>> They can be enabled using the `-I | --inject-condition` option. This option can also be enabled using the Python Scripting Bridge public API, using the `InjectCondition(bool enable)` method on an __SBBreakpoint__ or __SBBreakpointLocation__ object.
>>>>
>>>> This feature is intended to be used with condition expressions (`-c <expr> | --condition <expr>`), but also other condition types such as:
>>>>
>>>> - Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
>>>> - Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
>>>> - Thread Queue Name
>>>>
>>>> ### Trampoline
>>>>
>>>> To be able to inject the condition, we need to re-route the debugged program's execution flow. This part is handled in the __Trampoline__, a memory stub allocated in the inferior that will contain the condition check while preserving the program's original behavior.
>>>>
>>>> The trampoline is architecture specific and built by lldb. To have the condition evaluation work out-of-place, several steps need to be completed:
>>>>
>>>> 1. Save all the registers by pushing them to the stack
>>>> 2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction
>>>> 3. Check the condition by calling the injected UserExpression and execute a trap if the condition is true.
>>>> 4. Restore register context
>>>> 5. Rewrite the operands of the original copied instructions and run them
>>>>
>>>> All the values needed for the steps can be computed ahead of time, when the breakpoint is set (i.e. size of the allocation, jump address, relocation ...).
>>>>
>>>> Since the x86_64 ISA has variable instruction size, LLDB moves enough instructions into the trampoline to be able to overwrite them with a jump to the trampoline.
>>>> Also, the allocation region for the trampoline might be too far away for a single jump, so we might need to have several branch islands before reaching the trampoline (WIP).
>>>>
>>>> ### BreakpointInjectedSite
>>>>
>>>> To handle the Fast Conditional Breakpoint setup, LLDB uses __BreakpointInjectedSites__, which is a sub-class of the BreakpointSite class. BreakpointInjectedSites use different `UserExpression`s to resolve variables and inject the condition checker.
>>>>
>>>> #### Condition Checker
>>>>
>>>> Because a BreakpointSite can have multiple BreakpointLocations with different conditions, LLDB first needs to iterate over each owner of the BreakpointSite and gather all the conditions. If one of the BreakpointLocations doesn't have a condition or the condition is not set to be injected, the BreakpointInjectedSite will behave as a regular BreakpointSite.
>>>>
>>>> Once all the conditions are fetched, LLDB will create a __UserExpression__ with the injected trap instruction.
>>>>
>>>> When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap address to a BreakpointSite, to identify where to stop. To allow LLDB to catch the injected trap at runtime, it will disassemble the compiled expression and scan for the trap address. The injected trap address is then added to LLDB's __BreakpointSiteList__.
>>>>
>>>> When generated, this is what the condition checker looks like:
>>>>
>>>> ```cpp
>>>> void $__lldb_expr(void *$__lldb_arg)
>>>> {
>>>>     /*lldb_BODY_START*/
>>>>     if (condition) {
>>>>         __builtin_debugtrap();
>>>>     };
>>>>     /*lldb_BODY_END*/
>>>> }
>>>> ```
>>>>
>>>> #### Argument Builder
>>>>
>>>> The conditional expression will often refer to local variables, and the references to these variables need to be tied to the instances of them in the current frame.
>>>>
>>>> Usually the expression evaluator invokes the __Materializer__, which fetches the variables' values and fills the `$__lldb_arg` structure. But since we don't want to switch contexts, LLDB has to resolve used variables by generating code that will initialize the `$__lldb_arg` pointer, before running the condition checker.
>>>>
>>>> That's where the __Argument Builder__ comes in.
>>>>
>>>> The argument builder uses a `UtilityFunction` to generate the `$__lldb_create_args_struct` function. It is called by the Trampoline before the condition checker, in order to resolve variables used in the condition expression.
>>>>
>>>> `$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:
>>>>
>>>> 1. It takes advantage of the fact that LLDB saved all the registers to the stack and maps them into a `register_context` structure.
>>>>
>>>> ```cpp
>>>> typedef struct {
>>>>     // General Purpose Registers
>>>> } register_context;
>>>> ```
>>>>
>>>> 2. Using information from the variable resolver, it allocates a memory stub that will contain the used variable addresses.
>>>> 3. Then, it will use the register values and the collected metadata to compute the used variable address and write that into the newly allocated structure.
>>>> 4. Finally, the allocated structure is returned to the trampoline, which will pass it as an argument to the injected condition checker.
>>>
>>> I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the $__lldb_arg building code.
>>
>> Allocating the $__lldb_arg struct in the stack is on my to-do list.
>> This will change in the coming revisions.
>>
>>>
>>>> Since `$__lldb_create_args_struct` uses the same JIT Engine as the UserExpression, LLDB will parse, build and insert it in the program memory.
>>>>
>>>> #### Variable Resolver
>>>>
>>>> When creating a Fast Conditional Breakpoint, the __debug info__ tells us where the used variables are located. Using this information and the saved register context, we can generate code that will resolve the variables at runtime (__Step 3 of the Argument Builder__).
>>>>
>>>> LLDB will first get the `DeclMap` from the condition UserExpression and pull a list of the used variables. While iterating on that list, LLDB extracts each variable's __DWARF Expression__. DWARF expressions explain how to reconstruct a variable's value using DWARF operations.
>>>>
>>>> The reason why LLDB needs the register context is that local variables are often at an offset from the __Stack Base Pointer register__ or written across one or multiple registers. This is why I've only focused on `DW_OP_fbreg` expressions, since I could get the offset of the variable and add it to the base pointer register to get its address. The variable address, and other metadata such as its size, its identifier and the DWARF Expression, are saved to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder` to create the `$__lldb_arg` structure.
>>>>
>>>> Since all the registers are already mapped to a structure, I should be able to support more __DWARF Operations__ in the future.
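
The `DW_OP_fbreg` case described above can be sketched as follows. This is a minimal illustration, assuming the frame base is the saved `rbp` (typical for x86_64 code built at -O0, but not guaranteed in general). The `register_context` fields, the `ArgumentMetadata` layout and the `resolve_fbreg` helper are hypothetical names chosen for this sketch, not the prototype's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the register context the trampoline spills to the
// stack; a real layout would list every general purpose register.
struct register_context {
    std::uint64_t rbp;
    std::uint64_t rsp;
};

// Hypothetical per-variable metadata gathered when the breakpoint is set.
struct ArgumentMetadata {
    const char *name;          // identifier used in the condition
    std::int64_t fbreg_offset; // signed operand of DW_OP_fbreg
    std::uint64_t size;        // size of the variable in bytes
};

// Computes the variable's address as frame base + offset.
// Assumption: the frame base is rbp, which the DWARF frame base expression
// does not have to guarantee.
std::uint64_t resolve_fbreg(const register_context &regs,
                            const ArgumentMetadata &var) {
    assert(var.size > 0 && "a zero-sized variable suggests bad metadata");
    // Unsigned wrap-around makes adding a negative offset a subtraction.
    return regs.rbp + static_cast<std::uint64_t>(var.fbreg_offset);
}
```

For example, with a saved `rbp` of `0x7ffeefbff5a0` and an offset of -20, `resolve_fbreg` yields `0x7ffeefbff58c`.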
>>>>
>>>> After collecting some metrics on the __Clang__ binary, built at __-O0__, the debug info shows that the 4 most used DWARF Operations account for __99%__ of all occurrences:
>>>>
>>>> |DWARF Operation| Occurrences               |
>>>> |---------------|---------------------------|
>>>> |DW\_OP_fbreg   | 2 114 612                 |
>>>> |DW\_OP_reg     | 820 548                   |
>>>> |DW\_OP_constu  | 267 450                   |
>>>> |DW\_OP_addr    | 17 370                    |
>>>> | __Top 4__     | __3 219 980 Occurrences__ |
>>>> | __Total__     | __3 236 859 Occurrences__ |
>>>>
>>>> Those 4 operations are the ones that I'll support for now. To support more complex expressions, we would need to JIT-compile a DWARF expression interpreter.
>>>>
>>>> ### Unwinders
>>>>
>>>> When the program hits the injected trap instruction, the execution stops inside the injected UserExpression.
>>>>
>>>> ```cpp
>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>>>   * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
>>>>     frame #1: 0x0000000100105028
>>>> ```
>>>>
>>>> This part of the program should be transparent to the user. To allow LLDB to elide the condition checker and the FCBT frame, the Unwinder needs to be able to identify all of the frames, up to the user's source code frame.
>>>>
>>>> The injected UserExpression already has a valid stack frame, but it doesn't have any information about its caller, the Trampoline. In order to unwind to the user's code, LLDB needs symbolic information for the trampoline.
>>>>
>>>> This information is tied to LLDB modules, created using an ObjectFile representation, the __ObjectFileTrampoline__ in our case.
>>>> It will contain several pieces of information such as the module's name and description, but most importantly the module __Symbol Table__ that will have the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a __Text Section__ that will tell the unwinder the trampoline bounds.
>>>>
>>>> Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and creates an __Unwind Plan__ pointing to the BreakpointLocation return address. This is done taking into consideration that the trampoline will alter the memory layout by spilling registers to the stack.
>>>>
>>>> Finally, the newly created module is appended to the target image list, which allows LLDB to move between the injected code and the user code seamlessly.
>>>>
>>>> This is what the backtrace looks like after hitting the injected trap:
>>>>
>>>> ```cpp
>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>>>     frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
>>>>     frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
>>>>   * frame #2: 0x0000000100000f5b main`main at main.c:7:23
>>>> ```
>>>>
>>>> For now, LLDB selects the user frame but the goal would be to mask all the frames introduced by the Fast Conditional Breakpoint.
>>>>
>>>> A `debug-injected-condition` setting will allow stopping at the FCBT and showing all the elided frames.
>>>
>>> Regarding unwinding, I am wondering whether we really need to do anything special. It sounds to me that if we try a little bit harder then we could make the trampoline code look very much like a signal handler, and have it be treated as such. Then the only special thing we would need to do is to hide the topmost trampoline code somewhere higher up in the presentation layer.
>>>
>>> I am imagining the trampoline code could look something like this (excuse my bad assembly, I haven't written that in a while):
>>>
>>>   pushq %rax
>>>   pushq %rbx
>>>   ...
>>>   leaq $SIZE_OF_REGISTER_CONTEXT(%rsp), %r10 # void *registers
>>>   movq %rsp, %r11 # void *args
>>>   subq $SIZE_OF_ARGS, %rsp
>>>   movq %r10, %rdi
>>>   movq %r11, %rsi
>>>   callq __build_args # __build_args(const void *registers, void *args)
>>>   movq %r11, %rdi
>>>   callq __lldb_expr # __lldb_expr(void *args)
>>>   test %al, %al
>>>   jz .Ldone
>>> trap_opcode:
>>>   int3
>>> .Ldone:
>>>   addq $SIZE_OF_ARGS, %rsp
>>>   pop everything, execute displaced instructions and jump back
>>>
>>> I think this trampoline is pretty similar to what you're proposing, but there are a couple of subtle differences:
>>> - the args structure is allocated on the stack - I already spoke about that
>>> - the testing of the condition happens inside the trampoline
>>>
>>> I think this second item has several advantages. Firstly, this means that when we hit the breakpoint, we only have one extra frame on the stack. So even if we don't do any extra work in the debugger to hide this stuff, we don't clutter the stack too much.
>>>
>>> Secondly, this means we can avoid the "disassemble and scan for trap opcode" step, which is kind of a hack -- after all, we generated these instructions, so we should _know_ where the trap opcode is. This way, you can emit a special symbol (trap_opcode label in the example above) that lldb can then search for, and know its location exactly.
>>>
>>
>> I think testing the condition inside the trampoline might be very limiting:
>> - The variable resolution would need to be rethought to allow the condition check to happen in the trampoline.
>> - To be able to support different condition types (expression / thread name / thread id …), the $__lldb_expr is a better option IMO. In the future, we
In the future, we >> might also inject logging code that would only be run according to the >> condition. >> - This feature requires at least one more frame (for your approach), that >> would still need to be hidden to the user. I don’t think hiding 2 frames is >> more work than hiding 1. > > I might be the one misunderstanding, but I think you missed Pavel’s point. In > Pavel’s model, you still JIT the condition into __llldb_expr and pas it the > argument structure. The difference is that you don’t have the trap inside of > the JITed code, you have the JITed code return whether to stop or not and > have the trampoline hit the trap depending in the return value. I agree this > seems cleaner than scanning the output to find the trap.

Inserting the trap in the trampoline would still require fetching the $__lldb_expr's return value (architecture-specific) and writing an assembly check (compare and jump). Right now, all of this is abstracted by the UserExpression. I do agree that it’s cleaner, and will take it into consideration for my next patches.

>
> Fred
>
>>> And lastly, and this is the most important advantage IMO, we are in full control of the kind of unwind info we generate for the trampoline. We can generate the proper eh_frame info for this trampoline which would correctly describe the locations of the registers of the previous frame, so that lldb would automatically be able to find them and display them properly when you do for instance "register read" with the parent frame selected. Hopefully, all this would take is a couple of well-placed .cfi assembler instructions.
>>>
>>> Here, I'm imagining we could use the MC layer in llvm to do this thing, either by feeding it a raw assembler string, or by using its C++ API, whichever is easier. Then we could feed this to the normal JIT together with the compiled C++ expression and it would link it all together and load it into memory.
>>>
>>>> ### Instruction Shifter (WIP)
>>>>
>>>> Because some instructions might use operands that are at an offset relative to the program counter, copying the instructions to a new location might change their meaning: LLDB needs to patch each instruction with the right offset. This is done using the `llvm::MCInst` tool in order to detect the instructions that need to be rewritten.
>>>>
>>>> ## Risk Mitigation
>>>>
>>>> The optimization relies heavily on code injection, most of which is architecture specific. Because of this, overwriting the instructions can fail depending on the breakpoint location, e.g.:
>>>>
>>>> - If the overwritten instructions contain indirection (branch instructions).
>>>> - If the overwritten instructions are a branch target.
>>>> - If there are not enough instructions to insert the branch instruction (x86_64)
>>>>
>>>> If the setup process fails to insert the Fast Conditional Breakpoint, it will fall back to the legacy behavior, and warn the user about what went wrong.
>>>
>>> Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).
>>>
>>> This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.
>>>
>>> In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.
>>>
>>>> One way to mitigate those limitations would be to use code instrumentation to detect if it's safe to set a Fast Conditional Breakpoint at a certain location, and hint the user to move the FCB before or after the location where it was set originally.
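
Two of the setup checks mentioned in this thread (enough instruction bytes to hold the branch, and a trampoline close enough for a single jump) can be sketched as below. This is a hedged illustration, assuming the branch is an x86_64 `jmp rel32` (5 bytes, signed 32-bit displacement measured from the end of the jump). The helper names are hypothetical and the instruction lengths are assumed to come from a disassembler; the sketch does not cover the branch-target or indirection checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Size of an x86_64 `jmp rel32`: opcode E9 plus a 4-byte displacement.
constexpr std::size_t kJmpRel32Size = 5;

// Given the lengths of the consecutive instructions starting at the
// breakpoint address, returns how many whole instructions must be copied to
// the trampoline to free at least kJmpRel32Size bytes, or nullopt if the
// available instructions are too short.
std::optional<std::size_t>
instructions_to_displace(const std::vector<std::size_t> &lengths) {
    std::size_t bytes = 0;
    for (std::size_t n = 0; n < lengths.size(); ++n) {
        assert(lengths[n] > 0 && "instruction lengths must be positive");
        bytes += lengths[n];
        if (bytes >= kJmpRel32Size)
            return n + 1;
    }
    return std::nullopt;
}

// Returns true if a single `jmp rel32` placed at `site` can reach
// `trampoline`. Assumes both addresses fit in a signed 64-bit value, which
// holds for user-space addresses.
bool reachable_with_rel32(std::uint64_t site, std::uint64_t trampoline) {
    const std::int64_t disp =
        static_cast<std::int64_t>(trampoline) -
        static_cast<std::int64_t>(site + kJmpRel32Size);
    return disp >= std::numeric_limits<std::int32_t>::min() &&
           disp <= std::numeric_limits<std::int32_t>::max();
}
```

When `instructions_to_displace` returns nothing, or `reachable_with_rel32` is false and no branch island is available, the setup would take the fallback path described above.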
>>>>
>>>> ## Prototype Code
>>>>
>>>> I submitted my patches ([1](https://reviews.llvm.org/D66248), [2](https://reviews.llvm.org/D66249), [3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.
>>>>
>>>> ## Feedback
>>>>
>>>> Before moving forward I'd like to get the community's input. What do you think about this approach? Any feedback would be greatly appreciated!
>>>>
>>>> Thanks,
>>>
>>> As my last suggestion, I would like to ask you to consider testing as you're writing this code. This is a pretty complex piece of machinery you're building, and it would be nice if it was possible to test pieces of it in isolation instead of just the large end-to-end kinds of tests. For example, in the "instruction shifter" machinery, it would be nice to be able to take a single instruction, execute it both in place and in a "shifted" location, and assert that the resulting register contents are identical.
>>
>> Will do.
>>
>>>
>>> regards,
>>> pavel
>>
>> Thanks,
>>
>> Ismail.
>>
>> _______________________________________________
>> lldb-dev mailing list
>> lldb-dev@lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev

Ismail