Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

Ismail Bennani via lldb-dev Fri, 16 Aug 2019 11:14:49 -0700

Hi Pavel,

Thanks for all your feedback.


I’ve been following the discussion closely and find your approach quite 
interesting.

As Jim explained, I’m also trying to have a conditional breakpoint that is 
able to stop a specific thread (name or id) when the condition expression 
evaluates to true.

I feel like stacking up options with your approach would imply doing more 
context switches.
But it’s definitely a better fallback mechanism than the current one. I’ll try 
to make a prototype to see the performance difference for both approaches.


> On Aug 15, 2019, at 10:10 AM, Pavel Labath <pa...@labath.sk> wrote:
> 
> Hello Ismail, and welcome to LLDB. You have a very interesting (and not 
> entirely trivial) project, and I wish you the best of luck in your work. I 
> think this will be a very useful addition to lldb.
> 
> It sounds like you have researched the problem very well, and the overall 
> direction looks good to me. However, I do have some ideas and suggestions about 
> possible tweaks/improvements that I would like to hear your thoughts on. 
> Please find my comments inline.
> 
> On 14/08/2019 22:52, Ismail Bennani via lldb-dev wrote:
>> Hi everyone,
>> I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
>> I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
>> Currently, the expressions that power conditional breakpoints are lowered
>> to LLVM IR and LLDB knows how to interpret a subset of it. If that fails,
>> the debugger JIT-compiles the expression (compiled once, and re-run on each
>> breakpoint hit). In both cases LLDB must collect all program state used in
>> the condition and pass it to the expression.
>> The goal of my internship project is to make conditional breakpoints faster 
>> by:
>> 1. Compiling the expression ahead of time, when the breakpoint is set, and
>>    injecting it into the inferior's memory only once.
>> 2. Re-routing the inferior's execution flow to run the expression and check,
>>    in-process, whether it needs to stop.
>> This saves the cost of having to do the context switch between debugger and
>> the inferior program (about 10 times) to compile and evaluate the condition.
>> This feature is described on the [LLDB Project 
>> page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
>> The goal would be to have it working for most languages and architectures
>> supported by LLDB, however my original implementation will be for C-based
>> languages targeting x86_64. It will be extended to AArch64 afterwards.
>> Note the way my prototype is implemented makes it fully extensible for other
>> languages and architectures.
>> ## High Level Design
>> Every time a breakpoint that holds a condition is hit, multiple context
>> switches are needed in order to compile and evaluate the condition.
>> First, the breakpoint is hit and the control is given to the debugger.
>> That's where LLDB wraps the condition expression into a UserExpression that
>> will get compiled and injected into the program memory. Another round-trip
>> between the inferior and LLDB is needed to run the compiled expression
>> and extract the expression results that will tell LLDB to stop or not.
>> To get rid of those context switches, we will evaluate the condition inside
>> the program, and only stop when the condition is true. LLDB will achieve this
>> by inserting a jump from the breakpoint address to a code section that will
>> be allocated into the program memory. It will save the thread state, run the
>> condition expression, restore the thread state and then execute the copied
>> instruction(s) before jumping back to the regular program flow.
>> Then we only trap and return control to LLDB when the condition is true.
>> ## Implementation Details
>> To be able to evaluate a breakpoint condition without interacting with the
>> debugger, LLDB changes the inferior program execution flow by overwriting
>> the instruction at which the breakpoint was set with a branching instruction.
>> The original instruction(s) are copied to a memory stub allocated in the
>> inferior program memory called the __Fast Conditional Breakpoint Trampoline__
>> or __FCBT__. The FCBT will allow us to re-route the program execution flow
>> to check the condition in-process while preserving the original program
>> behavior. This part is critical to setting up Fast Conditional Breakpoints.
>> ```
>>       Inferior Binary                                     Trampoline
>> |            .            |                      +-------------------------+
>> |            .            |                      |                         |
>> |            .            |           +--------->+   Save RegisterContext  |
>> |            .            |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         |           |          |                         |
>> |       Instruction       |           |          |  Build Arguments Struct |
>> |                         |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         +-----------+          |                         |
>> |   Branch to Trampoline  |                      |  Call Condition Checker |
>> |                         +<----------+          |                         |
>> +-------------------------+           |          +-------------------------+
>> |                         |           |          |                         |
>> |       Instruction       |           |          | Restore RegisterContext |
>> |                         |           |          |                         |
>> +-------------------------+           |          +-------------------------+
>> |            .            |           |          |                         |
>> |            .            |           +----------+ Run Copied Instructions |
>> |            .            |                      |                         |
>> |            .            |                      +-------------------------+
>> ```
>> Once the execution reaches the Trampoline, several steps need to be taken.
>> LLDB relies on its UserExpressions to JIT these more complex conditional
>> expressions. However, since the execution will be handled by the debugged
>> program, LLDB will generate some code ahead-of-time in the Trampoline that
>> will allow the inferior to initialize the expression's argument structure.
>> Generating the condition checker, as well as the code that initializes
>> the argument structure on each breakpoint hit, is handled by the
>> __BreakpointInjectedSite__ class, which builds the conditional expression for
>> all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates
>> variables in the `$__lldb_arg` structure.
>> BreakpointInjectedSites are created in the __Process__ if the user enables
>> the `-I | --inject-condition` flag when setting or modifying a breakpoint.
>> Because the __FCBT__ is architecture specific, BreakpointInjectedSites will
>> only be available when a target has added support for it, in the matching
>> Architecture Plugin.
>> Several parts of lldb have to be modified to implement this feature:
>> - **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the
>>                   related class (Breakpoint, BreakpointLocation,
>>                   BreakpointSite, BreakpointOptions)
>> - **Plugins**:    Added ObjectFileTrampoline for the unwinding
>>                   Added x86_64 ABI support (FCBT setup & safety checks)
>> - **Symbol**:     Changed `FuncUnwinders` and `UnwindPlan` to support FCBT
>> - **Target**:     Added BreakpointInjectedSite creation to `Process` to
>>                   insert the jump to the FCBT
>>                   Added the Trampoline module creation to `ABI` for the
>>                   unwinding
>> ### Breakpoint Option
>> Since Fast Conditional Breakpoints are still under development, they will not
>> be on by default, but rather we will provide a flag to "breakpoint set" and
>> "breakpoint modify" to enable the feature. Note that the end-goal is to have
>> them as a default and only fall back to regular conditional breakpoints on
>> unsupported architectures.
>> They can be enabled using the `-I | --inject-condition` option. This option
>> can also be enabled using the Python Scripting Bridge public API, using the
>> `InjectCondition(bool enable)` method on an __SBBreakpoint__ or
>> __SBBreakpointLocation__ object.
>> This feature is intended to be used with condition expressions
>> (`-c <expr> | --condition <expr>`), but also with other condition types such
>> as:
>>  - Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
>>  - Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
>>  - Thread Queue Name
>> ### Trampoline
>> To be able to inject the condition, we need to re-route the debugged
>> program's execution flow. This part is handled in the __Trampoline__, a
>> memory stub allocated in the inferior that will contain the condition check
>> while preserving the program's original behavior.
>> The trampoline is architecture specific and built by lldb. To have the
>> condition evaluation work out-of-place, several steps need to be completed:
>> 1. Save all the registers by pushing them to the stack
>> 2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction
>> 3. Check the condition by calling the injected UserExpression and execute a
>>    trap if the condition is true.
>> 4. Restore register context
>> 5. Rewrite the operands of the copied original instructions and run them
>> All the values needed for the steps can be computed ahead of time, when the
>> breakpoint is set (e.g. size of the allocation, jump address, relocation
>> ...).
>> Since the x86_64 ISA has variable instruction size, LLDB moves enough
>> instructions to the trampoline to be able to overwrite them with a jump to
>> the trampoline. Also, the allocation region for the trampoline might be too
>> far away for a single jump, so we might need to have several branch islands
>> before reaching the trampoline (WIP).
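
To make the two constraints above concrete (enough bytes to overwrite, and a trampoline within reach of one jump), here is a rough sketch. It is not the prototype's code: it assumes the patch is a 5-byte `jmp rel32` (opcode `0xE9`), whose displacement is relative to the end of the jump, and `JumpDisplacement` and `InstructionsToMove` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Size of an x86_64 near jump: opcode 0xE9 followed by a 32-bit displacement.
constexpr std::size_t kJmpRel32Size = 5;

// Returns the rel32 displacement for a jump placed at `from` that lands on
// `to`, or nothing if the target is out of range (a branch island would then
// be needed). The displacement is relative to the end of the jump.
std::optional<std::int32_t> JumpDisplacement(std::uint64_t from,
                                             std::uint64_t to) {
  const std::int64_t delta =
      static_cast<std::int64_t>(to - (from + kJmpRel32Size));
  if (delta < std::numeric_limits<std::int32_t>::min() ||
      delta > std::numeric_limits<std::int32_t>::max())
    return std::nullopt;
  return static_cast<std::int32_t>(delta);
}

// Given the byte lengths of the instructions starting at the breakpoint
// address, returns how many whole instructions must be moved to the
// trampoline to make room for the jump, or 0 if there are not enough bytes.
std::size_t InstructionsToMove(const std::vector<std::size_t> &lengths) {
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    assert(lengths[i] > 0 && "instruction lengths must be non-zero");
    bytes += lengths[i];
    if (bytes >= kJmpRel32Size)
      return i + 1;
  }
  return 0;
}
```

A zero from `InstructionsToMove` or an empty result from `JumpDisplacement` would be the cases where the setup has to use a branch island or fall back to a regular breakpoint.
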
>> ### BreakpointInjectedSite
>> To handle the Fast Conditional Breakpoint setup, LLDB uses
>> __BreakpointInjectedSite__, which is a subclass of the BreakpointSite class.
>> A BreakpointInjectedSite uses different `UserExpression`s to resolve
>> variables and inject the condition checker.
>> #### Condition Checker
>> Because a BreakpointSite can have multiple BreakpointLocations with different
>> conditions, LLDB first needs to iterate over each owner of the BreakpointSite and
>> gather all the conditions. If one of the BreakpointLocations doesn't have a
>> condition or the condition is not set to be injected, the
>> BreakpointInjectedSite will behave as a regular BreakpointSite.
>> Once all the conditions are fetched, LLDB will create a __UserExpression__
>> with the injected trap instruction.
>> When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap
>> address to a BreakpointSite to identify where to stop. To allow LLDB to catch
>> the injected trap at runtime, it will disassemble the compiled expression and
>> scan for the trap address. The injected trap address is then added to LLDB's
>> __BreakpointSiteList__.
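
For illustration only, the lookup described above can be modelled as an address-keyed map. `SiteRecord` and `SiteList` are hypothetical stand-ins, not LLDB's actual `BreakpointSiteList`:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for the information kept per breakpoint site.
struct SiteRecord {
  int site_id;
  bool is_injected_trap; // true if the trap lives inside the condition checker
};

// Hypothetical, simplified model of a site list keyed by trap address.
class SiteList {
public:
  // Registers (or replaces) the site that owns the trap at `trap_addr`.
  void Add(std::uint64_t trap_addr, SiteRecord record) {
    m_sites[trap_addr] = record;
  }

  // Returns the site registered at exactly `addr`, if any.
  std::optional<SiteRecord> FindByAddress(std::uint64_t addr) const {
    const auto it = m_sites.find(addr);
    if (it == m_sites.end())
      return std::nullopt;
    return it->second;
  }

private:
  std::map<std::uint64_t, SiteRecord> m_sites;
};
```

The point is that the injected trap address found by scanning the compiled expression is registered like any other trap address, so a stop there resolves to the right site.
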
>> When generated, this is what the condition checker looks like:
>> ```cpp
>> void $__lldb_expr(void *$__lldb_arg)
>> {
>>     /*lldb_BODY_START*/
>>     if (condition) {
>>         __builtin_debugtrap();
>>     };
>>     /*lldb_BODY_END*/
>> }
>> ```
>> #### Argument Builder
>> The conditional expression will often refer to local variables, and the
>> references to these variables need to be tied to the instances of them in the
>> current frame.
>> Usually the expression evaluator invokes the __Materializer__ which fetches
>> the variable values and fills the `$__lldb_arg` structure. But since we
>> don't want to switch contexts, LLDB has to resolve used variables by
>> generating code that will initialize the `$__lldb_arg` pointer, before
>> running the condition checker.
>> That's where the __Argument Builder__ comes in.
>> The argument builder uses a `UtilityFunction` to generate the
>> `$__lldb_create_args_struct` function. It is called by the Trampoline
>> before the condition checker, in order to resolve variables used in the
>> condition expression.
>> `$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:
>> 1. It takes advantage of the fact that LLDB saved all the registers to the
>>    stack and maps them into a `register_context` structure.
>>     ```cpp
>>     typedef struct {
>>     // General Purpose Registers
>>     } register_context;
>>     ```
>> 2. Using information from the variable resolver, it allocates a memory stub
>>    that will contain the used variable addresses.
>> 3. Then, it will use the register values and the collected metadata to
>>    compute the used variable address and write that into the
>>    newly allocated structure.
>> 4. Finally the allocated structure is returned to the trampoline, which will
>>    pass it as an argument to the injected condition checker.
> I am wondering whether we really need to involve the memory allocation 
> functions here. What's the size of this address structure? I would expect it 
> to be relatively small compared to the size of the entire register context 
> that we have just saved to the stack. If that's the case, then maybe 
> we could have the trampoline allocate some space on the stack and pass that 
> as an argument to the $__lldb_arg building code.

Allocating the $__lldb_arg struct on the stack is on my to-do list. This will 
change in the coming revisions.
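
For what it's worth, a stack-based version could look roughly like the sketch below. It is not what the current patches do. It assumes the trampoline reserves the buffer and passes it in, that every used variable is described by a frame-base offset, and that the frame base is the saved `rbp`. `BuildArgs`, the `register_context` fields and the `ArgumentMetadata` layout are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical subset of the saved registers; the real layout would follow
// the order in which the trampoline pushes them.
struct register_context {
  std::uint64_t rbp;
  std::uint64_t rsp;
};

// Hypothetical per-variable metadata, computed when the breakpoint is set.
struct ArgumentMetadata {
  std::int64_t fbreg_offset; // offset from the frame base (DW_OP_fbreg)
};

// Fills `args` (caller-provided, e.g. reserved on the stack by the
// trampoline) with one address per used variable. Returns false if the
// buffer is too small. No allocator is involved.
bool BuildArgs(const register_context &regs, const ArgumentMetadata *vars,
               std::size_t num_vars, std::uint64_t *args,
               std::size_t capacity) {
  assert(vars != nullptr && args != nullptr);
  if (num_vars > capacity)
    return false;
  for (std::size_t i = 0; i < num_vars; ++i)
    args[i] = regs.rbp + static_cast<std::uint64_t>(vars[i].fbreg_offset);
  return true;
}
```

Since the number of used variables is known when the breakpoint is set, the buffer size could be a constant baked into the trampoline.
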

> 
>> Since `$__lldb_create_args_struct` uses the same JIT Engine as the
>> UserExpression, LLDB will parse, build and insert it in the program memory.
>> #### Variable Resolver
>> When creating a Fast Conditional Breakpoint, the __debug info__ tells us
>> where the used variables are located. Using this information and the saved
>> register context, we can generate code that will resolve the variables at
>> runtime (__Step 3 of the Argument Builder__).
>> LLDB will first get the `DeclMap` from the condition UserExpression and pull
>> a list of the used variables. While iterating on that list, LLDB extracts
>> each variable's __DWARF Expression__.
>> DWARF expressions explain how to reconstruct a variable's value using DWARF
>> operations.
>> LLDB needs the register context because local variables are often at an
>> offset from the __Stack Base Pointer register__ or spread across one or
>> more registers. This is why I've only focused on `DW_OP_fbreg`
>> expressions since I could get the offset of the variable and add it to the
>> base pointer register to get its address. The variable address, and other
>> metadata such as its size, its identifier and the DWARF Expression are saved
>> to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder`
>> to create the `$__lldb_arg` structure.
>> Since all the registers are already mapped to a structure, I should
>> be able to support more __DWARF Operations__ in the future.
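
As an illustration of the `DW_OP_fbreg` case described above, here is a minimal sketch of resolving such an expression against a frame base value. It only handles expressions of the exact form `DW_OP_fbreg <offset>`; the opcode value (`0x91`) and the signed LEB128 operand encoding come from the DWARF standard, while `DecodeSLEB128` and `ResolveFbreg` are hypothetical helper names and the frame base is assumed to be supplied by the caller (e.g. the saved base pointer).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::uint8_t kDW_OP_fbreg = 0x91; // DWARF opcode value

// Decodes a signed LEB128 value starting at `pos` and advances `pos`.
std::optional<std::int64_t>
DecodeSLEB128(const std::vector<std::uint8_t> &bytes, std::size_t &pos) {
  std::uint64_t result = 0;
  unsigned shift = 0;
  while (pos < bytes.size() && shift < 64) {
    const std::uint8_t byte = bytes[pos++];
    result |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
    shift += 7;
    if ((byte & 0x80) == 0) {
      // Last byte: sign-extend if its sign bit (0x40) is set.
      if (shift < 64 && (byte & 0x40))
        result |= ~std::uint64_t{0} << shift;
      return static_cast<std::int64_t>(result);
    }
  }
  return std::nullopt; // truncated or over-long encoding
}

// Resolves `DW_OP_fbreg <sleb128 offset>` to an address. Any other
// expression is reported as unsupported.
std::optional<std::uint64_t>
ResolveFbreg(const std::vector<std::uint8_t> &expr, std::uint64_t frame_base) {
  if (expr.empty() || expr[0] != kDW_OP_fbreg)
    return std::nullopt;
  std::size_t pos = 1;
  const std::optional<std::int64_t> offset = DecodeSLEB128(expr, pos);
  if (!offset || pos != expr.size())
    return std::nullopt;
  return frame_base + static_cast<std::uint64_t>(*offset);
}
```

In the injected code the offset would already be decoded at breakpoint-set time, so only the final addition would run in the inferior.
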
>> After collecting some metrics on the __Clang__ binary, built at __-O0__,
>> the debug info shows that the four most used DWARF Operations account for
>> more than __99%__ of all occurrences:
>> |DWARF Operation|         Occurrences       |
>> |---------------|---------------------------|
>> |DW\_OP_fbreg   |         2 114 612         |
>> |DW\_OP_reg     |           820 548         |
>> |DW\_OP_constu  |           267 450         |
>> |DW\_OP_addr    |            17 370         |
>> |   __Top 4__   | __3 219 980 Occurrences__ |
>> |   __Total__   | __3 236 859 Occurrences__ |
>> Those 4 operations are the ones that I'll support for now.
>> To support more complex expressions, we would need to JIT-compile
>> a DWARF expression interpreter.
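
As a quick sanity check on the table above: the four rows add up to the Top 4 figure, which is a little under 99.5% of the total. The constants below are copied from the table; the names are mine.

```cpp
#include <cassert>
#include <cstdint>

// Occurrence counts copied from the table.
constexpr std::uint64_t kFbreg = 2114612;
constexpr std::uint64_t kReg = 820548;
constexpr std::uint64_t kConstu = 267450;
constexpr std::uint64_t kAddr = 17370;
constexpr std::uint64_t kTotal = 3236859;

constexpr std::uint64_t kTop4 = kFbreg + kReg + kConstu + kAddr;
static_assert(kTop4 == 3219980, "matches the Top 4 row of the table");

// Share of the four operations in hundredths of a percent, using integer
// math (rounded down).
constexpr std::uint64_t Top4ShareBasisPoints() {
  return kTop4 * 10000 / kTotal;
}
```
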
>> ### Unwinders
>> When the program hits the injected trap instruction, the execution stops
>> inside the injected UserExpression.
>> ```cpp
>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>   * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
>>     frame #1: 0x0000000100105028
>> ```
>> This part of the program should be transparent to the user. To allow LLDB to
>> elide the condition checker and the FCBT frame, the Unwinder needs to be
>> able to identify all of the frames, up to the user's source code frame.
>> The injected UserExpression already has a valid stack frame, but it doesn't
>> have any information about its caller, the Trampoline. In order to unwind to
>> the user's code, LLDB needs symbolic information for the trampoline.
>> This information is tied to LLDB modules, created using an ObjectFile
>> representation, the __ObjectFileTrampoline__ in our case.
>> It will contain several pieces of information such as the module's name and
>> description, but most importantly the module __Symbol Table__ that will have
>> the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a
>> __Text Section__ that will tell the unwinder the trampoline bounds.
>> Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and
>> creates an __Unwind Plan__ pointing to the BreakpointLocation return address.
>> This is done taking into consideration that the trampoline will alter the
>> memory layout by spilling registers to the stack.
>> Finally, the newly created module is appended to the target image list, which
>> allows LLDB to move between the injected code and the user code seamlessly.
>> This is what the backtrace looks like after hitting the injected trap:
>> ```cpp
>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
>>     frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
>>     frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
>>   * frame #2: 0x0000000100000f5b main`main at main.c:7:23
>> ```
>> For now, LLDB selects the user frame but the goal would be to mask all the
>> frames introduced by the Fast Conditional Breakpoint.
>> A `debug-injected-condition` setting will allow stopping at the FCBT and
>> showing all the elided frames.
> 
> Regarding unwinding, I am wondering whether we really need to do anything 
> special. It sounds to me that if we try a little bit harder then we 
> could make the trampoline code look very much like a signal handler, and have 
> it be treated as such. Then the only special thing we would need to do is to 
> hide the topmost trampoline code somewhere higher up in the presentation 
> layer.
> 
> I am imagining the trampoline code could look something like this (excuse my 
> bad assembly, I haven't written that in a while):
> pushq %rax
> pushq %rbx
> ...
> leaq $SIZE_OF_REGISTER_CONTEXT(%rsp), %r10 # void *registers
> movq %rsp, %r11 # void *args
> subq $SIZE_OF_ARGS, %rsp
> movq %r10, %rdi
> movq %r11, %rsi
> callq __build_args # __build_args(const void *registers, void *args)
> movq %r11, %rdi
> callq __lldb_expr # __lldb_expr(void *args)
> test %al, %al
> jz .Ldone
> trap_opcode:
> int3
> .Ldone:
> addq $SIZE_OF_ARGS, %rsp
> pop everything, execute displaced instructions and jump back
> 
> I think this trampoline is pretty similar to what you're proposing, but there 
> are a couple of subtle differences:
> - the args structure is allocated on the stack - I already spoke about that
> - the testing of the condition happens inside the trampoline
> I think this second item has several advantages. Firstly, this means that when 
> we hit the breakpoint, we only have one extra frame on the stack. So even if we 
> don't do any extra work in the debugger to hide this stuff, we don't clutter 
> the stack too much.
> 
> Secondly, this means we can avoid the "disassemble and scan for trap opcode" 
> step, which is kind of a hack -- after all, we generated these instructions, 
> so we should _know_ where the trap opcode is. This way, you can emit a 
> special symbol (trap_opcode label in the example above), that lldb can then 
> search for, and know its location exactly.
> 

I think testing the condition inside the trampoline might be very limiting:
- The variable resolution would need to be rethought to allow the condition 
check to happen in the trampoline.
- To be able to support different condition types (expression / thread name / 
thread id …), the $__lldb_expr is a better option IMO. In the future, we might 
also inject logging code that would only be run according to the condition.
- This feature requires at least one more frame (for your approach), that would 
still need to be hidden from the user. I don’t think hiding 2 frames is more 
work than hiding 1.

> And lastly, and this is the most important advantage IMO, is that we are in 
> full control of the kind of unwind info we generate for the trampoline. We 
> can generate the proper eh_frame info for this trampoline which would 
> correctly describe the locations of the registers of the previous frame, so 
> that lldb would automatically be able to find them and display them properly 
> when you do for instance "register read" with the parent frame selected. 
> Hopefully, all this would take is a couple of well-placed .cfi assembler 
> instructions.
> 
> Here, I'm imagining we could use the MC layer in llvm to do this, 
> either by feeding it a raw assembler string, or by using its C++ API, 
> whichever is easier. Then we could feed this to the normal jit together with 
> the compiled c++ expression and it would link it all together and load it 
> into memory.
> 
>> ### Instruction Shifter (WIP)
>> Because some instructions might use operands that are at an offset relative
>> to the program counter, copying the instructions to a new location might
>> change their meaning, so LLDB needs to patch each such instruction with the
>> right offset.
>> This is done using `llvm::MCInst` in order to detect the instructions
>> that need to be rewritten.
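
For the common x86_64 case of a RIP-relative operand with a 32-bit displacement, the patch boils down to the arithmetic sketched below. This is only an illustration under that assumption, with a hypothetical helper name (`RelocateDisp32`); it does not cover relative branches with 8-bit displacements or any other encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Recomputes the 32-bit displacement of a RIP-relative operand when an
// instruction is copied, unchanged in length, from `old_addr` to `new_addr`.
// RIP-relative operands are resolved against the address of the *next*
// instruction, so the instruction length cancels out:
//   target   = old_addr + length + old_disp
//   new_disp = target - (new_addr + length) = old_disp + old_addr - new_addr
// Returns nothing if the target is no longer reachable with 32 bits.
std::optional<std::int32_t> RelocateDisp32(std::uint64_t old_addr,
                                           std::uint64_t new_addr,
                                           std::int32_t old_disp) {
  const std::int64_t moved = static_cast<std::int64_t>(old_addr - new_addr);
  // A move of more than 2^32 bytes can never be compensated by a 32-bit
  // displacement; rejecting it early also keeps the sum below from
  // overflowing.
  constexpr std::int64_t kMaxMove = std::int64_t{1} << 32;
  if (moved > kMaxMove || moved < -kMaxMove)
    return std::nullopt;
  const std::int64_t new_disp = static_cast<std::int64_t>(old_disp) + moved;
  if (new_disp < std::numeric_limits<std::int32_t>::min() ||
      new_disp > std::numeric_limits<std::int32_t>::max())
    return std::nullopt;
  return static_cast<std::int32_t>(new_disp);
}
```

An empty result would be one more reason to refuse the Fast Conditional Breakpoint at that location and fall back to the regular mechanism.
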
>> ## Risk Mitigation
>> The optimization relies heavily on code injection, most of which is
>> architecture specific. Because of this, overwriting the instructions
>> can fail depending on the breakpoint location, e.g.:
>> - If the overwritten instructions contain indirection (branch instructions).
>> - If the overwritten instructions are a branch target.
>> - If there are not enough instructions to insert the branch instruction
>>   (x86_64).
>> If the setup process fails to insert the Fast Conditional Breakpoint, it will
>> fall back to the legacy behavior, and warn the user about what went wrong.
> 
> Another possible fallback behavior would be to still do the whole trampoline 
> stuff and everything, but avoid needing to overwrite opcodes in the target by 
> having the gdb stub do this work for us. So, we could teach the stub that 
> some addresses are special and when a breakpoint at this location gets hit, 
> it should automatically change the program counter to some other location 
> (the address of our trampoline) and let the program continue. This way, you 
> would only need to insert a single trap instruction, which is what we know 
> how to do already. And I believe this would still bring a major speedup 
> compared to the current implementation (particularly if the target is remote 
> on a high-latency link, but even in the case of local debugging, I would 
> expect maybe an order of magnitude faster processing of conditional 
> breakpoints).
> 
> This would be kind of similar to the "cond_list" in the gdb-remote 
> "Z0;addr,kind;cond_list" packet 
> <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.
> 
> In fact, given that this "instruction shifting" is the most unpredictable 
> part of this whole architecture (because we don't control the contents of the 
> inferior instructions), it might make sense to do this approach first, and 
> then do the instruction shifting as a follow-up.
> 
>> One way to mitigate those limitations would be to use code instrumentation
>> to detect if it's safe to set a Fast Conditional Breakpoint at a certain
>> location, and hint the user to move the FCB before or after the location
>> where it was set originally.
>> ## Prototype Code
>> I submitted my patches ([1](https://reviews.llvm.org/D66248),
>> [2](https://reviews.llvm.org/D66249),
>> [3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.
>> ## Feedback
>> Before moving forward I'd like to get the community's input. What do you
>> think about this approach? Any feedback would be greatly appreciated!
>> Thanks,
> 
> As my last suggestion, I would like to ask you to consider testing as you're 
> writing this code. This is pretty complex machinery you're building, and it 
> would be nice if it was possible to test pieces of it in isolation instead of 
> just the large end-to-end kinds of tests. For example, in the "instruction 
> shifter" machinery, it would be nice to be able to take a single instruction, 
> execute it both in place, and in a "shifted" location, and assert that the 
> resulting register contents are identical.

Will do.

> 
> regards,
> pavel

Thanks,

Ismail.

_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
