Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

2019-08-15 Thread Pavel Labath via lldb-dev
Hello Ismail, and welcome to LLDB. You have a very interesting (and not 
entirely trivial) project, and I wish you the best of luck in your work. 
I think this will be a very useful addition to lldb.


It sounds like you have researched the problem very well, and the 
overall direction looks good to me. However, I do have some ideas and 
suggestions about possible tweaks/improvements that I would like to hear 
your thoughts on. Please find my comments inline.


On 14/08/2019 22:52, Ismail Bennani via lldb-dev wrote:

Hi everyone,

I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
I'm adding Fast Conditional Breakpoints to LLDB, using code patching.

Currently, the expressions that power conditional breakpoints are lowered
to LLVM IR and LLDB knows how to interpret a subset of it. If that fails,
the debugger JIT-compiles the expression (compiled once, and re-run on each
breakpoint hit). In both cases LLDB must collect all program state used in
the condition and pass it to the expression.

The goal of my internship project is to make conditional breakpoints faster by:

1. Compiling the expression ahead-of-time, when the breakpoint is set, and
injecting it into the inferior memory only once.
2. Re-routing the inferior execution flow to run the expression and check whether
it needs to stop, in-process.

This saves the cost of the context switches between the debugger and the
inferior program (about 10 of them) needed to compile and evaluate the condition.

This feature is described on the [LLDB Project 
page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
The goal would be to have it working for most languages and architectures
supported by LLDB, however my original implementation will be for C-based
languages targeting x86_64. It will be extended to AArch64 afterwards.

Note that my prototype is implemented in a way that makes it fully extensible
to other languages and architectures.

## High Level Design

Every time a breakpoint that holds a condition is hit, multiple context
switches are needed in order to compile and evaluate the condition.

First, the breakpoint is hit and the control is given to the debugger.
That's where LLDB wraps the condition expression into a UserExpression that
will get compiled and injected into the program memory. Another round-trip
between the inferior and LLDB is needed to run the compiled expression
and extract the expression results that will tell LLDB to stop or not.

To get rid of those context switches, we will evaluate the condition inside
the program, and only stop when the condition is true. LLDB will achieve this
by inserting a jump from the breakpoint address to a code section that will
be allocated into the program memory. It will save the thread state, run the
condition expression, restore the thread state and then execute the copied
instruction(s) before jumping back to the regular program flow.
Then we only trap and return control to LLDB when the condition is true.

## Implementation Details

To be able to evaluate a breakpoint condition without interacting with the
debugger, LLDB changes the inferior program execution flow by overwriting
the instruction at which the breakpoint was set with a branching instruction.

The original instruction(s) are copied to a memory stub allocated in the
inferior program memory, called the __Fast Conditional Breakpoint Trampoline__
or __FCBT__. The FCBT allows us to re-route the program execution flow to
check the condition in-process while preserving the original program behavior.
This part is critical to setting up Fast Conditional Breakpoints.

```
        Inferior Binary                        Trampoline

  |           .           |          +-------------------------+
  |           .           |     +--->|   Save RegisterContext  |
  |           .           |     |    +-------------------------+
  +-----------------------+     |    |                         |
  |      Instruction      |     |    |  Build Arguments Struct |
  +-----------------------+     |    +-------------------------+
  |  Branch to Trampoline +-----+    |                         |
  |                       |<----+    |  Call Condition Checker |
  +-----------------------+     |    +-------------------------+
  |      Instruction      |     |    |                         |
  +-----------------------+     +----+ Restore RegisterContext |
  |           .           |          +-------------------------+
```

Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

2019-08-15 Thread Jim Ingham via lldb-dev
Thanks for your great comments.  A few replies...

> On Aug 15, 2019, at 10:10 AM, Pavel Labath via lldb-dev wrote:
> [full message quoted above]

Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

2019-08-15 Thread Pavel Labath via lldb-dev

On 15/08/2019 20:15, Jim Ingham wrote:

Thanks for your great comments.  A few replies...


On Aug 15, 2019, at 10:10 AM, Pavel Labath via lldb-dev 
 wrote:
I am wondering whether we really need to involve the memory allocation 
functions here. What's the size of this address structure? I would expect it to 
be relatively small compared to the size of the entire register context that we 
have just saved to the stack. If that's the case, then maybe we could 
have the trampoline allocate some space on the stack and pass that as an 
argument to the $__lldb_arg building code.


You have no guarantee that only one thread is running this code at any given 
time.  So you would have to put a mutex in the condition to guard the use of 
this stack allocation.  That's not impossible but it means you're changing 
threading behavior.  Calling the system allocator might take a lock but a lot 
of allocation systems can hand out small allocations without locking, so it 
might be simpler to just take advantage of that.


I am sorry, but I am confused. I am suggesting we take a slice of the 
stack from the thread that happened to hit that breakpoint, and use that 
memory for the __lldb_arg structure for the purpose of evaluating the 
condition on that very thread. If two threads hit the breakpoint 
simultaneously, then we just allocate two chunks of memory on their 
respective stacks. Or am I misunderstanding something about how this 
structure is supposed to be used?





Another possible fallback behavior would be to still do the whole trampoline 
stuff and everything, but avoid needing to overwrite opcodes in the target by 
having the gdb stub do this work for us. So, we could teach the stub that some 
addresses are special and when a breakpoint at this location gets hit, it 
should automatically change the program counter to some other location (the 
address of our trampoline) and let the program continue. This way, you would 
only need to insert a single trap instruction, which is what we know how to do 
already. And I believe this would still bring a major speedup compared to the 
current implementation (particularly if the target is remote on a high-latency 
link, but even in the case of local debugging, I would expect maybe an order of 
magnitude faster processing of conditional breakpoints).


This is a clever idea.  It would also mean that you wouldn't have to figure out 
how to do register saves and restores in code, since debugserver already knows 
how to do that, and once you are stopped it is probably not much slower to have 
debugserver do that job than have the trampoline do it.  It also has the 
advantage that you don't need to deal with the problem where the space that you 
are able to allocate for the trampoline code is too far away from the code you 
are patching for a simple jump.  It would certainly be worth seeing how much 
faster this makes conditions.


I actually thought we would use the exact same trampoline that would be 
used for the full solution (so it would do the register saves, restores, 
etc), and the stub would only help us to avoid trampling over a long 
sequence of instructions. But other solutions are certainly possible too...




Unless I'm missing something you would still need two traps. One in the main instruction stream and one to stop when the condition is true. But maybe you meant "a single kind of insertion - a trap", not "a single trap instruction"?


I meant "a single trap in the application's instruction stream". The counts 
of traps in the code that we generate aren't that important, as we can 
do what we want there. But if we insert just a single trap opcode, then 
we are guaranteed to overwrite only one instruction, which means the 
whole "are we overwriting a jump target" discussion becomes moot. OTOH, 
if we write a full jump code then we can overwrite a *lot* of 
instructions -- the shortest sequence that can jump anywhere in the 
address space I can think of is something like pushq %rax; movabsq 
$WHATEVER, %rax; jmpq *%rax. Something as big as that is fairly likely 
to overwrite a jump target.



...




This would be kind of similar to the "cond_list" in the gdb-remote 
"Z0;addr,kind;cond_list" packet.

In fact, given that this "instruction shifting" is the most unpredictable part 
of this whole architecture (because we don't control the contents of the inferior 
instructions), it might make sense to do this approach first, and then do the instruction 
shifting as a follow-up.


One side-benefit we are trying to get out of the instruction shifting approach is not 
having to stop all threads when inserting breakpoints as often as possible.  Since we can 
inject thread ID tests into the condition as well, doing the instruction shifting would 
mean you could specify thread-specific breakpoints, and then ONLY the threads that match 
the thread specification would ever have to be stopped.  You could also hav

Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

2019-08-15 Thread Jim Ingham via lldb-dev


> On Aug 15, 2019, at 11:55 AM, Pavel Labath  wrote:
> 
> On 15/08/2019 20:15, Jim Ingham wrote:
>> Thanks for your great comments.  A few replies...
>>> On Aug 15, 2019, at 10:10 AM, Pavel Labath via lldb-dev 
>>>  wrote:
>>> I am wondering whether we really need to involve the memory allocation 
>>> functions here. What's the size of this address structure? I would expect 
>>> it to be relatively small compared to the size of the entire register 
>>> context that we have just saved to the stack. If that's the case, 
>>> then maybe we could have the trampoline allocate some space on the stack 
>>> and pass that as an argument to the $__lldb_arg building code.
>> You have no guarantee that only one thread is running this code at any given 
>> time.  So you would have to put a mutex in the condition to guard the use of 
>> this stack allocation.  That's not impossible but it means you're changing 
>> threading behavior.  Calling the system allocator might take a lock but a 
>> lot of allocation systems can hand out small allocations without locking, so 
>> it might be simpler to just take advantage of that.
> 
> I am sorry, but I am confused. I am suggesting we take a slice of the stack 
> from the thread that happened to hit that breakpoint, and use that memory for 
> the __lldb_arg structure for the purpose of evaluating the condition on that 
> very thread. If two threads hit the breakpoint simultaneously, then we just 
> allocate two chunks of memory on their respective stacks. Or am I 
> misunderstanding something about how this structure is supposed to be used?
> 

Right, I missed that you were suggesting this on the stack - somehow I thought 
you meant in the allocation we made when setting up the condition.

> 
>>> 
>>> Another possible fallback behavior would be to still do the whole 
>>> trampoline stuff and everything, but avoid needing to overwrite opcodes in 
>>> the target by having the gdb stub do this work for us. So, we could teach 
>>> the stub that some addresses are special and when a breakpoint at this 
>>> location gets hit, it should automatically change the program counter to 
>>> some other location (the address of our trampoline) and let the program 
>>> continue. This way, you would only need to insert a single trap 
>>> instruction, which is what we know how to do already. And I believe this 
>>> would still bring a major speedup compared to the current implementation 
>>> (particularly if the target is remote on a high-latency link, but even in 
>>> the case of local debugging, I would expect maybe an order of magnitude 
>>> faster processing of conditional breakpoints).
>> This is a clever idea.  It would also mean that you wouldn't have to figure 
>> out how to do register saves and restores in code, since debugserver already 
>> knows how to do that, and once you are stopped it is probably not much 
>> slower to have debugserver do that job than have the trampoline do it.  It 
>> also has the advantage that you don't need to deal with the problem where 
>> the space that you are able to allocate for the trampoline code is too far 
>> away from the code you are patching for a simple jump.  It would certainly 
>> be worth seeing how much faster this makes conditions.
> 
> I actually thought we would use the exact same trampoline that would be used 
> for the full solution (so it would do the register saves, restores, etc), and 
> the stub would only help us to avoid trampling over a long sequence of 
> instructions. But other solutions are certainly possible too...

Sure, I was thinking that if you were allowing debugserver to intervene, you 
could just inject something like:

void handler(void) {
   args = arg_builder();
   condition_function(args);
   trap;
}

and let debugserver do:

save registers
set pc to handler
continue
hit a trap
decide whether this was the condition function trap or the handler trap
restore registers
set pc back to trap instruction
return control to lldb or single step over the instruction and continue

That way you wouldn't have to mess with injecting the trampoline, unwinding, 
etc.  This should be pretty easy to hack up, and see how much faster this was.


Re: [lldb-dev] [RFC] Fast Conditional Breakpoints (FCB)

2019-08-15 Thread Ismail Bennani via lldb-dev
I built Clang (and LLVM) in Release Mode with Debug Info (-O2),
and got these results:

| DWARF Operation      | Occurrences   |
|----------------------|---------------|
| DW\_OP\_deref        | 1,570         |
| DW\_OP\_const        | 3,791         |
| DW\_OP\_addr         | 9,528         |
| DW\_OP\_lit          | 62,826        |
| DW\_OP\_fbreg        | 205,382       |
| DW\_OP\_piece        | 242,888       |
| DW\_OP\_stack\_value | 992,261       |
| DW\_OP\_breg         | 1,006,070     |
| **Total**            | **7,700,147** |


I could technically implement the logic to support DW_OP_reg, DW_OP_breg,
and DW_OP_stack_value fairly easily (together over 90% of all ops).

However, DW_OP_piece is a more complex operation since it combines
several other operations, and would require more work.

This would also imply that there would be two DWARF expression interpreters in
LLDB, hence twice as much code to maintain … I’ll try to see if I can
use the existing interpreter for this feature.

Ismail

> On Aug 14, 2019, at 3:42 PM, Finkel, Hal J.  wrote:
> 
> 
> On 8/14/19 3:52 PM, Ismail Bennani via lldb-dev wrote:
>> Hi everyone,
>> 
>> I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
>> I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
>> 
>> ...
>> 
>> Since all the registers are already mapped to a structure, I should
>> be able to support more __DWARF Operations__ in the future.
>> 
>> After collecting some metrics on the __Clang__ binary, built at __-O0__,
>> the debug info shows that __99%__ of the most used DWARF Operations are :
>> 
>> |DWARF Operation| Occurrences   |
>> |---------------|---------------|
>> |DW\_OP_fbreg   | 2,114,612     |
>> |DW\_OP_reg     |   820,548     |
>> |DW\_OP_constu  |   267,450     |
>> |DW\_OP_addr    |    17,370     |
>> 
>> |   __Top 4__   | __3,219,980 Occurrences__ |
>> |---------------|---------------------------|
>> |   __Total__   | __3,236,859 Occurrences__ |
>> 
>> Those 4 operations are the ones that I'll support for now.
>> To support more complex expressions, we would need to JIT-compile
>> a DWARF expression interpreter.
> 
> 
> First, this all sounds really useful.
> 
> Out of curiosity, how do these statistics change if you compile Clang 
> with -O1? Many of my users need to debug slightly-optimized code.
> 
>  -Hal
> 
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
> 

___
lldb-dev mailing list
lldb-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev