[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Enrico Loparco via Phabricator via lldb-commits Fri, 06 Jan 2023 01:50:26 -0800

eloparco added inline comments.


================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+  auto start_addr = base_addr - bytes_offset;
+  const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
clayborg wrote:
> eloparco wrote:
> > clayborg wrote:
> > > This will work if the min and max opcode byte size are the same, like for 
> > > arm64, the min and max are 4. This won't work for x86 or arm32 in thumb 
> > > mode. So when backing up, we need to do an address lookup on the address 
> > > we think we want to go to, and then adjust the starting address 
> > > accordingly. Something like:
> > > 
> > > ```
> > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > ```
> > > now we have a section offset address that can tell us more about what it 
> > > is. We can find the SBFunction or SBSymbol for this address and use those 
> > > to find the right instructions. This will allow us to correctly 
> > > disassemble code bytes. 
> > > 
> > > We can also look at the section that the memory comes from and see what 
> > > the section contains. If the section is data, then emit something like:
> > > ```
> > > 0x00001000 .byte 0x23
> > > 0x00001001 .byte 0x34
> > > ...
> > > ```
> > > To find the section type we can do:
> > > ```
> > > SBSection section = start_sbaddr.GetSection();
> > > if (section.IsValid() && section.GetSectionType() == 
> > > lldb::eSectionTypeCode) {
> > >  // Disassemble from a valid boundary
> > > } else {
> > >   // Emit a byte or long at a time with ".byte 0xXX" or other ASM 
> > > directive for binary data
> > > }
> > > ```
> > > We need to ensure we start disassembling on the correct instruction 
> > > boundary as well as our math for "start_addr" might be in between actual 
> > > opcode boundaries. If we are in a lldb::eSectionTypeCode, then we know we 
> > > have instructions, and if we are not, then we can emit ".byte" or other 
> > > binary data directives. So if we do have lldb::eSectionTypeCode as our 
> > > section type, then we should have a function or symbol, and we can get 
> > > instructions from those objects easily:
> > > 
> > > ```
> > > if (section.IsValid() && section.GetSectionType() == 
> > > lldb::eSectionTypeCode) {
> > >  lldb::SBInstructionList instructions;
> > >  lldb::SBFunction function = start_sbaddr.GetFunction();
> > >  if (function.IsValid()) {
> > >     instructions = function.GetInstructions(g_vsc.target);
> > >  } else {
> > >     symbol = start_sbaddr.GetSymbol();
> > >     if (symbol.IsValid())
> > >       instructions = symbol.GetInstructions(g_vsc.target);
> > > }
> > > const size_t num_instrs = instructions.GetSize();
> > > if (num_instrs > 0) {
> > >   // we found instructions from a function or symbol and we need to 
> > >   // find the matching instruction that we want to start from by iterating
> > >   // over the instructions and finding the right address
> > >   size_t matching_idx = num_instrs; // Invalid index
> > >   for (size_t i=0; i<num_instrs; ++i) {
> > >     lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > >     if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) {
> > >       matching_idx = i;
> > >       break;
> > >     }
> > >   }
> > >   if (matching_idx < num_instrs) {
> > >     // Now we can print the instructions from [matching_idx, num_instrs)
> > >     // then we need to repeat the search for the next function or symbol. 
> > >     // note there may be bytes between functions or symbols which we can 
> > > disassemble
> > >     // by calling _get_instructions_from_memory(...) but we must find the 
> > > next
> > >     // symbol or function boundary and get back on track
> > >   }
> > >   
> > > ```
> > Sorry, I should have provided a proper explanation.
> > 
> > I use the maximum instruction size as the "worst case". Basically, I need 
> > to read a portion of memory but I do not know the start address and the 
> > size. For the start address, if I want to read N instructions before 
> > `base_addr` I need to read at least starting from `base_addr - N * 
> > max_instruction_size`: if all instructions are of size 
> > `max_instruction_size` I will read exactly N instructions; otherwise I will 
> > read more than N instructions and prune the additional ones afterwards. 
> > Same for applies for the size.
> > 
> > Since `start_addr` is based on a "worst case", it may be an address in the 
> > middle of an instruction. In that case, that first instruction will be 
> > misinterpreted, but I think that is negligible.
> > 
> > The logic is similar to what is implemented in other VS Code extensions, 
> > like 
> > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > 
> > Does it make sense?
> The issue is, you might end up backing up by N bytes, and you might not end 
> up on an opcode boundary. Lets say you have x86 disassembly like:
> 
> ```
>     0x100002e37 <+183>: 48 8b 4d f8              movq   -0x8(%rbp), %rcx
>     0x100002e3b <+187>: 48 39 c8                 cmpq   %rcx, %rax
>     0x100002e3e <+190>: 0f 85 09 00 00 00        jne    0x100002e4d           
>     ; <+205> at main.cpp
>     0x100002e44 <+196>: 8b 45 94                 movl   -0x6c(%rbp), %eax
>     0x100002e47 <+199>: 48 83 c4 70              addq   $0x70, %rsp
>     0x100002e4b <+203>: 5d                       popq   %rbp
>     0x100002e4c <+204>: c3                       retq   
>     0x100002e4d <+205>: e8 7e 0f 00 00           callq  0x100003dd0           
>     ; symbol stub for: __stack_chk_fail
>     0x100002e52 <+210>: 0f 0b                    ud2    
> ```
> 
> Let's say you started with the address 0x100002e4c, and backed up by the max 
> opcode size of 15 for x86_64, that would be 0x100002e3d. You would start 
> disassembling on a non opcode boundary as this is the last byte of the 3 byte 
> opcode at 0x100002e3b (0xc8). And this would disassembly incorrectly. So we 
> need  to make use of the functions or symbol boundaries to ensure we are 
> disassembling correctly. If we have no function or symbol, we can do our 
> best. But as you can see we would get things completely wrong in this case 
> and we need to fix this as detailed.
Actually, I was mislead by the fact that so far I tried on both:
- my ARM machine, with instructions of fixed size (4)
- with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) 
disassembling to WASM, where, when the start address is in the middle of an 
instruction, only that first instruction is misinterpreted

Without going into sections and symbols as you were proposing, the solution can 
be as easy as done in this VS Code Extension: 
https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.

I'll update the code and put some additional comments to clarify the logic. Let 
me know what you think.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140358/new/

https://reviews.llvm.org/D140358

_______________________________________________
lldb-commits mailing list
lldb-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits

[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Reply via email to