[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Enrico Loparco via Phabricator via lldb-commits Wed, 11 Jan 2023 16:46:16 -0800

eloparco added inline comments.


================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+  auto start_addr = base_addr - bytes_offset;
+  const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
eloparco wrote:
> clayborg wrote:
> > eloparco wrote:
> > > clayborg wrote:
> > > > This will work if the min and max opcode byte size are the same, like 
> > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in 
> > > > thumb mode. So when backing up, we need to do an address lookup on the 
> > > > address we think we want to go to, and then adjust the starting address 
> > > > accordingly. Something like:
> > > > 
> > > > ```
> > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > > ```
> > > > now we have a section offset address that can tell us more about what 
> > > > it is. We can find the SBFunction or SBSymbol for this address and use 
> > > > those to find the right instructions. This will allow us to correctly 
> > > > disassemble code bytes. 
> > > > 
> > > > We can also look at the section that the memory comes from and see what 
> > > > the section contains. If the section is data, then emit something like:
> > > > ```
> > > > 0x00001000 .byte 0x23
> > > > 0x00001001 .byte 0x34
> > > > ...
> > > > ```
> > > > To find the section type we can do:
> > > > ```
> > > > SBSection section = start_sbaddr.GetSection();
> > > > if (section.IsValid() && section.GetSectionType() == 
> > > > lldb::eSectionTypeCode) {
> > > >  // Disassemble from a valid boundary
> > > > } else {
> > > >   // Emit a byte or long at a time with ".byte 0xXX" or other ASM 
> > > > directive for binary data
> > > > }
> > > > ```
> > > > We need to ensure we start disassembling on the correct instruction 
> > > > boundary as well as our math for "start_addr" might be in between 
> > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then 
> > > > we know we have instructions, and if we are not, then we can emit 
> > > > ".byte" or other binary data directives. So if we do have 
> > > > lldb::eSectionTypeCode as our section type, then we should have a 
> > > > function or symbol, and we can get instructions from those objects 
> > > > easily:
> > > > 
> > > > ```
> > > > if (section.IsValid() && section.GetSectionType() == 
> > > > lldb::eSectionTypeCode) {
> > > >  lldb::SBInstructionList instructions;
> > > >  lldb::SBFunction function = start_sbaddr.GetFunction();
> > > >  if (function.IsValid()) {
> > > >     instructions = function.GetInstructions(g_vsc.target);
> > > >  } else {
> > > >     symbol = start_sbaddr.GetSymbol();
> > > >     if (symbol.IsValid())
> > > >       instructions = symbol.GetInstructions(g_vsc.target);
> > > > }
> > > > const size_t num_instrs = instructions.GetSize();
> > > > if (num_instrs > 0) {
> > > >   // we found instructions from a function or symbol and we need to 
> > > >   // find the matching instruction that we want to start from by 
> > > > iterating
> > > >   // over the instructions and finding the right address
> > > >   size_t matching_idx = num_instrs; // Invalid index
> > > >   for (size_t i=0; i<num_instrs; ++i) {
> > > >     lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > > >     if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) {
> > > >       matching_idx = i;
> > > >       break;
> > > >     }
> > > >   }
> > > >   if (matching_idx < num_instrs) {
> > > >     // Now we can print the instructions from [matching_idx, num_instrs)
> > > >     // then we need to repeat the search for the next function or 
> > > > symbol. 
> > > >     // note there may be bytes between functions or symbols which we 
> > > > can disassemble
> > > >     // by calling _get_instructions_from_memory(...) but we must find 
> > > > the next
> > > >     // symbol or function boundary and get back on track
> > > >   }
> > > >   
> > > > ```
> > > Sorry, I should have provided a proper explanation.
> > > 
> > > I use the maximum instruction size as the "worst case". Basically, I need 
> > > to read a portion of memory but I do not know the start address and the 
> > > size. For the start address, if I want to read N instructions before 
> > > `base_addr` I need to read at least starting from `base_addr - N * 
> > > max_instruction_size`: if all instructions are of size 
> > > `max_instruction_size` I will read exactly N instructions; otherwise I 
> > > will read more than N instructions and prune the additional ones 
> > > afterwards. Same for applies for the size.
> > > 
> > > Since `start_addr` is based on a "worst case", it may be an address in 
> > > the middle of an instruction. In that case, that first instruction will 
> > > be misinterpreted, but I think that is negligible.
> > > 
> > > The logic is similar to what is implemented in other VS Code extensions, 
> > > like 
> > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > > 
> > > Does it make sense?
> > The issue is, you might end up backing up by N bytes, and you might not end 
> > up on an opcode boundary. Lets say you have x86 disassembly like:
> > 
> > ```
> >     0x100002e37 <+183>: 48 8b 4d f8              movq   -0x8(%rbp), %rcx
> >     0x100002e3b <+187>: 48 39 c8                 cmpq   %rcx, %rax
> >     0x100002e3e <+190>: 0f 85 09 00 00 00        jne    0x100002e4d         
> >       ; <+205> at main.cpp
> >     0x100002e44 <+196>: 8b 45 94                 movl   -0x6c(%rbp), %eax
> >     0x100002e47 <+199>: 48 83 c4 70              addq   $0x70, %rsp
> >     0x100002e4b <+203>: 5d                       popq   %rbp
> >     0x100002e4c <+204>: c3                       retq   
> >     0x100002e4d <+205>: e8 7e 0f 00 00           callq  0x100003dd0         
> >       ; symbol stub for: __stack_chk_fail
> >     0x100002e52 <+210>: 0f 0b                    ud2    
> > ```
> > 
> > Let's say you started with the address 0x100002e4c, and backed up by the 
> > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would 
> > start disassembling on a non opcode boundary as this is the last byte of 
> > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly 
> > incorrectly. So we need  to make use of the functions or symbol boundaries 
> > to ensure we are disassembling correctly. If we have no function or symbol, 
> > we can do our best. But as you can see we would get things completely wrong 
> > in this case and we need to fix this as detailed.
> Actually, I was mislead by the fact that so far I tried on both:
> - my ARM machine, with instructions of fixed size (4)
> - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) 
> disassembling to WASM, where, when the start address is in the middle of an 
> instruction, only that first instruction is misinterpreted
> 
> Without going into sections and symbols as you were proposing, the solution 
> can be as easy as done in this VS Code Extension: 
> https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> 
> I'll update the code and put some additional comments to clarify the logic. 
> Let me know what you think.
> Actually, I was mislead by the fact that so far I tried on both:
> - my ARM machine, with instructions of fixed size (4)
> - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) 
> disassembling to WASM, where, when the start address is in the middle of an 
> instruction, only that first instruction is misinterpreted
> 
> Without going into sections and symbols as you were proposing, the solution 
> can be as easy as done in this VS Code Extension: 
> https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> 
> I'll update the code and put some additional comments to clarify the logic. 
> Let me know what you think.




================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+  auto start_addr = base_addr - bytes_offset;
+  const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
eloparco wrote:
> eloparco wrote:
> > clayborg wrote:
> > > eloparco wrote:
> > > > clayborg wrote:
> > > > > This will work if the min and max opcode byte size are the same, like 
> > > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in 
> > > > > thumb mode. So when backing up, we need to do an address lookup on 
> > > > > the address we think we want to go to, and then adjust the starting 
> > > > > address accordingly. Something like:
> > > > > 
> > > > > ```
> > > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > > > ```
> > > > > now we have a section offset address that can tell us more about what 
> > > > > it is. We can find the SBFunction or SBSymbol for this address and 
> > > > > use those to find the right instructions. This will allow us to 
> > > > > correctly disassemble code bytes. 
> > > > > 
> > > > > We can also look at the section that the memory comes from and see 
> > > > > what the section contains. If the section is data, then emit 
> > > > > something like:
> > > > > ```
> > > > > 0x00001000 .byte 0x23
> > > > > 0x00001001 .byte 0x34
> > > > > ...
> > > > > ```
> > > > > To find the section type we can do:
> > > > > ```
> > > > > SBSection section = start_sbaddr.GetSection();
> > > > > if (section.IsValid() && section.GetSectionType() == 
> > > > > lldb::eSectionTypeCode) {
> > > > >  // Disassemble from a valid boundary
> > > > > } else {
> > > > >   // Emit a byte or long at a time with ".byte 0xXX" or other ASM 
> > > > > directive for binary data
> > > > > }
> > > > > ```
> > > > > We need to ensure we start disassembling on the correct instruction 
> > > > > boundary as well as our math for "start_addr" might be in between 
> > > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then 
> > > > > we know we have instructions, and if we are not, then we can emit 
> > > > > ".byte" or other binary data directives. So if we do have 
> > > > > lldb::eSectionTypeCode as our section type, then we should have a 
> > > > > function or symbol, and we can get instructions from those objects 
> > > > > easily:
> > > > > 
> > > > > ```
> > > > > if (section.IsValid() && section.GetSectionType() == 
> > > > > lldb::eSectionTypeCode) {
> > > > >  lldb::SBInstructionList instructions;
> > > > >  lldb::SBFunction function = start_sbaddr.GetFunction();
> > > > >  if (function.IsValid()) {
> > > > >     instructions = function.GetInstructions(g_vsc.target);
> > > > >  } else {
> > > > >     symbol = start_sbaddr.GetSymbol();
> > > > >     if (symbol.IsValid())
> > > > >       instructions = symbol.GetInstructions(g_vsc.target);
> > > > > }
> > > > > const size_t num_instrs = instructions.GetSize();
> > > > > if (num_instrs > 0) {
> > > > >   // we found instructions from a function or symbol and we need to 
> > > > >   // find the matching instruction that we want to start from by 
> > > > > iterating
> > > > >   // over the instructions and finding the right address
> > > > >   size_t matching_idx = num_instrs; // Invalid index
> > > > >   for (size_t i=0; i<num_instrs; ++i) {
> > > > >     lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > > > >     if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) 
> > > > > {
> > > > >       matching_idx = i;
> > > > >       break;
> > > > >     }
> > > > >   }
> > > > >   if (matching_idx < num_instrs) {
> > > > >     // Now we can print the instructions from [matching_idx, 
> > > > > num_instrs)
> > > > >     // then we need to repeat the search for the next function or 
> > > > > symbol. 
> > > > >     // note there may be bytes between functions or symbols which we 
> > > > > can disassemble
> > > > >     // by calling _get_instructions_from_memory(...) but we must find 
> > > > > the next
> > > > >     // symbol or function boundary and get back on track
> > > > >   }
> > > > >   
> > > > > ```
> > > > Sorry, I should have provided a proper explanation.
> > > > 
> > > > I use the maximum instruction size as the "worst case". Basically, I 
> > > > need to read a portion of memory but I do not know the start address 
> > > > and the size. For the start address, if I want to read N instructions 
> > > > before `base_addr` I need to read at least starting from `base_addr - N 
> > > > * max_instruction_size`: if all instructions are of size 
> > > > `max_instruction_size` I will read exactly N instructions; otherwise I 
> > > > will read more than N instructions and prune the additional ones 
> > > > afterwards. Same for applies for the size.
> > > > 
> > > > Since `start_addr` is based on a "worst case", it may be an address in 
> > > > the middle of an instruction. In that case, that first instruction will 
> > > > be misinterpreted, but I think that is negligible.
> > > > 
> > > > The logic is similar to what is implemented in other VS Code 
> > > > extensions, like 
> > > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > > > 
> > > > Does it make sense?
> > > The issue is, you might end up backing up by N bytes, and you might not 
> > > end up on an opcode boundary. Lets say you have x86 disassembly like:
> > > 
> > > ```
> > >     0x100002e37 <+183>: 48 8b 4d f8              movq   -0x8(%rbp), %rcx
> > >     0x100002e3b <+187>: 48 39 c8                 cmpq   %rcx, %rax
> > >     0x100002e3e <+190>: 0f 85 09 00 00 00        jne    0x100002e4d       
> > >         ; <+205> at main.cpp
> > >     0x100002e44 <+196>: 8b 45 94                 movl   -0x6c(%rbp), %eax
> > >     0x100002e47 <+199>: 48 83 c4 70              addq   $0x70, %rsp
> > >     0x100002e4b <+203>: 5d                       popq   %rbp
> > >     0x100002e4c <+204>: c3                       retq   
> > >     0x100002e4d <+205>: e8 7e 0f 00 00           callq  0x100003dd0       
> > >         ; symbol stub for: __stack_chk_fail
> > >     0x100002e52 <+210>: 0f 0b                    ud2    
> > > ```
> > > 
> > > Let's say you started with the address 0x100002e4c, and backed up by the 
> > > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would 
> > > start disassembling on a non opcode boundary as this is the last byte of 
> > > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly 
> > > incorrectly. So we need  to make use of the functions or symbol 
> > > boundaries to ensure we are disassembling correctly. If we have no 
> > > function or symbol, we can do our best. But as you can see we would get 
> > > things completely wrong in this case and we need to fix this as detailed.
> > Actually, I was mislead by the fact that so far I tried on both:
> > - my ARM machine, with instructions of fixed size (4)
> > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) 
> > disassembling to WASM, where, when the start address is in the middle of an 
> > instruction, only that first instruction is misinterpreted
> > 
> > Without going into sections and symbols as you were proposing, the solution 
> > can be as easy as done in this VS Code Extension: 
> > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> > 
> > I'll update the code and put some additional comments to clarify the logic. 
> > Let me know what you think.
> > Actually, I was mislead by the fact that so far I tried on both:
> > - my ARM machine, with instructions of fixed size (4)
> > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) 
> > disassembling to WASM, where, when the start address is in the middle of an 
> > instruction, only that first instruction is misinterpreted
> > 
> > Without going into sections and symbols as you were proposing, the solution 
> > can be as easy as done in this VS Code Extension: 
> > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> > 
> > I'll update the code and put some additional comments to clarify the logic. 
> > Let me know what you think.
> 
> 
@clayborg what I wrote in the comment before this one is what I implemented 
(Diff 6), do you think that is not enough? I'll try it on a x86 linux machine 
to make sure it works there


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140358/new/

https://reviews.llvm.org/D140358

_______________________________________________
lldb-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits

[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Reply via email to