eloparco added inline comments.
================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+ const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+ auto start_addr = base_addr - bytes_offset;
+ const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
eloparco wrote:
> clayborg wrote:
> > eloparco wrote:
> > > clayborg wrote:
> > > > This will work if the min and max opcode byte size are the same, like
> > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in
> > > > thumb mode. So when backing up, we need to do an address lookup on the
> > > > address we think we want to go to, and then adjust the starting address
> > > > accordingly. Something like:
> > > >
> > > > ```
> > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > > ```
> > > > now we have a section offset address that can tell us more about what
> > > > it is. We can find the SBFunction or SBSymbol for this address and use
> > > > those to find the right instructions. This will allow us to correctly
> > > > disassemble code bytes.
> > > >
> > > > We can also look at the section that the memory comes from and see what
> > > > the section contains. If the section is data, then emit something like:
> > > > ```
> > > > 0x00001000 .byte 0x23
> > > > 0x00001001 .byte 0x34
> > > > ...
> > > > ```
> > > > To find the section type we can do:
> > > > ```
> > > > SBSection section = start_sbaddr.GetSection();
> > > > if (section.IsValid() && section.GetSectionType() ==
> > > > lldb::eSectionTypeCode) {
> > > > // Disassemble from a valid boundary
> > > > } else {
> > > > // Emit a byte or long at a time with ".byte 0xXX" or other ASM
> > > > directive for binary data
> > > > }
> > > > ```
> > > > We need to ensure we start disassembling on the correct instruction
> > > > boundary as well as our math for "start_addr" might be in between
> > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then
> > > > we know we have instructions, and if we are not, then we can emit
> > > > ".byte" or other binary data directives. So if we do have
> > > > lldb::eSectionTypeCode as our section type, then we should have a
> > > > function or symbol, and we can get instructions from those objects
> > > > easily:
> > > >
> > > > ```
> > > > if (section.IsValid() && section.GetSectionType() ==
> > > > lldb::eSectionTypeCode) {
> > > > lldb::SBInstructionList instructions;
> > > > lldb::SBFunction function = start_sbaddr.GetFunction();
> > > > if (function.IsValid()) {
> > > > instructions = function.GetInstructions(g_vsc.target);
> > > > } else {
> > > > symbol = start_sbaddr.GetSymbol();
> > > > if (symbol.IsValid())
> > > > instructions = symbol.GetInstructions(g_vsc.target);
> > > > }
> > > > const size_t num_instrs = instructions.GetSize();
> > > > if (num_instrs > 0) {
> > > > // we found instructions from a function or symbol and we need to
> > > > // find the matching instruction that we want to start from by
> > > > iterating
> > > > // over the instructions and finding the right address
> > > > size_t matching_idx = num_instrs; // Invalid index
> > > > for (size_t i=0; i<num_instrs; ++i) {
> > > > lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > > > if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) {
> > > > matching_idx = i;
> > > > break;
> > > > }
> > > > }
> > > > if (matching_idx < num_instrs) {
> > > > // Now we can print the instructions from [matching_idx, num_instrs)
> > > > // then we need to repeat the search for the next function or
> > > > symbol.
> > > > // note there may be bytes between functions or symbols which we
> > > > can disassemble
> > > > // by calling _get_instructions_from_memory(...) but we must find
> > > > the next
> > > > // symbol or function boundary and get back on track
> > > > }
> > > >
> > > > ```
> > > Sorry, I should have provided a proper explanation.
> > >
> > > I use the maximum instruction size as the "worst case". Basically, I need
> > > to read a portion of memory but I do not know the start address and the
> > > size. For the start address, if I want to read N instructions before
> > > `base_addr` I need to read at least starting from `base_addr - N *
> > > max_instruction_size`: if all instructions are of size
> > > `max_instruction_size` I will read exactly N instructions; otherwise I
> > > will read more than N instructions and prune the additional ones
> > > afterwards. Same for applies for the size.
> > >
> > > Since `start_addr` is based on a "worst case", it may be an address in
> > > the middle of an instruction. In that case, that first instruction will
> > > be misinterpreted, but I think that is negligible.
> > >
> > > The logic is similar to what is implemented in other VS Code extensions,
> > > like
> > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > >
> > > Does it make sense?
> > The issue is, you might end up backing up by N bytes, and you might not end
> > up on an opcode boundary. Lets say you have x86 disassembly like:
> >
> > ```
> > 0x100002e37 <+183>: 48 8b 4d f8 movq -0x8(%rbp), %rcx
> > 0x100002e3b <+187>: 48 39 c8 cmpq %rcx, %rax
> > 0x100002e3e <+190>: 0f 85 09 00 00 00 jne 0x100002e4d
> > ; <+205> at main.cpp
> > 0x100002e44 <+196>: 8b 45 94 movl -0x6c(%rbp), %eax
> > 0x100002e47 <+199>: 48 83 c4 70 addq $0x70, %rsp
> > 0x100002e4b <+203>: 5d popq %rbp
> > 0x100002e4c <+204>: c3 retq
> > 0x100002e4d <+205>: e8 7e 0f 00 00 callq 0x100003dd0
> > ; symbol stub for: __stack_chk_fail
> > 0x100002e52 <+210>: 0f 0b ud2
> > ```
> >
> > Let's say you started with the address 0x100002e4c, and backed up by the
> > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would
> > start disassembling on a non opcode boundary as this is the last byte of
> > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly
> > incorrectly. So we need to make use of the functions or symbol boundaries
> > to ensure we are disassembling correctly. If we have no function or symbol,
> > we can do our best. But as you can see we would get things completely wrong
> > in this case and we need to fix this as detailed.
> Actually, I was mislead by the fact that so far I tried on both:
> - my ARM machine, with instructions of fixed size (4)
> - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime)
> disassembling to WASM, where, when the start address is in the middle of an
> instruction, only that first instruction is misinterpreted
>
> Without going into sections and symbols as you were proposing, the solution
> can be as easy as done in this VS Code Extension:
> https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
>
> I'll update the code and put some additional comments to clarify the logic.
> Let me know what you think.
> Actually, I was mislead by the fact that so far I tried on both:
> - my ARM machine, with instructions of fixed size (4)
> - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime)
> disassembling to WASM, where, when the start address is in the middle of an
> instruction, only that first instruction is misinterpreted
>
> Without going into sections and symbols as you were proposing, the solution
> can be as easy as done in this VS Code Extension:
> https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
>
> I'll update the code and put some additional comments to clarify the logic.
> Let me know what you think.
================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+ const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+ auto start_addr = base_addr - bytes_offset;
+ const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
eloparco wrote:
> eloparco wrote:
> > clayborg wrote:
> > > eloparco wrote:
> > > > clayborg wrote:
> > > > > This will work if the min and max opcode byte size are the same, like
> > > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in
> > > > > thumb mode. So when backing up, we need to do an address lookup on
> > > > > the address we think we want to go to, and then adjust the starting
> > > > > address accordingly. Something like:
> > > > >
> > > > > ```
> > > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > > > ```
> > > > > now we have a section offset address that can tell us more about what
> > > > > it is. We can find the SBFunction or SBSymbol for this address and
> > > > > use those to find the right instructions. This will allow us to
> > > > > correctly disassemble code bytes.
> > > > >
> > > > > We can also look at the section that the memory comes from and see
> > > > > what the section contains. If the section is data, then emit
> > > > > something like:
> > > > > ```
> > > > > 0x00001000 .byte 0x23
> > > > > 0x00001001 .byte 0x34
> > > > > ...
> > > > > ```
> > > > > To find the section type we can do:
> > > > > ```
> > > > > SBSection section = start_sbaddr.GetSection();
> > > > > if (section.IsValid() && section.GetSectionType() ==
> > > > > lldb::eSectionTypeCode) {
> > > > > // Disassemble from a valid boundary
> > > > > } else {
> > > > > // Emit a byte or long at a time with ".byte 0xXX" or other ASM
> > > > > directive for binary data
> > > > > }
> > > > > ```
> > > > > We need to ensure we start disassembling on the correct instruction
> > > > > boundary as well as our math for "start_addr" might be in between
> > > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then
> > > > > we know we have instructions, and if we are not, then we can emit
> > > > > ".byte" or other binary data directives. So if we do have
> > > > > lldb::eSectionTypeCode as our section type, then we should have a
> > > > > function or symbol, and we can get instructions from those objects
> > > > > easily:
> > > > >
> > > > > ```
> > > > > if (section.IsValid() && section.GetSectionType() ==
> > > > > lldb::eSectionTypeCode) {
> > > > > lldb::SBInstructionList instructions;
> > > > > lldb::SBFunction function = start_sbaddr.GetFunction();
> > > > > if (function.IsValid()) {
> > > > > instructions = function.GetInstructions(g_vsc.target);
> > > > > } else {
> > > > > symbol = start_sbaddr.GetSymbol();
> > > > > if (symbol.IsValid())
> > > > > instructions = symbol.GetInstructions(g_vsc.target);
> > > > > }
> > > > > const size_t num_instrs = instructions.GetSize();
> > > > > if (num_instrs > 0) {
> > > > > // we found instructions from a function or symbol and we need to
> > > > > // find the matching instruction that we want to start from by
> > > > > iterating
> > > > > // over the instructions and finding the right address
> > > > > size_t matching_idx = num_instrs; // Invalid index
> > > > > for (size_t i=0; i<num_instrs; ++i) {
> > > > > lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > > > > if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr)
> > > > > {
> > > > > matching_idx = i;
> > > > > break;
> > > > > }
> > > > > }
> > > > > if (matching_idx < num_instrs) {
> > > > > // Now we can print the instructions from [matching_idx,
> > > > > num_instrs)
> > > > > // then we need to repeat the search for the next function or
> > > > > symbol.
> > > > > // note there may be bytes between functions or symbols which we
> > > > > can disassemble
> > > > > // by calling _get_instructions_from_memory(...) but we must find
> > > > > the next
> > > > > // symbol or function boundary and get back on track
> > > > > }
> > > > >
> > > > > ```
> > > > Sorry, I should have provided a proper explanation.
> > > >
> > > > I use the maximum instruction size as the "worst case". Basically, I
> > > > need to read a portion of memory but I do not know the start address
> > > > and the size. For the start address, if I want to read N instructions
> > > > before `base_addr` I need to read at least starting from `base_addr - N
> > > > * max_instruction_size`: if all instructions are of size
> > > > `max_instruction_size` I will read exactly N instructions; otherwise I
> > > > will read more than N instructions and prune the additional ones
> > > > afterwards. Same for applies for the size.
> > > >
> > > > Since `start_addr` is based on a "worst case", it may be an address in
> > > > the middle of an instruction. In that case, that first instruction will
> > > > be misinterpreted, but I think that is negligible.
> > > >
> > > > The logic is similar to what is implemented in other VS Code
> > > > extensions, like
> > > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > > >
> > > > Does it make sense?
> > > The issue is, you might end up backing up by N bytes, and you might not
> > > end up on an opcode boundary. Lets say you have x86 disassembly like:
> > >
> > > ```
> > > 0x100002e37 <+183>: 48 8b 4d f8 movq -0x8(%rbp), %rcx
> > > 0x100002e3b <+187>: 48 39 c8 cmpq %rcx, %rax
> > > 0x100002e3e <+190>: 0f 85 09 00 00 00 jne 0x100002e4d
> > > ; <+205> at main.cpp
> > > 0x100002e44 <+196>: 8b 45 94 movl -0x6c(%rbp), %eax
> > > 0x100002e47 <+199>: 48 83 c4 70 addq $0x70, %rsp
> > > 0x100002e4b <+203>: 5d popq %rbp
> > > 0x100002e4c <+204>: c3 retq
> > > 0x100002e4d <+205>: e8 7e 0f 00 00 callq 0x100003dd0
> > > ; symbol stub for: __stack_chk_fail
> > > 0x100002e52 <+210>: 0f 0b ud2
> > > ```
> > >
> > > Let's say you started with the address 0x100002e4c, and backed up by the
> > > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would
> > > start disassembling on a non opcode boundary as this is the last byte of
> > > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly
> > > incorrectly. So we need to make use of the functions or symbol
> > > boundaries to ensure we are disassembling correctly. If we have no
> > > function or symbol, we can do our best. But as you can see we would get
> > > things completely wrong in this case and we need to fix this as detailed.
> > Actually, I was mislead by the fact that so far I tried on both:
> > - my ARM machine, with instructions of fixed size (4)
> > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime)
> > disassembling to WASM, where, when the start address is in the middle of an
> > instruction, only that first instruction is misinterpreted
> >
> > Without going into sections and symbols as you were proposing, the solution
> > can be as easy as done in this VS Code Extension:
> > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> >
> > I'll update the code and put some additional comments to clarify the logic.
> > Let me know what you think.
> > Actually, I was mislead by the fact that so far I tried on both:
> > - my ARM machine, with instructions of fixed size (4)
> > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime)
> > disassembling to WASM, where, when the start address is in the middle of an
> > instruction, only that first instruction is misinterpreted
> >
> > Without going into sections and symbols as you were proposing, the solution
> > can be as easy as done in this VS Code Extension:
> > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.
> >
> > I'll update the code and put some additional comments to clarify the logic.
> > Let me know what you think.
>
>
@clayborg what I wrote in the comment before this one is what I implemented
(Diff 6), do you think that is not enough? I'll try it on a x86 linux machine
to make sure it works there
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D140358/new/
https://reviews.llvm.org/D140358
_______________________________________________
lldb-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits