clayborg added inline comments.

================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto max_instruction_size = g_vsc.target.GetMaximumOpcodeByteSize();
+  const auto bytes_offset = -instruction_offset * max_instruction_size;
+  auto start_addr = base_addr - bytes_offset;
----------------
eloparco wrote:
> clayborg wrote:
> > Just checked out your changes, and you are still just subtracting a value 
> > from the start address and attempting to disassemble from memory which is 
> > the problem. We need to take that subtracted address, and look it up as 
> > suggested in previous code examples I posted. If you find a function to 
> > symbol, ask those objects for their instructions. and then try to use 
> > those. 
> > 
> > But basically for _any_ disassembly this is what I would recommend doing:
> > - first resolve the "start_address" (no matter how you come up the address) 
> > that want to disassemble into a SBAddress
> > - check its section. If the section is valid and contains instructions, 
> > call a function that will disassemble the address range for the section 
> > that starts at "start_address" and ends at the end of the section. We can 
> > call this "disassemble_code" as a function. More details on this below
> > - If the section does not contain instructions, just read the bytes and 
> > emit a lines like:
> > ```
> > 0x1000 .byte 0x12
> > 0x1000 .byte 0x34
> > ...
> > ```
> > 
> > Now for the disassemble_code function. We know the address range for this 
> > is in code. We then need to resolve the address passed to 
> > "disassemble_code" into a SBAddress and ask that address for a SBFunction 
> > or SBSymbol as I mentioned. Then we ask the SBFunction or SBSymbol for all 
> > instructions that they contain, and then use any instructions that fall 
> > into the range we have. If there is no SBFunction or SBSymbol, then 
> > disassemble an instruction at a time and then see if the new address will 
> > resolve to a function or symbol.
> Tried my changes on a linux x86 machine and the loop `for (unsigned i = 0; i 
> < max_instruction_size; i++) {` (L2190) takes care of the `start_address` 
> possibly being in the middle of an instruction, so that's not a problem.  The 
> problem I faced is that it tries to read too far from `base_addr` and the 
> `ReadMemory()` operation returns few instructions (without reaching 
> `base_addr`). That was not happening on my macOS M1 (arm) machine. 
> 
> To solve, I changed the loop at L2190 to
> ```
> for (unsigned i = 0; i < bytes_offset; i++) {
>     auto sb_instructions =
>         _get_instructions_from_memory(start_addr + i, disassemble_bytes);
> ```
> and if `start_addr` is in `sb_instructions` we're done and can exit the loop. 
> That worked.
> 
> Another similar thing that can be done is to start from `start_sbaddr` as you 
> were saying, increment the address until a valid section is found. Then call 
> `_get_instructions_from_memory()` passing the section start.
> What do you think? Delegating the disassembling to `ReadMemory()` + 
> `GetInstructions()` looks simpler to me than to manually iterate over 
> sections and get instructions from symbols and functions.
> Is there any shortcoming I'm not seeing?
so your 
```
for (unsigned i = 0; i < max_instruction_size; i++) { 
```
disassembles and tries to make sure you make it back to the original base 
address from the original disassemble packet? That can work but could a a bit 
time consuming?

The main issue, as you saw on x86, is you don't know what is in memory. You 
could have unreadable areas of memory when trying to disassemble. Also if you 
do have good memory that does contain instructions, there can be padding in 
between functions or even data between functions that the function accesses 
that can't be correctly disassembled and could throw things off again. 

The memory regions are the safest way to traverse memory to see what you have 
and would help you deal with holes in the memory. You can ask about a memory 
region with:
```
lldb::SBError SBProcess::GetMemoryRegionInfo(lldb::addr_t load_addr, 
lldb::SBMemoryRegionInfo &region_info);
```
If you ask about an invalid address, like zero on most platforms, it will 
return a valid "region_info" with the bounds of the unreadable address range 
filled in no read/write/execute permissions:
```
(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> region = lldb.SBMemoryRegionInfo()
>>> err = lldb.process.GetMemoryRegionInfo(0, region)
[0x0000000000000000-0x0000000100000000 ---]
```
So you could use the memory region info to your advantage here. If you have 
execute permissions, disassemble as instructions, and if you don't emit ".byte 
0xXX" for each byte. If there are no permissions, you can emit some other 
string like "0x00000000: <invalid memory>". 

That being said, even when you do find an executable section of memory, there 
can be different stuff there even if a section _is_ executable. For instance, 
if we ask about the next memory region on a mac M1:
```
>>> err = lldb.process.GetMemoryRegionInfo(0x0000000100000000, region)
>>> print(region)
[0x0000000100000000-0x0000000100004000 R-X]
```
Notice that this is read + execute (you can access these via:
```
region.IsReadable()
region.IsWritable()
region.IsExecutable()
```
But at this address, this is the mach-o header which doesn't make sense to try 
and disassemble:
```
(lldb) memory read 0x0000000100000000
0x100000000: cf fa ed fe 0c 00 00 01 00 00 00 00 02 00 00 00  ................
0x100000010: 11 00 00 00 18 03 00 00 85 00 20 00 00 00 00 00  .......... .....
(lldb) memory read -fx -s4 -c 4 0x0000000100000000
0x100000000: 0xfeedfacf 0x0100000c 0x00000000 0x00000002
```
0xfeedfacf is the mach-o magic bytes for little endian 64 bit mach-o files.
```
(lldb) disassemble --start-address 0x0000000100000000
a.out`_mh_execute_header:
    0x100000000 <+0>:  .long  0xfeedfacf                ; unknown opcode
    0x100000004 <+4>:  .long  0x0100000c                ; unknown opcode
    0x100000008 <+8>:  udf    #0x0
    0x10000000c <+12>: udf    #0x2
    0x100000010 <+16>: udf    #0x11
    0x100000014 <+20>: udf    #0x318
    0x100000018 <+24>: .long  0x00200085                ; unknown opcode
    0x10000001c <+28>: udf    #0x0
```

So this is the main reason why I would suggest just disassembling using .byte 
or .long when we aren't in a function or symbol.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140358/new/

https://reviews.llvm.org/D140358

_______________________________________________
lldb-commits mailing list
lldb-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits

Reply via email to