> On Oct 22, 2020, at 1:25 AM, Jason Molenda <jmole...@apple.com> wrote: > > Hi Greg, Pavel. > > I think it's worth saying that this is very early in this project. We know > we're going to need the ability to track segments on addresses, but honestly > a lot of the finer details aren't clear yet. It's such a fundamental change > that we wanted to start a discussion, even though I know it's hard to have > detailed discussions still. > > In the envisioned environment, there will be a default segment, and most > addresses will be in the default segment. DWARF, user input (lldb cmdline), > SB API, and clang expressions are going to be the places where segments are > specified --- Dump methods and ProcessGDBRemote will be the main place where > the segments are displayed/used. There will be modifications to the memory > read/write gdb RSP packets to include these. > > This early in the project, it's hard to tell what will be upstreamed to the > llvm.org <http://llvm.org/> monorepo, or when. My personal opinion is that > we don't actually want to add segment support to llvm.org <http://llvm.org/> > lldb at this point. We'd be initializing every address object with > LLDB_INVALID_SEGMENT or LLDB_DEFAULT_SEGMENT, and then testing that each > object is initialized this way? I don't see this actually being useful. > > However, changing lldb's target addresses to be strictly handled in terms of > objects will allow us to add a segment discriminator ivar to Address and > ProcessAddress on our local branch while this is in development, and minimize > the places where we're diverging from the llvm.org <http://llvm.org/> > sources. We'll need to have local modifications at the places where a > segment is input (DWARF, cmdline, SB API, compiler type) or output (Dump, > ProcesssGDBRemote) and, hopefully, the vast majority of lldb can be > unmodified. > > The proposal was written in terms of what we need to accomplish based on our > current understanding for this project, but I think there will be a lot of > details figured out as we get more concrete experience of how this all works. > And when it's appropriate to upstream to llvm.org <http://llvm.org/>, we'll > be better prepared to discuss the tradeoffs of the approaches we took in > extending Address/ProcessAddress to incorporate a segment. > > My hope is that these generic OO'ification of target addresses will not > change lldb beyond moving off of addr_t for now. > > I included a couple of inlined comments, but I need to address more of yours > & Pavel's notes later, I've been dealing with a few crazy things and am way > behind on emails but didn't want to wait any longer to send something out.
No worries! I would vote to upstream as much as possible as soon as possible to avoid differences and merging issues for you guys. I would really like to see LLDB have support for segmented address spaces. Many comments I made were just my thinking out loud and trying to ease the changes in with as little disruption as possible. > > > >> On Oct 19, 2020, at 4:11 PM, Greg Clayton via lldb-dev >> <lldb-dev@lists.llvm.org> wrote: >> >> >> >>> On Oct 19, 2020, at 2:56 PM, Jonas Devlieghere via lldb-dev >>> <lldb-dev@lists.llvm.org> wrote: >>> >>> We want to support segmented address spaces in LLDB. Currently, all of >>> LLDB’s external API, command line interface, and internals assume that an >>> address in memory can be addressed unambiguously as an addr_t (aka >>> uint64_t). To support a segmented address space we’d need to extend addr_t >>> with a discriminator (an aspace_t) to uniquely identify a location in >>> memory. This RFC outlines what would need to change and how we propose to >>> do that. >>> >>> ### Addresses in LLDB >>> >>> Currently, LLDB has two ways of representing an address: >>> >>> - Address object. Mostly represents addresses as Section+offset for a >>> binary image loaded in the Target. An Address in this form can persist >>> across executions, e.g. an address breakpoint in a binary image that loads >>> at a different address every execution. An Address object can represent >>> memory not mapped to a binary image. Heap, stack, jitted items, will all be >>> represented as the uint64_t load address of the object, and cannot persist >>> across multiple executions. You must have the Target object available to >>> get the current load address of an Address object in the current process >>> run. Some parts of lldb do not have a Target available to them, so they >>> require that the Address can be devolved to an addr_t (aka uint64_t) and >>> passed in. >>> - The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. >>> from a user on the command line) or when interacting with the inferior >>> (reading/writing memory) for addresses that need not persist across runs. >>> Also used when reading DWARF and in our symbol tables to represent file >>> offset addresses, where the size of an Address object would be >>> objectionable. >>> >> >> Correction: LLDB has 3 kinds of uint64_t addresses: >> - "file address" which are always mapped to a section + offset if put into a >> Address object. This value only makes sense to the lldb_private::Module that >> contains it. The only way to pass this around is as a lldb_private::Address. >> You can make queries on a file address using "image lookup --address" before >> you are debugging, but a single file address can result in multiple matches >> in multiple modules because each module might contain something at this >> virtual address. This object might be able to be converted to a "load >> address" if the section is loaded in your debug target. Since the target >> contains the section load list, the target is needed when converting between >> Address and addr_t objects. >> - "load address" which is guaranteed to be unique in a process with no >> segments. It can always be put into a lldb_private::Address object, but that >> object won't always have a section. If there is no section, it means the >> memory location maps to stack, heap, or other memory that doesn't reside in >> a object file section. This object might be able to be converted to a >> section + offset address if the address matches one of the loaded sections >> in a target. If this can be converted to a Address object that has a >> section, then it can persist across debug sessions, otherwise, not. >> - "host address" which is a pointer to memory in the LLDB process itself. >> Used for storing expression results and other things. You cannot convert >> this to/from a "file" or "load" address. > > Yes, good point, host memory is a third type of address that we use. And our > symbols tables, for instance, internally represent themselves as uint64_t > offsets into the file or section, I forget which, and we're not talking about > changing those uint64_t style addresses. On our project, I do not believe > the symbol table will give us segment information. You should be able to classify symbols from the symbol table to a segment though right? We could add a special define for host addresses if needed. > > > >> >>> ## Proposal >>> >>> ### Address + ProcessAddress >>> >>> - The Address object gains a segment discriminator member variable. >>> Everything that creates an Address will need to provide this segment >>> discriminator. >> >> So an interesting thing to think about is if lldb_private::Section object >> should contain a segment identifier? If this is the case, then an Address >> object can have a Section that has a segment _and_ the Address object itself >> might have one that was set from the section as well. It would be good to >> figure out what the rules are for this case and it might lead to the need >> for an intelligent accessor that always prefers the section's segment if a >> section is available. The Address object must have one in case we have a >> pointer to memory in data and there is no section for this (like any heap >> addresses). > > I don't believe a Section in this project will have a segment. We're looking > purely at individual variables, primarily from debug information. So if you have a global variable, it will have a symbol right? And it will have debug info. Are you saying that only the debug info would have segment info? It seems important to be able to view a global variable without debug info. > > >>> - A ProcessAddress object which is a uint64_t and a segment discriminator >>> as a replacement for addr_t. ProcessAddress objects would not persist >>> across multiple executions. Similar to how you can create an addr_t from an >>> Address+Target today, you can create a ProcessAddress given an >>> Address+Target. When we pass around addr_ts today, they would be replaced >>> with ProcessAddress, with the exception of symbol tables where the added >>> space would be significant, and we do not believe we need segment >>> discriminators today. >> >> Would SegmentedAddress be a more descriptive name here? >> >> A few things I would like to see on ProcessAddress or SegmentedAddress: >> - Have a segment definition that says "no segment" like LLDB_INVALID_SEGMENT >> or LLDB_NO_SEGMENT and allow these objects to be constructed with just a >> lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT > >> - Any code that uses these should test if there is no segment and continue >> to do what they used to do before >> - like read/write memory in ProcessGDBRemote > > > To be honest, testing this is going to be one of the tricky things I'm not > sure how we'll do. we will have a default segment that addresses will use > unless overridden, but how we spot places that *incorrectly* failed to > initialize the segment of an Address/ProcessAddress is something we're going > to need to figure out. Tests I can think of: - read function disassembly from a code segment that would have the same address as something from a data segment - read a variable from a data segment that would have the same address as something from a code segment It would be good to figure out where segments are going to come from. I would hope that some sections would be able to be mapped to certain segments so that we can live with a binary that has no debug info and still read say a global variable. I know we can do things right inside of the debug info since DWARF can have segment information. > > >> - Anything that dumps one of these objects should dump just like they used >> to (just a uint64_t hex representation and no other notation) >> - Add code that can convert a "load address" into a ProcessAddress or >> SegmentedAddress that invent the segment notation and have no changes for >> targets that don't support segmented address spaces >> - 0x1000 should convert to ProcessAddress where the address is 0x1000 and >> segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn't >> support segmented addresses > > >> - 0x1000 would return an error on conversion for processes that do support >> segmented addresses as the segment must be specified? Or should there be a >> default segment if we run into this case? >> - Come up with some quick way to represent segmented addresses for an >> address of 0x1000 in segment 2: ideas: >> - [2]0x1000 >> - {2}0x1000 >> - 0x1000[2] >> - 0x1000{2} >> - {0x1000, 2} > > To be honest, we haven't thought about the UI side of this very much yet. I > think there will be ABI or ArchSpec style information that maps segment > numbers to human-understandable names. It's ABI style enumerated numbers - > the DWARF will include a number that is passed down to the remote gdb stub. Should be fine to have named segments if needed, but we will need to come up with a way to specify a segment. We could always add new arguments to command line commands if needed. > > >> >>> >>> ### Address Only >>> >>> Extend the lldb_private::Address class to be the one representation of >>> locations; including file based ones valid before running, file addresses >>> resolved in a process, and process specific addresses (heap/stack/JIT code) >>> that are only valid during a run. That is attractive because it would >>> provide a uniform interface to any “where is something” question you would >>> ask, either about symbols in files, variables in stack frames, etc. >>> >>> At present, when we resolve a Section+Offset Address to a “load address” we >>> provide a Target to the resolution API. Providing the Target externally >>> makes sense because a Target knows whether the Section is present or not >>> and can unambiguously return a load address. We could continue that >>> approach since the Target always holds only one process, or extend it to >>> allow passing in a Process when resolving non-file backed addresses. But >>> this would make the conversion from addr_t uses to Address uses more >>> difficult, since we will have to push the Target or Process into all the >>> API’s that make use of just an addr_t. Using a single Address class seems >>> less attractive when you have to provide an external entity to make sense >>> of it at all the use sites. >>> >>> We could improve this situation by including a Process (as a weak pointer) >>> and fill that in on the boundaries where in the current code we go from an >>> Address to a process specific addr_t. That would make the conversion >>> easier, but add complexity. Since Addresses are ubiquitous, you won’t know >>> what any given Address you’ve been handed actually contains. It could even >>> have been resolved for another process than the current one. Making >>> Address usage-dependent in this way reduces the attractiveness of the >>> solution. >>> >>> ## Approach >>> >>> Replacing all the instances of addr_t by hand would be a lot of work. >>> Therefore we propose writing a clang-based tool to automate this menial >>> task. The tool would update function signatures and replace uses of addr_t >>> inside those functions to get the addr_t from the ProcessAddress or Address >>> and return the appropriate object for functions that currently return an >>> addr_t. The goal of this tool is to generate one big NFC patch. This tool >>> needs not be perfect, at some point it will be more work to improve the >>> tool than fixing up the remaining code by hand. After this patch LLDB would >>> still not really understand address spaces but it will have everything in >>> place to support them. >> >> This won't be NFC really as each location that plays with what used to be >> addr_t now must check if the segment is invalid before doing what it did >> before _and_ return an error if the segment is something valid. >> >> It might be better to look at all of the APIs that could end up using a >> plain "addr_t" and adding new APIs that take a ProcessAddress and call the >> old API if the segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT, and >> return an error if the segment is valid. For example in the Process class we >> have: >> >> virtual size_t Process::DoReadMemory(lldb::addr_t vm_addr, void *buf, size_t >> size, Status &error) = 0; >> >> We could add a new overload: >> >> virtual size_t Process::DoReadMemory(ProcessAddress proc_addr, void *buf, >> size_t size, Status &error) { >> if (proc_addr.GetSegment() == LLDB_NO_SEGMENT) >> return DoReadMemory(proc_addr.GetAddress(), but, size, error); >> error.SetErrorString("segmented addresses are not supported on this >> process"); >> return 0 >> } >> >> Then we can start modifying the locations that need to support segmented >> addresses as needed. For instance, if we were to add segmented address >> support to ProcessGDBRemote, then we would override this function in that >> class. >> >> I am not sure if slowly adding this functionality is better than replacing >> this all right away, but we can't just do a global replace without adding >> functionality or error checking IMHO. >> >> >>> Once all the APIs are updated, we can start working on the functional >>> changes. This means actually interpreting the aspace_t values and making >>> sure they don’t get dropped. >>> >>> Finally, when all this work is done and we’re happy with the approach, we >>> extend the SB API with overloads for the functions that currently take or >>> return addr_t . I want to do this last so we have time to iterate before >>> committing to a stable interface. >> >> This might be one reason for doing the approach suggested above where we add >> new internal APIs that take a ProcessAddress and cut over to using them. As >> it would mean all of the current APIs in the lldb::SB layer would remain in >> place (they can't be removed) and would still make sense. >> >>> >>> ## Testing >>> >>> By splitting off the intrusive non-functional changes we are able to rely >>> on the existing tests for coverage. Smaller functional changes can be >>> tested in isolation, either through a unit test or a small GDB remote test. >>> For end-to-end testing we can run the test suite with a modified >>> debugserver that spoofs address spaces. >> >> That makes sense. ProcessGDBRemote will need to dynamically respond with >> wether it supports segmented addresses by overloading the DoReadMemory that >> takes a ProcessAddress and do the right thing. >> >> Thanks for taking this on. I hope some of the comments above help moving >> this forward. >> >> Greg >> >>> >>> Thanks, >>> Jonas >>> >>> _______________________________________________ >>> lldb-dev mailing list >>> lldb-dev@lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> >> _______________________________________________ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org <mailto:lldb-dev@lists.llvm.org> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> <https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev>
_______________________________________________ lldb-dev mailing list lldb-dev@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev