Sounds like a plan.
On Tue, 7 Aug 2018 at 01:30, Leonard Mosescu <mose...@google.com> wrote: >> >> I am fine with all the above except some reservations about case C. No need >> to calculate something if it isn't useful. For case C it should be fine to >> never match as if a file has a UUID to begin with it typically isn't >> something that gets stripped in a stripped binary. So we should either have >> it or not. If breakpad does calculate a CRC32, then we need to know to >> ignore the UUID. The problem is we probably won't be able to tell what the >> UUID is: real from build ID, or from GNU debug info CRC, or CRC of entire >> file. So the minidump code will need to do something here. If a minidump has >> the linux auxv and memory map in them, then we might need to dig through the >> section information and deduce if a file matches or not based off the size >> of mapped program headers to further help with the matching. >> >> One other idea is to make a set of enumerations for the UUID type: >> >> class UUID { >> enum class Type { >> BuildID, // A build ID from the compiler or linker >> GNUDebugInfoCRC, // GNU debug info CRC >> MD5, // MD5 of entire file >> MD5NonDebug, // MD5 of the non debug info related bits >> CRC32, // CRC32 of entire file >> Other, // Anything else >> }; >> }; >> >> The eTypeMD5NonDebug is what apple does: it MD5 checksums only the parts of >> the file that don't change with debug info or any paths found in debug info >> or symbols tables. So if you build a binary in /tmp/a or in >> /private/var/local/foo, the UUID is the same if the binary is essentially >> the same (code, data, etc). >> >> Then we can make intelligent comparisons between UUID types. Might even be >> possible for a module to have more than 1 UUID then if a binary contains a >> eTypeBuildID and a eTypeGNUDebugInfoCRC. If a tool stores its UUIDs as a >> CRC32 or MD5, then those can be calculated on the fly. The GetUUID on >> lldb_private::Module might become: >> >> const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type); >> >> Thoughts? >> >> Greg > > > I like the idea of UUID sub-types! This solves the problem is a more generic > fashion and it's also extensible. Interestingly enough, for Crashpad we're > considering something similar (the fabricated UUIDs would have a different > CvRecord signature) > > Case C. is a bit ugly so let me elaborate: this is specific to Breakpad > minidump + ELF binaries w/o build-id. So we'd still need to have a way to > force the match of the modules in the minidump with the local files. This > ought to be an off-by-default, sharp tool which you'd only need to deal with > Breakpad minidumps, and even then it would only be a fall-back that must be > explicitly requested. > >> 1. .gnu_debuglink separate file pointer >> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>. >> This is where the choice of the crc algorithm comes from. > > > Thanks Pavel. As you noted yourself, this doesn't mean that the UUID has to > be tied to the checksum (they are exclusive options). In particular, for > performance reasons I think it's desirable to avoid calculating checksums for > every ELF module just in case. > >> I think we might have something already which could serve this >> purpose. Eugene added a couple of months ago a mechanism to force-load >> symbols for a given file regardless of the UUIDs >> <https://reviews.llvm.org/D35607>. It requires an explicit "target >> symbols add" command (which seems reasonable, as lldb has no way to >> tell if it's doing things right). Would something like that work for >> you? > > > Nice. We may have to update the C++ API, but something like this would do the > trick for case C. > > To summarize the conversation so far: > > 1. We can fix cases A, B independent of C: if real UUIDs are missing, don't > automatically use full file CRC32 as UUID. > 2. Pay attention to make sure that we don't break .gnu_debuglink or remote > debugging (thanks Pavel) > 3. Multiple types/namespaces for UUIDs would be a great feature! > 4. Eugene's symbol forcing trick could be extended to handle case C > > Did I miss anything? > > My current plan is to start with #1, then look into #4 (force symbols match). > > > On Sun, Aug 5, 2018 at 12:11 PM, Pavel Labath <lab...@google.com> wrote: >> >> Hello Leonard, >> >> while I'm in principle supportive of this idea, I think it's not going >> to be as easy as you might imagine. There are currently at least two >> mechanisms which rely on this crc UUID. >> >> 1. .gnu_debuglink separate file pointer >> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>. >> This is where the choice of the crc algorithm comes from. >> >> In short, this mechanism for debug info location works like this: The >> stripped file contains a .gnu_debuglink section. The section contains >> a file path and a crc checksum. After reading this section the >> debugger is expected to look for the file at the given path, and then >> compute it's checksum to verify it is indeed the correct file (hasn't >> been modified). >> >> In LLDB, this is implemented somewhat differently. First we have a >> mechanism for assigning UUIDs to modules. This mechanism does one of >> two things (I'm assuming here none of the files have proper build-ids >> in them). If the file has a .gnu_debuglink section (the stripped >> file), we fetch the CRC from there, and assign that as the build-id of >> the file. If the file doesn't have this section, we **assume** it is >> going to be a target of gnu_debuglink pointer in another file, compute >> its CRC via the specified algorithm, and assign that as the build-id. >> >> Second, we have a mechanism for locating the unstripped file. This >> just fetches the path from gnu_debuglink, and compares the UUIDs of >> the two modules. This works because of the UUID algorithm in the first >> part. >> >> Now, this is not a particularly smart way of doing things and it is >> the cause of the most of your problems. However, it is completely >> avoidable. Instead of piggy-backing on the UUID mechanism, we should >> just change the search mechanism (the second part) to compute the crc >> itself, and only after it successfully finds the file referenced by >> the gnu_debuglink section. This way, we will only compute the crc only >> when absolutely necessary (i.e. for your use case, never). >> >> >> 2. Currently, for remote debugging, we assume that each module has >> some sort of a unique identifier which we can check to see whether we >> have downloaded that file already (see qModuleInfo packet). I am not >> sure what would happen if we suddenly just stopped computing a UUID >> for these files, but we at the very least lose the ability to detect >> changes to the remote files. For this item, I am not sure what would >> be the best course of action. Maybe we should just start relying on >> modification timestamp for these files? >> >> On Sat, 4 Aug 2018 at 02:17, Leonard Mosescu <mose...@google.com> wrote: >> > >> > Greg, Mark, >> > >> > Looking at the code, LLDB falls back to a full file crc32 to create the >> > module UUID if the ELF build-id is missing. This works, in the sense that >> > the generated UUID does indeed identify the module. >> > >> > But there are a few problems with this approach: >> > >> > 1. First, runtime performance: a full file crc32 is a terribly inefficient >> > way to generate a temporary UUID that is basically just used to match a >> > local file to itself. >> > - especially when some unstripped binaries can be very large. for example >> > a local chromium build produces a 5.3Gb chrome binary >> > - the crc32 implementation is decent, but single-threaded >> > - to add insult to the injury, it seems a small bug defeats the intention >> > to cache the hash value so it ends up being recalculated multiple times >> > >> > 2. The fake UUID is not going to match any external UUID that may be >> > floating around (and yet not properly embedded into the binary) >> > - an example is Breakpad, which unfortunately also attempts to make up >> > UUIDs when the build-id is missing (something we'll hopefully fix soon) >> > >> > Is there a fundamental reason to calculate the full file crc32? If not I >> > propose to improve this based on the following observations: >> > >> > A. Model the reality more accurately: an ELF w/o a build-id doesn't really >> > have an UUID. So use a zero-length UUID in LLDB. >> > B. The full file name should be enough to prove the identity of a local >> > module. >> > C. If we try to match an external UUID (ex. from a minidump) with a local >> > file which does not have an UUID it may help to have an option to allow it >> > to match (off by default, and only if there's no better match) >> >> I think we might have something already which could serve this >> purpose. Eugene added a couple of months ago a mechanism to force-load >> symbols for a given file regardless of the UUIDs >> <https://reviews.llvm.org/D35607>. It requires an explicit "target >> symbols add" command (which seems reasonable, as lldb has no way to >> tell if it's doing things right). Would something like that work for >> you? >> >> cheers, >> pl > > _______________________________________________ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev