dexonsmith added a comment. In D80383#2068596 <https://reviews.llvm.org/D80383#2068596>, @dang wrote:
> I made the decision to make all relative offsets relative to the start of the > AST block rather than their own sub-block or a neighboring block, in order to > prevent adding complexity in `ASTReader`. My reasoning that it would be best > if the reader didn't have to keep track of a bunch of offsets inside the > bitcode at all times, but that it could just remember the offset of the AST > block once and always use that for relative offsets. I agree that > conceptually making the sub-blocks relocatable too would have been nice, but > I figured it wasn't worth the extra complexity in the reader. I think the extra complexity is small and worth it. The default abbreviation is 7-VBR, which is cheaper for small offsets. I doubt we'll bother to come back and fix this later, so might as well get it right now. ================ Comment at: clang/include/clang/Basic/Module.h:60 + static Expected<ASTFileSignature> Create(StringRef Bytes) { + if (Bytes.size() != 20) ---------------- Can you make this `create`? I think new code we prefer `lowerCamelCase` for method names. ================ Comment at: clang/include/clang/Basic/Module.h:66-72 + for (unsigned I = 0; I != 5; ++I) { + const unsigned BitOff = I * 4; + Signature[I] = ((uint32_t)Bytes[BitOff + 0] << 24) | + ((uint32_t)Bytes[BitOff + 1] << 16) | + ((uint32_t)Bytes[BitOff + 2] << 8) | + ((uint32_t)Bytes[BitOff + 3]); + } ---------------- This code looks correct, and it's better factored out like this, but I still don't understand why we convert to 32-bit numbers. Have you looked at the code simplification from storing this as an array of 8-bit numbers? ================ Comment at: clang/include/clang/Serialization/ASTBitCodes.h:396-400 /// Record code for the signature that identifiers this AST file. SIGNATURE = 1, + /// Record code for the signature of the AST block. + AST_SIGNATURE, ---------------- dang wrote: > dang wrote: > > dexonsmith wrote: > > > These names and descriptions hard hard to differentiate. Is there another > > > way of naming these that will be more clear? > > > > > > (One idea I had is to create `CONTROL_BLOCK_HASH` and `AST_BLOCK_HASH` > > > and then `SIGNATURE` could just be their hash-combine, but maybe you have > > > another idea.) > > I kept the same hasher when computing both of these which mitigates the > > cost. I don't see the need for also emitting a hash for the control block, > > there are some optional records that are not in both the AST block and the > > control block anyway. > I also think that the `AST_BLOCK_HASH` and the `SIGNATURE` are enough > information already. In most cases you can deduce if the control block was > different by just checking if the signatures were different and the ASTs the > same. Thanks for updating to `AST_BLOCK_HASH`, I think that addressed the concern I had. However, having thought further about it, I don't think it makes sense to create a new record type. I suggest just having a convention of writing the hashes out in a particular order. - First record is the overall signature. - Second record is the AST block hash. - In the future, we can add a third record for the control block hash. In which case, the (single) record name should be something generic like `SIGNATURE` or `SHA1_HASH`. Note: I doubt there would be additional cost to computing the AST and control block hashes separately, but I agree there isn't a compelling use for the control block hash currently (although it would be nice to confirm properties like if the `CONTROL_BLOCK_HASH` were typically stable when sources changed). Note: I don't think the content of the "hashed" control block vs. unhashed control block have been carefully considered from first principles. At some point we might do that. I expect there are currently things that change the control block that would not invalidate it for importers. ================ Comment at: clang/lib/Serialization/ASTWriter.cpp:1061 - return Signature; + return std::make_pair(ASTSignature, Signature); } ---------------- dexonsmith wrote: > Alternately, you could change `ASTFileSignature` to have three fields: > - ControlBlockHash > - ASTBlockHash > - Signature > > (You could construct the signature by taking the SHA1 of the two block > hashes.) Have you considered changing `ASTFileSignature` to have multiple fields, one for the signature and another for the AST block hash? Working with pairs and tuples is always awkward. ================ Comment at: clang/lib/Serialization/ASTWriter.cpp:1084 PP.getHeaderSearchInfo().getHeaderSearchOpts().ModulesHashContent) { - Signature = createSignature(StringRef(Buffer.begin(), StartOfUnhashedControl)); + Record.append(ASTSignature.begin(), ASTSignature.end()); + Stream.EmitRecord(AST_SIGNATURE, Record); ---------------- dang wrote: > dang wrote: > > dexonsmith wrote: > > > The default abbreviation (7-VBR) isn't great for hashes. Given that we're > > > going to have two of these records, I suggest specifying a custom > > > abbreviation. > > > > > > If you were to split out a prep commit as suggested above to change > > > `SIGNATURE` to be an array of 8-bit values (instead of the current array > > > of 32-bit values), that would be a good opportunity to add an > > > abbreviation. > > I think encoding the signature as a blob makes the most sense if we make it > > an array of 20 8-bit values, which is what the hasher gives us. However it > > is not quite so simple, as it would need a big change in the way direct > > imports are serialized. The way it is currently done embeds multiple > > strings and signatures within a single unabbreviated record. I can replace > > the unabbreviated record with a sub-block that keeps track of a string > > table and a record for each import containing offsets inside the string > > table and blobs for the signature. > But I do think it is outside the scope of this change to do this. > However it is not quite so simple, as it would need a big change in the way > direct imports are serialized. The way it is currently done embeds multiple > strings and signatures within a single unabbreviated record. I don't understand why switching to 8-bit values would affect this. You can just write each 8-bit value as a separate field in the record, just like it currently does for 32-bit values. I.e., the current code seems to work just as well: ``` for (auto I : M.Signature) Record.push_back(I); ``` That said, having a single `IMPORTS` record is kind of ridiculous. It would be nice to clean that up (either as a follow-up or prep commit). I think your suggestion of having an `IMPORTS` sub-block is reasonable. A simpler approach would be just to emit `IMPORT` records. In the reader you can figure out implicitly how many there are (is the next record another `IMPORT`? If so keep reading). If you did this, I would suggest abbreviating the `IMPORT` record since it'll be well-structured. For `SIGNATURE` record abbreviation, I wasn't think a blob, just something simple like: ``` auto HashAbbrev = std::make_shared<BitCodeAbbrev>(); HashAbbrev->Add(BitCodeAbbrevOp(Signature)); for (int I = 0, E = 5; I != E; ++I) HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 32)); ``` If you switch to 8-bit: ``` auto HashAbbrev = std::make_shared<BitCodeAbbrev>(); HashAbbrev->Add(BitCodeAbbrevOp(Signature)); for (int I = 0, E = 20; I != E; ++I) HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 8)); ``` ================ Comment at: clang/lib/Serialization/ASTWriter.cpp:4729-4733 + StringRef Name = M.Kind == MK_PrebuiltModule || + M.Kind == MK_ExplicitModule || + M.Kind == MK_ImplicitModule + ? M.ModuleName + : M.FileName; ---------------- dang wrote: > dexonsmith wrote: > > Is there ever a reason to use `M.FileName` here, or is that always > > redundant with what's in the control block? I wonder if we can just remove > > this complexity. > > > The control block only tracks this information for direct imports although it > can maybe be extended to do keep track of this for all dependencies. This bit > of the table could then become and index inside the `IMPORTS` record in the > control block. Okay. That might be a nice follow-up or prep commit. If you're not interested in that tangent though I suggest adding an accessor `ModuleFile::isModule` that you can use here. ================ Comment at: clang/lib/Serialization/ASTWriterDecl.cpp:2437 if (DeclOffsets.size() == Index) - DeclOffsets.emplace_back(Loc, Offset); + DeclOffsets.emplace_back(Loc, Offset, ASTBlockRange.first); else if (DeclOffsets.size() < Index) { ---------------- dang wrote: > dexonsmith wrote: > > I suggest making `WriteDecls` relative to the start of the > > `DECLTYPES_BLOCK_ID`. That will make the block itself (more) > > self-contained, and also make the offsets smaller (and therefore cheaper to > > store in bitcode). > > > > I also don't think you need to actually emit that integer, it's just an > > implicit contract between the reader and writer. > Not sure what you mean by I don't actually need to emit that integer? If you > mean `ASTBlockRange.first` then I don't emit it but rather pass it to the > constructor of DeclOffset so that it performs the subtraction there. Since it > is a self contained type, it forces the Reader to provide a value for it, > making it harder to read the raw relative offset. Ah! I misread the code. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D80383/new/ https://reviews.llvm.org/D80383 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits