[PATCH] D80383: Add AST_SIGNATURE record to unhashed control block of PCM files

Duncan P. N. Exon Smith via Phabricator via cfe-commits Wed, 03 Jun 2020 12:40:26 -0700

dexonsmith added a comment.

In D80383#2068596 <https://reviews.llvm.org/D80383#2068596>, @dang wrote:


> I made the decision to make all relative offsets relative to the start of the 
> AST block rather than their own sub-block or a neighboring block, in order to 
> prevent adding complexity in `ASTReader`. My reasoning that it would be best 
> if the reader didn't have to keep track of a bunch of offsets inside the 
> bitcode at all times, but that it could just remember the offset of the AST 
> block once and always use that for relative offsets. I agree that 
> conceptually making the sub-blocks relocatable too would have been nice, but 
> I figured it wasn't worth the extra complexity in the reader.


I think the extra complexity is small and worth it. The default abbreviation is 
7-VBR, which is cheaper for small offsets. I doubt we'll bother to come back 
and fix this later, so might as well get it right now.



================
Comment at: clang/include/clang/Basic/Module.h:60
 
+  static Expected<ASTFileSignature> Create(StringRef Bytes) {
+    if (Bytes.size() != 20)
----------------
Can you make this `create`? I think new code we prefer `lowerCamelCase` for 
method names.


================
Comment at: clang/include/clang/Basic/Module.h:66-72
+    for (unsigned I = 0; I != 5; ++I) {
+      const unsigned BitOff = I * 4;
+      Signature[I] = ((uint32_t)Bytes[BitOff + 0] << 24) |
+                     ((uint32_t)Bytes[BitOff + 1] << 16) |
+                     ((uint32_t)Bytes[BitOff + 2] << 8) |
+                     ((uint32_t)Bytes[BitOff + 3]);
+    }
----------------
This code looks correct, and it's better factored out like this, but I still 
don't understand why we convert to 32-bit numbers. Have you looked at the code 
simplification from storing this as an array of 8-bit numbers?


================
Comment at: clang/include/clang/Serialization/ASTBitCodes.h:396-400
       /// Record code for the signature that identifiers this AST file.
       SIGNATURE = 1,
 
+      /// Record code for the signature of the AST block.
+      AST_SIGNATURE,
----------------
dang wrote:
> dang wrote:
> > dexonsmith wrote:
> > > These names and descriptions hard hard to differentiate. Is there another 
> > > way of naming these that will be more clear?
> > > 
> > > (One idea I had is to create `CONTROL_BLOCK_HASH` and `AST_BLOCK_HASH` 
> > > and then `SIGNATURE` could just be their hash-combine, but maybe you have 
> > > another idea.)
> > I kept the same hasher when computing both of these which mitigates the 
> > cost. I don't see the need for also emitting a hash for the control block, 
> > there are some optional records that are not in both the AST block and the 
> > control block anyway.
> I also think that the `AST_BLOCK_HASH` and the `SIGNATURE` are enough 
> information already. In most cases you can deduce if the control block was 
> different by just checking if the signatures were different and the ASTs the 
> same.
Thanks for updating to `AST_BLOCK_HASH`, I think that addressed the concern I 
had.

However, having thought further about it, I don't think it makes sense to 
create a new record type. I suggest just having a convention of writing the 
hashes out in a particular order.
- First record is the overall signature.
- Second record is the AST block hash.
- In the future, we can add a third record for the control block hash.

In which case, the (single) record name should be something generic like 
`SIGNATURE` or `SHA1_HASH`.

Note: I doubt there would be additional cost to computing the AST and control 
block hashes separately, but I agree there isn't a compelling use for the 
control block hash currently (although it would be nice to confirm properties 
like if the `CONTROL_BLOCK_HASH` were typically stable when sources changed).

Note: I don't think the content of the "hashed" control block vs. unhashed 
control block have been carefully considered from first principles. At some 
point we might do that. I expect there are currently things that change the 
control block that would not invalidate it for importers.


================
Comment at: clang/lib/Serialization/ASTWriter.cpp:1061
 
-  return Signature;
+  return std::make_pair(ASTSignature, Signature);
 }
----------------
dexonsmith wrote:
> Alternately, you could change `ASTFileSignature` to have three fields:
> - ControlBlockHash
> - ASTBlockHash
> - Signature
> 
> (You could construct the signature by taking the SHA1 of the two block 
> hashes.)
Have you considered changing `ASTFileSignature` to have multiple fields, one 
for the signature and another for the AST block hash? Working with pairs and 
tuples is always awkward.


================
Comment at: clang/lib/Serialization/ASTWriter.cpp:1084
       PP.getHeaderSearchInfo().getHeaderSearchOpts().ModulesHashContent) {
-    Signature = createSignature(StringRef(Buffer.begin(), 
StartOfUnhashedControl));
+    Record.append(ASTSignature.begin(), ASTSignature.end());
+    Stream.EmitRecord(AST_SIGNATURE, Record);
----------------
dang wrote:
> dang wrote:
> > dexonsmith wrote:
> > > The default abbreviation (7-VBR) isn't great for hashes. Given that we're 
> > > going to have two of these records, I suggest specifying a custom 
> > > abbreviation.
> > > 
> > > If you were to split out a prep commit as suggested above to change 
> > > `SIGNATURE` to be an array of 8-bit values (instead of the current array 
> > > of 32-bit values), that would be a good opportunity to add an 
> > > abbreviation.
> > I think encoding the signature as a blob makes the most sense if we make it 
> > an array of 20 8-bit values, which is what the hasher gives us. However it 
> > is not quite so simple, as it would need a big change in the way direct 
> > imports are serialized. The way it is currently done embeds multiple 
> > strings and signatures within a single unabbreviated record. I can replace 
> > the unabbreviated record with a sub-block that keeps track of a string 
> > table and a record for each import containing offsets inside the string 
> > table and blobs for the signature.
> But I do think it is outside the scope of this change to do this.
> However it is not quite so simple, as it would need a big change in the way 
> direct imports are serialized. The way it is currently done embeds multiple 
> strings and signatures within a single unabbreviated record.

I don't understand why switching to 8-bit values would affect this. You can 
just write each 8-bit value as a separate field in the record, just like it 
currently does for 32-bit values. I.e., the current code seems to work just as 
well:
```
      for (auto I : M.Signature)
        Record.push_back(I);
```

That said, having a single `IMPORTS` record is kind of ridiculous. It would be 
nice to clean that up (either as a follow-up or prep commit). I think your 
suggestion of having an `IMPORTS` sub-block is reasonable. A simpler approach 
would be just to emit `IMPORT` records. In the reader you can figure out 
implicitly how many there are (is the next record another `IMPORT`? If so keep 
reading). If you did this, I would suggest abbreviating the `IMPORT` record 
since it'll be well-structured.

For `SIGNATURE` record abbreviation, I wasn't think a blob, just something 
simple like:
```
  auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
  HashAbbrev->Add(BitCodeAbbrevOp(Signature));
  for (int I = 0, E = 5; I != E; ++I)
    HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 32));
```

If you switch to 8-bit:
```
  auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
  HashAbbrev->Add(BitCodeAbbrevOp(Signature));
  for (int I = 0, E = 20; I != E; ++I)
    HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 8));
```


================
Comment at: clang/lib/Serialization/ASTWriter.cpp:4729-4733
+        StringRef Name = M.Kind == MK_PrebuiltModule ||
+                                 M.Kind == MK_ExplicitModule ||
+                                 M.Kind == MK_ImplicitModule
+                             ? M.ModuleName
+                             : M.FileName;
----------------
dang wrote:
> dexonsmith wrote:
> > Is there ever a reason to use `M.FileName` here, or is that always 
> > redundant with what's in the control block? I wonder if we can just remove 
> > this complexity.
> > 
> The control block only tracks this information for direct imports although it 
> can maybe be extended to do keep track of this for all dependencies. This bit 
> of the table could then become and index inside the `IMPORTS` record in the 
> control block.
Okay. That might be a nice follow-up or prep commit. If you're not interested 
in that tangent though I suggest adding an accessor `ModuleFile::isModule` that 
you can use here.


================
Comment at: clang/lib/Serialization/ASTWriterDecl.cpp:2437
   if (DeclOffsets.size() == Index)
-    DeclOffsets.emplace_back(Loc, Offset);
+    DeclOffsets.emplace_back(Loc, Offset, ASTBlockRange.first);
   else if (DeclOffsets.size() < Index) {
----------------
dang wrote:
> dexonsmith wrote:
> > I suggest making `WriteDecls` relative to the start of the 
> > `DECLTYPES_BLOCK_ID`. That will make the block itself (more) 
> > self-contained, and also make the offsets smaller (and therefore cheaper to 
> > store in bitcode).
> > 
> > I also don't think you need to actually emit that integer, it's just an 
> > implicit contract between the reader and writer.
> Not sure what you mean by I don't actually need to emit that integer? If you 
> mean `ASTBlockRange.first` then I don't emit it but rather pass it to the 
> constructor of DeclOffset so that it performs the subtraction there. Since it 
> is a self contained type, it forces the Reader to provide a value for it, 
> making it harder to read the raw relative offset.
Ah! I misread the code.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80383/new/

https://reviews.llvm.org/D80383



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D80383: Add AST_SIGNATURE record to unhashed control block of PCM files

Reply via email to