[PATCH] D114966: [clang][deps] Split stat and file content caches

Jan Svoboda via Phabricator via cfe-commits Tue, 21 Dec 2021 07:56:04 -0800

jansvoboda11 added inline comments.


================
Comment at: 
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:359-360
+  findSharedEntryByUID(llvm::vfs::Status Stat) const {
+    return SharedCache.getShard(Stat.getName())
+        .findEntryByUID(Stat.getUniqueID());
+  }
----------------
dexonsmith wrote:
> This doesn't look right to me. UIDs should be sharded independently of the 
> filename they happen to have been reached by; otherwise each filename shard 
> is developing its own idea of what each UID means. Since UID distribution is 
> not uniform, probably the UID shard should be chosen by 
> `hash_value(Stat.getUniqueID()) % NumShards`.
> 
> You could use the same sets of shards for UIDMap and FilenameMap, but since 
> they're independent I'd probably do:
> - UIDCache: sharded by UID: UIDMap and BumpPtrAllocator for entries (and 
> likely anything else tied to content)
> - FilenameCache: sharded by filename: FilenameMap (and perhaps other things 
> tied to filename?)
Hmm, skimming through `RealFileSystem::status`, I saw that it's calling 
`sys::fs::status` with "follow symlinks" enabled. It made sense to me that the 
name stored in `llvm::vfs::Status` would match that and refer the fully 
resolved target entry, not the symlink itself. Seeing as this is not the case, 
I agree the UID itself should be used for choosing the shard.


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:23-24
+  auto MaybeFile = getUnderlyingFS().openFileForRead(Filename);
   if (!MaybeFile)
     return MaybeFile.getError();
   auto File = std::move(*MaybeFile);
----------------
dexonsmith wrote:
> In what circumstances should this return a cached-error TentativeEntry? Any?
I don't think the distinction matters at this level. Whether failures should be 
cached is a decision that's being made one level up.

I personally prefer having `TentativeEntry` to be "non-fallible" and explicitly 
wrapping the whole thing in `ErrorOr`. That makes it easier for others to know 
what they are working with (i.e. this object cannot represent an error state). 
Eventually, I think this would make sense for the caches and 
`CachedFileSystemEntry` too.


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:28-29
   auto MaybeStat = File->status();
   if (!MaybeStat)
     return MaybeStat.getError();
   auto Stat = std::move(*MaybeStat);
----------------
dexonsmith wrote:
> Since the file was opened, should we return cached-error TentativeEntry here, 
> rather than an error?
Let's return an error here and create error `CachedFileSystemEntry` one level 
up.


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:275
+  const auto &Entry1 = getOrEmplaceSharedEntryForUID(std::move(TEntry));
+  const auto &Entry2 = getOrInsertSharedEntryForFilename(Filename, Entry1);
+  return insertLocalEntryForFilename(Filename, Entry2);
----------------
dexonsmith wrote:
> jansvoboda11 wrote:
> > I'm not sure these should be separate. We could end up in situation where 
> > the Filename map contains different entry than the UID map for the same 
> > directory entry. I'm tempted to merge these functions into one and perform 
> > the updates in a single critical section...
> > I'm not sure these should be separate. We could end up in situation where 
> > the Filename map contains different entry than the UID map for the same 
> > directory entry.
> 
> I'm also sure precisely what you mean by "for the same directory entry" in 
> this context; and I don't see what's wrong with the situation I think you're 
> outlining.
> 
> > I'm tempted to merge these functions into one and perform the updates in a 
> > single critical section...
> 
> A single critical section for setting UID and filename at the same time would 
> be hard to get right (and efficient), since UIDs have aliases through other 
> filenames due to different directory paths (dir/../x.h vs x.h) and filesystem 
> links (hard and symbolic).
> 
> Here's the race that I think(?) you're worried about:
> 
> - Worker1 does a tentative stat of "x.h", finds a UID that isn't mapped 
> (UIDX1, but it's ignored...).
> - Worker2 does a tentative stat of "x.h", finds a UID that isn't mapped 
> (UIDX1, but it's ignored...).
> - Worker1 opens "x.h", finds ContentX1+StatX1 (with UIDX1), saves mapping 
> UIDX1 -> ContentX1+StatX1.
> - "x.h" changes.
> - Worker2 opens "x.h", finds ContentX2+StatX2 (with UIDX2), saves mapping 
> UIDX2 -> ContentX2+StatX2.
> - Worker2 saves mapping "x.h" -> ContentX2+StatX2.
> - Both workers move forward with "x.h" -> ContentX2+StatX2.
> 
> IIUC, you're concerned that the mapping UIDX1 -> ContentX1+StatX1 was saved. 
> The side effect is that if a future tentative stat of (e.g.) "y.h" returns 
> UIDX1, then "y.h" will be mapped to ContentX1+StatX1. Is this what concerns 
> you? Why? (Is it something else?)
> 
> The concern I have is that some filesystems recycle UIDs (maybe "x.h" *was* a 
> symbolic link to "y.h" and then became its own file... or maybe "x.h" and 
> "y.h" were hard links... or maybe "y.h" is just a new file!). But that's a 
> problem with using UIDs to detect equivalent filesystem links / content in 
> general. I don't see any reason to be more concerned here than elsewhere, and 
> to avoid depending on UID we'd need a pretty different design (e.g., lazily 
> detect and model directory structure and symbolic links).
Yes, that's the kind of scenario I was thinking about. I'm not concerned about 
consequences of that side effect, I just don't like storing garbage that will 
most likely never be used/referenced again and might be confusing during 
debugging.

I agree with you on UID recycling...


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114966/new/

https://reviews.llvm.org/D114966

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D114966: [clang][deps] Split stat and file content caches

Reply via email to