On Thu, Jul 11, 2024 at 04:16:25PM -0400, Frank Ch. Eigler wrote:
> Hi, Omar -
>
> Thanks. I wish this sort of amazing kludge weren't necessary, but
> given that it helps, so be it.
>
> I'd like to commend you on the effort needed to match your code up
> with the stylistic idiosyncracies of the debuginfod c++ code. It
> looks just like the other code. My only reservation is the schema
> change. Reindexing some of our large repos takes WEEKS. Here's a
> possible way to avoid that:
>
> - Preserve the current BUILDID schema id and tables as is.
>
> - Add a new table for the intra-archive coordinates. Think of it
>   like a cache. Index it with archive-file-name and
>   content-file-name (source0, source1 IIRC).
>
> - During a fetch out of the archive-file-name, check whether the new
>   table has a record for that file. If yes, cache hit, go through to
>   the xz extraction stuff, winner!
>
> - If not, try the is_seekable() check on the archive. If it is true,
>   we have an archive that should be seekable, but we don't have it
>   in the intra-archive cache. So take this opportunity to index that
>   archive (only), populate the cache table, as the archive is being
>   extracted. (No need to use the new cache data then, since we've
>   just paid the effort of decompressing/reading the whole thing
>   already.)
>
> - Need to confirm that during grooming, a disappeared
>   archive-file-name would also drop the corresponding intra-archive
>   rows.
>
> - Heck, during grooming or scanning, maybe the tool could preemptively
>   do the intra-archive coordinate cache thing if it's not already
>   done, just to defeat the latency of doing it on demand.
>
> What do you think?
Hi, Frank,

I didn't realize how expensive reindexing could be; thank you for
pointing that out. Your proposal makes sense to me, so I'll rework this.

Thanks,
Omar
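
For reference, the fetch and grooming flow proposed in the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions, not debuginfod code: the cache table is modelled as an in-memory std::map instead of a real database table, and every name here (Coords, Archive, CoordCache, Path, fetch, groom) is hypothetical. The `seekable` flag stands in for the result of the is_seekable() check mentioned above.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one row of the proposed intra-archive table.
struct Coords { long offset; long size; };

// Hypothetical stand-in for an archive file; `seekable` models the
// outcome of the is_seekable() check from the proposal.
struct Archive {
  std::string name;
  bool seekable;
  std::vector<std::pair<std::string, Coords>> members;
};

// Cache keyed by (archive-file-name, content-file-name).
using CacheKey = std::pair<std::string, std::string>;
using CoordCache = std::map<CacheKey, Coords>;

enum class Path { CacheHit, FullScanIndexed, FullScan };

// Decide how a fetch of `member` out of `ar` would proceed.
Path fetch(const Archive& ar, const std::string& member, CoordCache& cache)
{
  // Cache hit: the coordinates are known, so seek straight to the member.
  if (cache.count({ar.name, member}))
    return Path::CacheHit;

  // Miss on a seekable archive: extract the slow way this once, but
  // record coordinates for every member of this archive along the way.
  if (ar.seekable) {
    for (const auto& m : ar.members)
      cache[{ar.name, m.first}] = m.second;
    return Path::FullScanIndexed;
  }

  // Not seekable: plain full extraction, nothing to cache.
  return Path::FullScan;
}

// Grooming: drop all cache rows belonging to an archive that disappeared.
void groom(const std::string& archive_name, CoordCache& cache)
{
  for (auto it = cache.begin(); it != cache.end(); ) {
    if (it->first.first == archive_name)
      it = cache.erase(it);
    else
      ++it;
  }
}
```

In this sketch the first fetch from a seekable archive pays for a full read and fills the cache, later fetches from the same archive take the cache-hit path, and the existing BUILDID tables are never touched, which is what avoids the reindex.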