ioeric added a comment. In https://reviews.llvm.org/D47935#1129987, @sammccall wrote:
> Sorry for the delay on this change. There's a bunch of complexity in this > problem that I haven't seen how to slice through: > > 1. the signals needed seem like a weird fit for the Symbol*Signals structs > for some reason (maybe my misdesign) According to offline discussion, I added a structure `SymbolRelevanceContext` that captures per-query signals like proximity paths. Not sure about the name though. > 2. the inconsistency between how we do this for Sema and for Index results > has... only slightly good reasons The proximity scores for index and sema are now explicitly separated to make it easier to understand and debug. > 3. the URI vs filename thing is awkward > 4. with all this, the actual scoring still seems ad-hoc and is missing > important parts (main header, transitive includes) > > Not all your fault that the code reflects this, the problem is tangly. But > it's hard for me to reason about APIs or performance or layering. > > Looking at the last point (scoring model) because it seems the most > tractable. I think this is basically an edit distance problem? (We can call > the result "proximity", start at one, and multiply by <1, or call it > "distance" and start at 0 and add penalties, but it's equivalent). > 5. we're computing distances between files (glossing over URI-space vs > file-space) > 6. the roots are the main file, and maybe the matching header > 7. edits take us from a filepath to a related filepath: > - from a file to a header it `#include`s > - from a file to its parent directory > - from a parent directory to a child directory > - from a parent directory to a file in it > 8. the distance is the smallest sum-of-penalties for any path leading from > the root to the symbol > > What do you think of this model? > > ---- > > **If** the model seems reasonable, then it suggests an approach of building > a one-per-query data structure that computes the needed edit-distance > recursively, memoizing results. `SymbolRelevanceResults` could store the > symbol path and a pointer to the edit-distance machine, and for debugging the > machine would know how to describe its configuration. URI/path mapping > wouldn't be a performance concern (I think) if the memoization captured it. I like how this model addresses the proximity for src/ and include/ setup. I think we could start with something simple and iterate, although I agree that we should strike for a design that would be easy to replace the proximity algorithm in the future. > Let's chat offline? ================ Comment at: clangd/Quality.cpp:171 +/// "uri:///a/b/c" will be treated as /a/b/c +static float uriProximity(StringRef UX, StringRef UY) { + auto SchemeSplitX = UX.split(':'); ---------------- sammccall wrote: > ioeric wrote: > > sammccall wrote: > > > This doesn't look quite right to me. > > > We can tune the details later, but in practice this seems like it's > > > *very* hard to get zero proximity, which is our neutral score - you need > > > to be 18 directories away? > > > > > > FWIW, fozzie appears to give an additive boost proportional to 5-up, > > > where up is the number of directories from the context you have to > > > traverse up from the context to get to a parent of the symbol. (There's > > > no penalty for down-traversals probably for implementation reasons, this > > > should be smaller than the up-traversal penalty I think) > > The numbers are guessed... definitely happy to tune. > > > We can tune the details later, but in practice this seems like it's > > > *very* hard to get zero proximity, which is our neutral score - you need > > > to be 18 directories away? > > It's 18 directories away if one file is in an ancestor directories of the > > other (i.e. only traverse up or down). If you need to traverse up and down, > > the penalty for each directory is 0.1, which takes 10 directories (up+down, > > so > > 5 up in average). I think it's useful to make this distinction because I > > think it's more likely for a file to use a header if it's in the file path. > > > > > > I'm not sure if we should use zero as the neutral score. For example, if a > > codebase has deep directory structure, most scores are probably going to be > > small; conversely, most scores would be relatively big. I think relative > > scores are more useful. > > > > > (There's no penalty for down-traversals probably for implementation > > > reasons, this should be smaller than the up-traversal penalty I think) > > Why do you think down-traversal should take less penalty? > > > > If you need to traverse up and down, the penalty for each directory is 0.1, > > which takes 10 directories (up+down, so 5 up in average). > I think you've halved twice there - it still seems to be 10, which is a lot. > > > I'm not sure if we should use zero as the neutral score. > Well, zero is currently the neutral score, and this patch doesn't change it > :-) > I think starting at 1 for the current file and *multiplying* by p<1 to apply > penalties should give a reasonable 0-1 score that's relatively sane even for > codebases of different sizes. Happy to have a different model, but you need > to explain/implement how it combines with other signals. > > > Why do you think down-traversal should take less penalty? > Intuitively, because subprojects are more closely related than superprojects. > But this didn't occur to me until someone mentioned it, we should check with > Matei and Alexander. > I think you've halved twice there - it still seems to be 10, which is a lot. OK you are right. It would behave badly when there are many ups and only one down. > I think starting at 1 for the current file and *multiplying* by p<1 to apply > penalties should give a reasonable 0-1 score that's relatively sane even for > codebases of different sizes. Sounds good. Picked p=0.7 which seems to give reasonable scores. Repository: rCTE Clang Tools Extra https://reviews.llvm.org/D47935 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits