ioeric added a comment.

In https://reviews.llvm.org/D47935#1129987, @sammccall wrote:

> Sorry for the delay on this change. There's a bunch of complexity in this 
> problem that I haven't seen how to slice through:
>
> 1. the signals needed seem like a weird fit for the Symbol*Signals structs 
> for some reason (maybe my misdesign)


According to offline discussion, I added a structure  `SymbolRelevanceContext`  
that captures per-query signals like proximity paths. Not sure about the name 
though.

> 2. the inconsistency between how we do this for Sema and for Index results 
> has... only slightly good reasons

The proximity scores for index and sema are now explicitly separated to make it 
easier to understand and debug.

> 3. the URI vs filename thing is awkward
> 4. with all this, the actual scoring still seems ad-hoc and is missing 
> important parts (main header, transitive includes)
> 
>   Not all your fault that the code reflects this, the problem is tangly. But 
> it's hard for me to reason about APIs or performance or layering.
> 
>   Looking at the last point (scoring model) because it seems the most 
> tractable. I think this is basically an edit distance problem? (We can call 
> the result "proximity", start at one, and multiply by <1, or call it 
> "distance" and start at 0 and add penalties, but it's equivalent).
> 5. we're computing distances between files (glossing over URI-space vs 
> file-space)
> 6. the roots are the main file, and maybe the matching header
> 7. edits take us from a filepath to a related filepath:
>   - from a file to a header it `#include`s
>   - from a file to its parent directory
>   - from a parent directory to a child directory
>   - from a parent directory to a file in it
> 8. the distance is the smallest sum-of-penalties for any path leading from 
> the root to the symbol
> 
>   What do you think of this model?
> 
>   ----
> 
>   **If** the model seems reasonable, then it suggests an approach of building 
> a one-per-query data structure that computes the needed edit-distance 
> recursively, memoizing results. `SymbolRelevanceResults` could store the 
> symbol path and a pointer to the edit-distance machine, and for debugging the 
> machine would know how to describe its configuration. URI/path mapping 
> wouldn't be a performance concern (I think) if the memoization captured it.

I like how this model addresses the proximity for src/ and include/ setup. I 
think we could start with something simple and iterate, although I agree that 
we should strike for a design that would be easy to replace the proximity 
algorithm in the future.

> Let's chat offline?



================
Comment at: clangd/Quality.cpp:171
+/// "uri:///a/b/c" will be treated as /a/b/c
+static float uriProximity(StringRef UX, StringRef UY) {
+  auto SchemeSplitX = UX.split(':');
----------------
sammccall wrote:
> ioeric wrote:
> > sammccall wrote:
> > > This doesn't look quite right to me.
> > > We can tune the details later, but in practice this seems like it's 
> > > *very* hard to get zero proximity, which is our neutral score - you need 
> > > to be 18 directories away?
> > > 
> > > FWIW, fozzie appears to give an additive boost proportional to 5-up, 
> > > where up is the number of directories from the context you have to 
> > > traverse up from the context to get to a parent of the symbol. (There's 
> > > no penalty for down-traversals probably for implementation reasons, this 
> > > should be smaller than the up-traversal penalty I think)
> > The numbers are guessed... definitely happy to tune.
> > > We can tune the details later, but in practice this seems like it's 
> > > *very* hard to get zero proximity, which is our neutral score - you need 
> > > to be 18 directories away?
> > It's 18 directories away if one file is in an ancestor directories of the 
> > other (i.e. only traverse up or down). If you need to traverse up and down, 
> > the penalty for each directory is 0.1, which takes 10 directories (up+down, 
> > so 
> > 5 up in average). I think it's useful to make this distinction because I 
> > think it's more likely for a file to use a header if it's in the file path. 
> >  
> > 
> > I'm not sure if we should use zero as the neutral score. For example, if a 
> > codebase has deep directory structure, most scores are probably going to be 
> > small; conversely, most scores would be relatively big. I think relative 
> > scores are more useful.
> > 
> > > (There's no penalty for down-traversals probably for implementation 
> > > reasons, this should be smaller than the up-traversal penalty I think)
> > Why do you think down-traversal should take less penalty?
> > 
> > If you need to traverse up and down, the penalty for each directory is 0.1, 
> > which takes 10 directories (up+down, so 5 up in average).
> I think you've halved twice there - it still seems to be 10, which is a lot.
> 
> > I'm not sure if we should use zero as the neutral score.
> Well, zero is currently the neutral score, and this patch doesn't change it 
> :-)
> I think starting at 1 for the current file and *multiplying* by p<1 to apply 
> penalties should give a reasonable 0-1 score that's relatively sane even for 
> codebases of different sizes. Happy to have a different model, but you need 
> to explain/implement how it combines with other signals.
> 
> > Why do you think down-traversal should take less penalty?
> Intuitively, because subprojects are more closely related than superprojects. 
> But this didn't occur to me until someone mentioned it, we should check with 
> Matei and Alexander.
> I think you've halved twice there - it still seems to be 10, which is a lot.
OK you are right. It would behave badly when there are many ups and only one 
down.

> I think starting at 1 for the current file and *multiplying* by p<1 to apply 
> penalties should give a reasonable 0-1 score that's relatively sane even for 
> codebases of different sizes. 
Sounds good. Picked p=0.7 which seems to give reasonable scores.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D47935



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to