kbobyrev added inline comments.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+ // symbol of the identifier.
+ if (!FoundFirstSymbol) {
+ FoundFirstSymbol = true;
----------------
ioeric wrote:
> Could this be pulled out of the loop? I think what we want is just
> `LowercaseIdentifier[0]` right?
>
> I'd probably also pulled that into a function, as the function body is
> getting larger.
Same as elsewhere, if we have `__builtin_whatever` the it's not actually the
first symbol of the lowercase identifier.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87
+
+ Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER,
0}};
+ const auto Bigram = Token(Token::Kind::Trigram, Chars.data());
----------------
ioeric wrote:
> I think we could be more restrictive on bigram generation. I think a bigram
> prefix of identifier and a bigram prefix of the HEAD substring should work
> pretty well in practice. For example, for `StringStartsWith`, you would have
> `st$` and `ss$` (prefix of "SSW").
>
> WDYT?
Good idea!
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
// Apply fuzzy matching text segmentation.
----------------
ioeric wrote:
> It seems to me that what we need for short queries is simply:
> ```
> if (Query.empty()) {
> // return empty token
> }
> if (Query.size() == 1) return {Query + "$$"};
> if (Query.size() == 2) return {Query + "$"};
>
> // Longer queries...
> ```
> ?
That would mean that we expect the query to be "valid", i.e. only consist of
letters and digits. My concern is about what happens if we have `"u_"` or
something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually
still have to identify the first valid symbol for the trigram, process the
string (trim it, etc) which looks very similar to what FuzzyMatching
`calculateRoles` does.
The current approach is rather straightforward and generic, but I can try to
change it if you want. My biggest concern is fighting some corner cases and
ensuring that the query is "right" on the user (index) side, which might turn
out to be more code and ensuring that the "state" is valid throughout the
pipeline.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74
+/// For short queries (Query contains less than 3 letters and digits) this
+/// returns a single trigram with all valid symbols.
std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);
----------------
ioeric wrote:
> I'm not quite sure what this means. Could you elaborate?
Added an example and reflected in the other comment.
https://reviews.llvm.org/D50517
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits