ioeric added inline comments.
================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:33 + +void insertChars(DenseSet<Token> &UniqueTrigrams, std::string Chars) { + const auto Trigram = Token(Token::Kind::Trigram, Chars); ---------------- This is probably neater as a lambda in `generateIdentifierTrigrams `, e.g. ``` auto add = [&](std::string Chars) { trigrams.insert(Token(Token::Kind::Trigram, Chars)); } ... add("abc"); ``` ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:95 + + // Generate trigrams only if the first character is the segment start. + // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$". ---------------- Do you mean *bigrams* here? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:96 + // Generate trigrams only if the first character is the segment start. + // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$". + if (Roles[I] == Head) { ---------------- Should we also check `Roles[J] == Head `? As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`). ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:119 -// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short -// inputs (total segment length <3 characters). +// FIXME(kbobyrev): Correctly handle cases like "u_". In this case, we would +// like to match "u" only. This might also work better with word bounds for ---------------- Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identiifers like "u_*". > In this case, we would like to match "u" only. Why? If user types "_", I would expect it to be a meaning filter. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:141 + // sufficient, generate a single incomplete trigram for query. + if (ValidSymbolsCount < 3) { + std::string Symbols = {{END_MARKER, END_MARKER, END_MARKER}}; ---------------- For queries like `__` or `_x`, I think we can generate tokens "__$" or `_x$`. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:49 +/// result. Incomplete trigrams contain END_MARKER ('$') at the end. The result +/// also contains bigrams which are either two subsequent Head symbols or Head +/// and subsequent Tail. This means that for short queries Dex index would do ---------------- nit: the term "symbol" here is confusing. Do you mean "character"? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:59 +/// * Bigram with the first two symbols (if availible), same logic applies. +/// /// Note: the returned list of trigrams does not have duplicates, if any trigram ---------------- I think the comment can be simplified a bit: ``` This also generates incomplete trigrams for short query scenarios: * Empty trigram: "$$$" * Unigram: the first character of the identifier. * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists). ``` https://reviews.llvm.org/D50517 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits