sammccall added inline comments.
================ Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:25 + Data.size() == 3 && "Trigram should contain three characters."); + switch (TokenKind) { + case Kind::Trigram: ---------------- specializing the hash function looks like premature optimization (or possibly pessimisation) One the one hand, it's a perfect hash! On the other hand, when used with a power-of-two sized hashtable (particularly 256 is a fun example), all the trigrams are going to hash into the alphanumeric buckets, and other buckets will be empty! DenseMap is a power-of-two hashtable! ================ Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:30 + default: + Hash = std::hash<std::string>{}(Data); + break; ---------------- llvm::hash_combine(static_cast<int>(TokenKind),Data) ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:10 +// +// Tokens are keys for inverted index which are mapped to the +// corresponding posting lists. Token objects represent a characteristic ---------------- nit: swap these sentences around? conceptual explanation first, how they're used later? could even give an example: ``` // The symbol std::cout might have the tokens: // * Scope "std::" // * Trigram "cou" // * Trigram "out" // * Type "std::ostream" ``` ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:28 + +/// Hashable Token, which represents a search token primitive, such as +/// trigram for fuzzy search on unqualified symbol names. ---------------- nit: "Hashable" isn't really important here. nit: it doesn't *represent* a search primitive, it *is* a search primitive. Maybe ``` /// A Token represents an attribute of a symbol, such as a particular /// trigram present in the name (used for fuzzy search). ``` ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:33 +/// constructing complex iterator trees. +class Token { +public: ---------------- As discussed offline: think we misunderstood each other about the separate struct. I rather meant *this* class could just be a struct. ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:36 + /// Kind specifies Token type which defines semantics for the internal + /// representation (Data field), examples of such types are: + /// ---------------- don't give examples of the enum values immediately before defining the enum values :-) Just put the comments on the values themselves, and omit the ones that aren't here yet (or leave a TODO) ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:47 + /// + /// * Trigram: 3 bytes containing trigram characters + /// * Scope: full scope name, such as "foo::bar::baz::" or "" (global scope) ---------------- move these to enum values ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:58 + + Token(llvm::StringRef Data, Kind TokenKind); + ---------------- nit: probably reverse the order here since Token(TRIGRAM, "str") reads more naturally than Token("str", TRIGRAM) - context is present to understand the data ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:60 + + // Returns precomputed hash. + size_t hash(const Token &T) const { return Hash; } ---------------- can we just use the standard hash_value() function? Why this accessor? ================ Comment at: clang-tools-extra/clangd/index/dex/Token.h:79 + /// Precomputed hash which is used as a key for inverted index. + size_t Hash; + Kind TokenKind; ---------------- as discussed offline: this precomputed hash looks like premature optimization to me. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:35 + llvm::DenseSet<Token> UniqueTrigrams; + std::vector<Token> Trigrams; + ---------------- just iterate at the end and collect them? order shouldn't matter here. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:37 + + // Extract trigrams consisting of first characters of tokens sorted bytoken + // positions. Trigram generator is allowed to skip 1 word between each token. ---------------- ah, you're also handling this by cases here. This doesn't seem necessary. In fact I think there's a fairly simple way to express this that works naturally with the segmentation structure produced by fuzzymatch (so doesn't need a version of segmentIdentifier to wrap it). But it's taking me too long to work out the details, will write a followup comment or discuss offline. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:46 + // this case. + for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end(); + ++FirstSegment) { ---------------- please use short names like I, S1, etc for iterators and loop counters. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:90 + ++SecondSegment) { + for (size_t FirstSegmentIndex = 0; + FirstSegmentIndex < FirstSegment->size(); ++FirstSegmentIndex) { ---------------- There seems to be a lot of overlap between this one and the third loop. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:10 +// +// T +// ---------------- This would be a great place to describe *how* trigrams are used by fuzzy matching (the idea that you generate trigrams T for the identifier, trigrams Q for the query, and require Q ⊆ T) ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:56 + +/// Returns list of unique fuzzy-search trigrams from unqualified symbol. +/// ---------------- Per the examples above, segmentIdentifier doesn't case-fold its output, but all the inputs shown here are lowercase. Is the caller expected to do that in between calling here, or does generateTrigrams case-fold as part of generation? (please doc) ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:58 +/// +/// Runs trigram generation for fuzzy-search index on segments produced by +/// segmentIdentifier(SymbolName); ---------------- seem to have two summaries here, drop or merge with previous? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:64 +/// are 3-char suffixes of paths through the fuzzy matching automaton. There are +/// four classes of extracted trigrams: +/// ---------------- This is true, but I think treating these as four cases doesn't give a clear understanding of how they're related. What about: ``` Trigrams can start at any character in the input. Then we can choose to move to the next character, move to the start of the next segment, or skip over a segment. For input [abstract factory producer impl]: * the simplest trigrams are consecutive letters from one segment (abs, bst, ..., mpl) * but may cross two segments (abf, bpr) * or even three (api, ypi) https://reviews.llvm.org/D49591 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits