sammccall accepted this revision. sammccall added a comment. This revision is now accepted and ready to land.
LG, thank you! Bunch of nits, up to you what you want to do with them. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:10 +// LR parsers are bottom-up parsers -- they scan the input from left to right, +// and find the right-hand side of a production rule (called handle), and +// replace the handle with the nonterminal defined by the production rule. ---------------- I think it'd be useful to be a little more concrete here than "find"... ``` and collect the right-hand side of a production rule (called handle) on top of the stack, then replace (reduce) the handle with the nonterminal defined by the production rule. ``` ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:42 +// An LR item -- a grammar rule with a dot at some position of the body. +// e.g. a production rule A := xy yields 3 items: +// A := .xy ---------------- The lack of spaces on the RHS is inconsistent with the debug output, and any examples with real grammars (since we need spaces to split up words). Can we use `A := x y` instead? Or possibly `A := X Y` which makes the boundaries more obvious when it appears in-line with text. (I don't have an opinion about `A := X . Y` vs `A := X.Y` (dot *replaces* space), but again we should pick one consistent with the debug output. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:52 +public: + Item(RuleID ID, uint8_t DotPos, uint8_t RuleLength) + : RID(ID), DotPos(DotPos), RuleLength(RuleLength) {} ---------------- I think the calls to these constructors could be a little clearer as factories that describe their purpose: e.g. ``` static Item start(RuleID ID, const Grammar &G); static Item sentinel(uint8_t ID); Item Item::advance() const; ``` I think that would cover everything? ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:85 +}; +using ItemSet = std::vector<Item>; + ---------------- this alias is used only once in the header, for Items - it doesn't really introduce a new public concept that's distinct from that field. I think it would be clearer to write it as std::vector<Item> there, and move the alias to the implementation file. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101 + std::string dump(const Grammar &G) const; + struct LessItem { + LessItem(const Grammar &G) : G(G) {} ---------------- call this "SortByNextSymbol"? or something? As it is it's hard at a glance to understand why we have operator< and LessItem ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101 + std::string dump(const Grammar &G) const; + struct LessItem { + LessItem(const Grammar &G) : G(G) {} ---------------- sammccall wrote: > call this "SortByNextSymbol"? or something? > > As it is it's hard at a glance to understand why we have operator< and > LessItem This comparator is pure implementation detail, and can go in the cpp file. (Still fine to reference in the comment, but I think "in a canonical order" is enough) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:110 + // Item with a trailing dot is considered minimal. + return L.hasNext() < R.hasNext(); + } ---------------- this seems to compare items equal if both are dot-at-the-end. In general the cascade/priority/completeness of comparisons isn't really easy to spot. Consider structuring like: ``` if (L.hasNext() && R.hasNext() && R.Lnext(G) != R.next(G)) return L.next(G) < R.next(G); if (L.hasNext() != R.hasNext()) return L.hasNext() < R.hasNext(); // trailing dot is minimum return L < R; ``` ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:131 + + // Constructs an LR(0) automaton. + static LRGraph buildLR0(const Grammar &); ---------------- IIUC the data structure used here is inherently LR0, rather than the choice of how we build it. (Vs in the LRTable where we could come up with data representing a full LR(1) parser without changing the structure). So I think this function doesn't need LR0 in its name (but the class could be LR0Graph if you like)? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:36 + llvm::DenseSet<Item> Closure = {IS.begin(), IS.end()}; + while (!IS.empty()) { + auto ProcessingItem = std::move(IS.back()); ---------------- This took me a while to understand: you're reusing the ItemSet (passed by value) as the work queue (well, stack) and it's already initialized to the right elements. This is clever, but please add a comment. Maybe even rename IS->Queue ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:37 + while (!IS.empty()) { + auto ProcessingItem = std::move(IS.back()); + IS.pop_back(); ---------------- nit: auto -> Item? (Generally if the types aren't hard to write or read, it seems nice to include them) ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:44 + + if (pseudo::isToken(NextSym)) + continue; ---------------- nit: I find the spacing (blank lines) here a bit confusing: this seems part of the previous "paragraph" rather than the following one. Could use a comment like "Is there a nonterminal next() to expand?" ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:53 + } + ItemSet Sorted = {Closure.begin(), Closure.end()}; + llvm::sort(Sorted, State::LessItem(G)); ---------------- You still have the `IS` array sitting there, you could reuse that again and avoid the allocation if you didn't expand the set too much :-) ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:84 + for (const Item &I : Batch) + Next.push_back(Item(I.rule(), I.dot() + 1, G)); + // sort the set to keep order determinism for hash computation. ---------------- Maybe call this Item.advance(), to avoid exposing the constructor ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:102 + }; + return llvm::formatv("{0} := {1} • {2}\n", G.symbolName(Rule.Target), + llvm::join(ToNames(Rule.seq().take_front(DotPos)), " "), ---------------- Since this is a single-line output, it seems more flexible to put the caller in charge of adding a newline if they want one? (This would be inconsistent with State::dump of course, but I think that's OK. Maybe we should work out a naming convention at some point) ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:112 + for (const auto &Item : Items) + OS << llvm::formatv(" {0}", Item.dump(G)); + return OS.str(); ---------------- Hardcoding the indentation seems a bit special-purpose. Make the indent level an optional parameter to State::dump() and use OS.indent()? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:126 + for (const auto &E : Edges) { + OS << llvm::formatv("{0}->[{1}]{2}\n", E.Src, G.symbolName(E.Label), E.Dst); + } ---------------- nit: more spaces would make this more readable `{0} ->[{1}] {2}` or even `{0} --[{1}]-> {2}`? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:168 + private: + // We store the hash codes, as collisions is rare and ItemSet is heavy. + // key is the hash value of **kernel** item sets. ---------------- This really feels like a dubious optimization to me as it's *incorrect* if we ever get a hash collision, and the kernel ItemSet is much smaller than `States` which we're already storing. This seems like it must only save 20% or something of overall size? In the other review thread you compared the size of DenseMap<hash_code, size_t> vs DenseMap<ItemSet, size_t>, but really the total size including `States` seems like a more useful comparison. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D119172/new/ https://reviews.llvm.org/D119172 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits