[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Sam McCall via Phabricator via cfe-commits Wed, 02 Feb 2022 09:34:36 -0800

sammccall added a comment.

A few initial comments just on the "easy parts" - changes to grammar and the 
first/follow set computation.


I need to study the rest. Generally I have a feeling some of the impl is very 
inefficient but efficiency may not matter much in this context, and it's 
unclear how much better it can be made without adding lots of complexity. Going 
to look more closely :-)



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:128
+  // Returns all symbols.
+  std::vector<SymbolID> symbols() const;
+
----------------
This function seems like a bit of an attractive nuisance: it could be a cheap 
accessor, but isn't. There are just two callers - both are calling it in a loop 
and both seem dubious to be iterating over all symbols at all. I wonder if we 
can avoid it entirely.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:130
+
+  // Returns the SymbolID of the augmented symbol '_'
+  SymbolID augmentedSymbol() const { return 0; }
----------------
this "augmented symbol" name doesn't seem widely used. (search for `lr 
"augmented symbol"` produces few results, mostly unrelated). It's also a 
confusing name, because it's IIUC it's not the symbol that's augmented, but 
rather the grammar has been augmented with the symbol...

I'd suggest just calling it the start symbol, unless the distinction is 
critical (I don't yet understand why it's important - I added it in the 
prototype just to avoid having the language-specific name "translation-unit" 
hardcoded).

Otherwise it's hard to suggest a good name without understanding what it's for, 
but it should be descriptive...


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector<llvm::DenseSet<SymbolID>> firstSets() const;
----------------
In about the same number of words you could explain what this actually is :-)

// Computes the set of terminals that could appear at the start of each 
non-terminal.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector<llvm::DenseSet<SymbolID>> firstSets() const;
----------------
sammccall wrote:
> In about the same number of words you could explain what this actually is :-)
> 
> // Computes the set of terminals that could appear at the start of each 
> non-terminal.
is there some reason these functions need to be part of the grammar class?

They seem like part of the LR building to me, it could be local to 
LRBuilder.cpp, or exposed from LRTable.h or so for testing


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:66
+
+using ItemSet = std::set<Item>;
+// A state of the LR automaton is an item set, which contains all possible 
rules
----------------
why is std::set used here? (generally a data structure to avoid)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:75
+// precompute all the transitions between these states.
+using State = ItemSet;
+llvm::hash_code hash_code(const State &S);
----------------
I have a feeling we've discussed this before but...

I find this alias/abstraction unhelpful when reading the code, because 
operations on states are e.g. `state.insert` or `state.contains`, which don't 
match the abstraction.

I'd probably prefer just using `ItemSet`, but alternatively `struct State { 
ItemSet Items; }` seems like it would work well: `State.Items.contains(...)` 
reads very naturally.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:39
 
+std::vector<llvm::DenseSet<SymbolID>> Grammar::firstSets() const {
+  std::vector<llvm::DenseSet<SymbolID>> FirstSets(T->Nonterminals.size());
----------------
this function could use some comments.

```
A rule S := T ... implies elements in first(S):
 - if T is a terminal, first(S) contains T
 - if T is a nonterminal, first(S) contains first(T)
Since first(T) may not have been fully computed yet,
first(S) itself may end up being incomplete.
We iterate until a fixed point.
(This isn't particularly efficient, but table building performance isn't 
critical).
```


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:51
+
+  bool Changed = false;
+  do {
----------------
`bool Changed = true; while(Changed) {`? or maybe `for (bool Changed = true; 
Changed;) {`?

(do-while is rare enough that it always takes me a bit to grasp the control 
flow)



================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:56
+      assert(R.size() > 0);
+      Changed |= ExpandFirstSet(R.target(), R.seq().front());
+    }
----------------
comment that we only need to consider the first element because symbols are 
non-nullable


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:88
+    for (const auto &R : T->Rules) {
+      // Rule 2: for a rule X := ...Yz, we add all symbols from FIRST(z) to
+      // FOLLOW(Y).
----------------
nit: `... Y Z`, with consistent spaces/capitalization
Unless case is meant to imply something here?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:95
+        // First set for a non-terminal is itself.
+        auto First =
+            isToken(Next) ? llvm::DenseSet<SymbolID>{Next} : FirstSets[Next];
----------------
copying the set here seems gratuitous, especially since we're doing this in a 
loop where !changed is the usual thing.

I think inlining ExpandFollowSet for each case would make sense here. It turns 
into 1 line for the NT case plus two lines for the T case, so it's actually 
shorter overall


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:101
+      if (!isNonterminal(Z))
+        continue;
+      // Rule 3: for a rule X := ....Z, we add all symbols from FOLLOW(X) to
----------------
I think `if (isNonterminal(Z)) ...expand...` would be clearer here, since the 
condition is really part of rule 3 (but in a not-totally obvious way, it's nice 
to see them side-by-side)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:80
+}
+// For all items A := a.Xb in S, returns a closure itemset with a dot
+// advance of X.
----------------
nit: please be consistent about spacing & capitalization for rules

I think trying to use a convention of uppercase/lowercase for term/nonterm is 
difficult because there's no clear way to indicate that a symbol could be 
either, which is common.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:80
+}
+// For all items A := a.Xb in S, returns a closure itemset with a dot
+// advance of X.
----------------
sammccall wrote:
> nit: please be consistent about spacing & capitalization for rules
> 
> I think trying to use a convention of uppercase/lowercase for term/nonterm is 
> difficult because there's no clear way to indicate that a symbol could be 
> either, which is common.
I can't really understand this description, it'd be helpful if it didn't reuse 
the name "advance" (particularly as a noun)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:153
+    PendingStates.pop_back();
+    for (SymbolID Move : G.symbols()) {
+      State NextState = advance(Builder.find(CurrentStateID), Move, G);
----------------
this loop looks like it's iterating over every possible symbol that might 
appear next, rather than just the ones that do (and creating a temporary array 
of 1000 items each outer iteration to do it).

Seems like we could do better by instead advancing every rule, and spitting out 
a list of {symbol, ItemSet} pairs?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:157
+        continue;
+      auto Insert = Builder.insert(NextState);
+      if (Insert.second) // new state, insert to the processing.
----------------
Hmm, and advance() is computing closures from scratch every time, probably just 
to find "oh, the state already existed". Seems like we could save work by using 
the kernel set to represent the full set until we actually need to iterate over 
its items.

I'd like to think about this some more.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Reply via email to