[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Sam McCall via Phabricator via cfe-commits Fri, 11 Feb 2022 09:33:21 -0800

sammccall added a comment.

This is really nice, mostly nits.




================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:177
 };
+void initTerminals(std::vector<std::string> &Terminals);
 
----------------
why is this exposed/required rather than being initialized by the GrammarTable 
constructor?
Since this is essentially static (must always correspond to tok::TokenKind) it 
seems that GrammarTable could just have an ArrayRef and it could be initialized 
by a lazy-init singleton:

```
// in Grammar.cpp
static ArrayRef<std::string> getTerminalNames() {
  static std::vector<std::string> *TerminalNames = []{

  };
  return *TerminalNames;
}
```

(I know eventually we'd like GrammarTable to be generated and have very minimal 
dynamic init, but there are lots of other details needed e.g. we can't 
statically initialize `vector<Rule>` either, so we should cross that bridge 
when we come to it)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:28
+//  LRTable is *performance-critial* as it is consulted frequently during a
+//  parse. In general, LRTable is very sprase (most of the entries are empty).
+//  For example, for the C++ language, the SLR table has ~1500 states and 650
----------------
space -> sparse


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:58
+// Unlike A typical LR parsing table allows at most one available action per
+// entry, conflicted actions are allowed in LRTable.
+class LRTable {
----------------
Maybe mention relation to GLR: GLR can execute these in parallel?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+    enum Kind : uint8_t {
+      // Terminal actions, corresponding to entries of ACTION table.
+      Error = 0,
----------------
nit: I think you want this to be a comment on the section rather than on Error, 
leave a blank line?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:68
+      // Terminal actions, corresponding to entries of ACTION table.
+      Error = 0,
+      // Shift to state n: move forward with the lookahead, and push state n
----------------
Error seems like an action we'd dynamically handle, rather than an unreachable 
sentinel.

I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant 
places. Even if we do plan dynamic error actions, we have enough spare bits to 
keep these two cases separate.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+      // Nonterminal actions, corresponding to entry of GOTO table.
+      // Go to state n: pust state n onto the state stack.
+      // Similar to Shift, but it is a nonterminal forward transition.
----------------
again blank line between "nonterminal actions" and "go to state n"


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+      // Nonterminal actions, corresponding to entry of GOTO table.
+      // Go to state n: pust state n onto the state stack.
+      // Similar to Shift, but it is a nonterminal forward transition.
----------------
sammccall wrote:
> again blank line between "nonterminal actions" and "go to state n"
pust -> push?

I thought goto replaces the top of the stack rather than being pushed onto it.

e.g.
stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := 
IDENT ."
stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is 
reduce
stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := 
expr . ;"
stack is { stmt := expr . ;}, lookahead is semi...

Line 3=>4 doesn't seem like a push
 



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:106
+    bool operator==(const Action &L) const { return value() == L.value(); }
+    uint16_t value() const { return K << ValueBits | Value; };
+
----------------
maybe value()=>opaque() or asInteger()?

Partly to give a hint a bit more at the purpose, partly to avoid confusion of 
`Value != value()`


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+    // Either StateID or RuleID, depending on the Kind.
+    uint16_t Value : ValueBits;
+  };
----------------
static assert RuleBits <= ValueBits, and I think we want a StateBits above plus 
some assertions on the number of states when building


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+    // Either StateID or RuleID, depending on the Kind.
+    uint16_t Value : ValueBits;
+  };
----------------
sammccall wrote:
> static assert RuleBits <= ValueBits, and I think we want a StateBits above 
> plus some assertions on the number of states when building
FWIW I've mostly seen just `unsigned` as the field type for bitfields. It seems 
a little confusing to explicitly specify the size both as 16 and 13 bits.

Up to you, but if you change this one also consider the Kind enum.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:136
+
+  class Builder {
+  public:
----------------
This is quite a lot of surface (together with the DenseMapInfo stuff) to expose.
Currently it looks like it needs to be in the header so that the LRTableTest 
can construct a table directly rather than going through LRGraph.

I think you could expose a narrower interface:
```
struct Entry { ;... };
// Specify exact table contents for testing.
static LRTable buildForTests(ArrayRef<Entry>);
```

Then the builder/densemapinfo can be moved to the cpp file, and both 
`buildForTests` and `buildSLR` can use the builder internally. WDYT?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:154
+private:
+  // Index is nonterminal SymbolID, values are indices of StateIdx.
+  // Give a non-terminal id, the corresponding half-open range of StateIdx is
----------------
maybe "value is the offset into States/Actions where the entries for this 
symbol begin".



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:163
+  // Grouped by the SymbolID, and only subranges are sorted.
+  std::vector<StateID> StateIdx;
+  // A flat list of available (non-error) actions.
----------------
Reading the building/lookup code, "index" and "idx" are used so often they're 
hard to follow.

I think calling this "States", parallel to "Actions", would be clearer.

Similarly maybe the NontermIdx -> NontermOffset (Offset into States/Actions 
arrays where we find entries for this nonterminal).
Then we can just use the word "index" for actual subscripts.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:165
+  // A flat list of available (non-error) actions.
+  // Sorted by (SymbolID, State), concepetually, an action is identified by a
+  // row (State) and a column (SymbolID) in a matrix.
----------------
concepetually -> conceptually.

I think mixing this comment in with the concrete data is confusing.
Maybe lift it to the top of the private section like:

```
// Conceptually this is a multimap from (State, SymbolID) => action.
// In the literature this is often a table (and multiple entries, i.e. 
conflicts are forbidden).
// Our physical representation is quite different for compactness.
```


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:25
+  case LRTable::Action::Error:
+    return OS << "err";
+  case LRTable::Action::Shift:
----------------
as things currently stand this should be unreachable - these values should 
never escape the DenseMap. assert or llvm_unreachable?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:97
+  auto Result = find(State, Nonterminal);
+  assert(Result.size() == 1 && Result.front().kind() == Action::GoTo);
+  return Result.front().getGoToState();
----------------
(moving the comment thread to the new code location)

The DFA input guarantees Result.size() <= 1, but why can't it be zero?
If this is a requirement on the caller, mention it in the header?



================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:162
+  // default.
+  Table.TerminalIdx.resize(GT.Terminals.size() + 1, 0);
+  Table.NontermIdx.resize(GT.Nonterminals.size() + 1, 0);
----------------
nit: "assign" seems clearer than "resize"


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:165
+  llvm::ArrayRef<Entry> SortedEntries = Sorted;
+  while (!SortedEntries.empty()) {
+    // We iterate each symbol per time (conceptually a column in the matrix per
----------------
We end up with holes because we're looping over the values rather than the 
keys. This also feels quite indirect.

Seems clear enough to reverse this, something like:

```
unsigned Pos = 0;
for (unsigned NT = 0; TK < GT.Nonterminals.size(); ++I) {
  NontermIdx[NT] = Pos;
  while (Pos < Sorted.size() && Sorted[Pos].Action == NT)
    ++Pos;
}
NontermIdx.back() = Pos;
// and the same for terminals
```


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:19
+LRTable LRTable::buildSLR(const Grammar &G) {
+  Builder Build;
+  auto Graph = LRGraph::buildLR0(G);
----------------
(I think it would be worth moving this into LRTable in order to hide the 
Builder type from the header...)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:31
+      if (G.lookupRule(I.rule()).Target == G.startSymbol() && !I.hasNext()) {
+        Build.insert(SID, tokenSymbol(tok::eof), Action::accept());
+        continue;
----------------
(rephrasing a comment that got lost earlier in the review)

I'm not totally clear on the scope/purpose of the accept action:
 - if it's meant to be sufficient to know whether the parse succeeded, I think 
it needs the symbol we accepted as a payload. The grammar will accept them all, 
but the caller will place restrictions.
 - if we're OK inspecting the stack and seeing our last reduction was a Decl or 
so, why doens't the parser just do that at the end of the stream and do away 
with `Accept` altogether? (This would avoid the need to splice `eof` tokens 
into token streams to parse subranges of them).


================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp:20
 
 TEST(LRGraph, Build) {
   struct TestCase {
----------------
Seems sensible to combine the table-building tests in here, but it does make 
the organization of the tests hard.
(Particularly since LRTableTests.cpp exists but the most important tests are in 
this file instead).

Since these cross module boundaries, and they are exactly "bundle of text in, 
bundle of text out"...
how do you feel about making them lit tests instead?

```
# RUN: clang-pseudo -grammar %s -print-graph | FileCheck --check-prefix=GRAPH
# RUN: clang-pseudo -grammar %s -print-table | FileCheck --check-prefix=TABLE
_ := expr
...
#      GRAPH: States:
# GRAPH-NEXT: ...
#      TABLE: LRTable:
# TABLE-NEXT: ...
```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Reply via email to