sammccall added a comment.

OK, I think I understand it now! Sorry for so many comments, happy to chat 
offline about which ones to address/where to start.



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+    enum class Kind : uint8_t {
----------------
API polish: this class is a mix of accessors and direct member state - it can 
be hard to know how to best interact with such a thing.

I'd suggest:
 - public factories `static Action Action::reduce(RuleID)` etc, private data & 
constructor
 - expose `Kind kind()`, but not the individual is() methods: they don't buy 
much and this pattern encourages switch where appropriate
 - make `Kind` enum rather than `enum class` - `Action::Shift` is sufficiently 
scoped already
 - shift() & goTo() -> `targetState()` which asserts either condition. Apart 
from anything else, this makes naming easier: "shift" is a verb so `shift()` is 
a confusing name.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+    enum class Kind : uint8_t {
----------------
sammccall wrote:
> API polish: this class is a mix of accessors and direct member state - it can 
> be hard to know how to best interact with such a thing.
> 
> I'd suggest:
>  - public factories `static Action Action::reduce(RuleID)` etc, private data 
> & constructor
>  - expose `Kind kind()`, but not the individual is() methods: they don't buy 
> much and this pattern encourages switch where appropriate
>  - make `Kind` enum rather than `enum class` - `Action::Shift` is 
> sufficiently scoped already
>  - shift() & goTo() -> `targetState()` which asserts either condition. Apart 
> from anything else, this makes naming easier: "shift" is a verb so `shift()` 
> is a confusing name.
mention that this combines the Action and Go-to tables from LR literature?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:41
+    enum class Kind : uint8_t {
+      // Action table
+      Error = 0,
----------------
I think it's worth documenting what each of these actions means.

Referring to enum values as "action table" and "goto table" seems confusing - 
they're part of the same table! The main reference point when reading the code 
needs to be the code, rather than the textbook.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+
+    bool operator==(const Action &L) const {
+      if (K != L.K)
----------------
if you make the data members private the union could just be a uint16 and save 
yourself some trouble here :-) Up to you


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:84
+
+    Kind K = Kind::Error;
+    // Value
----------------
This action struct can be squeezed to 16 bits.

Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't heavily 
used).
RuleID is 12 bits, StateID would fit within 11 and 12 should be safe.

Combined with the optimization suggested below, this should get the whole table 
down to  335KB.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:102
+  // Returns the transisted state for the state From on a nonterminal.
+  StateID goTo(StateID From, SymbolID NonTerminal) const {
+    assert(pseudo::isNonterminal(NonTerminal));
----------------
nit: please be consistent with `Nonterminal` (and "nonterminal") vs 
`NonTerminal` (and "non-terminal"). I think `Nonterminal` is used elsewhere in 
the code.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:105
+    auto Goto = find({From, NonTerminal});
+    assert(Goto.size() == 1 && Goto.front().isGoTo());
+    return Goto.front().goTo();
----------------
Why is this assertion valid? Is it a precondition of the function that some 
item in From can advance over NonTerminal?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:121
+private:
+  struct IndexKey {
+    StateID From;
----------------
I think there's an even more compact representation:
Basically sort the index keys as symbol-major, but then don't store the actual 
symbol there but rather store the range for the symbol in an external array:

```
// index is lookahead tok::TokenKind. Values are indices into Index: the range 
for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
vector<unsigned> TokenIdx;

// index is nonterminal SymbolID. Values are  indices into Index: the range for 
sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
vector<unsigned> NontermIdx;

// parallel to actions, value is the from state. Only subranges are sorted.
vector<StateID> StateIdx;

vector<Action> Actions;
```

This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra 
indirection should save 5-10 steps of binary search.

This is equivalent to 2 * `vector<unsigned>` + `vector<pair<StateID, Action>>` 
but keeping the searched index compact is probably a good thing.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:74
+      Item NewItem(RID, G.lookupRule(RID).size(), /*dot*/ 0);
+      if (IS.insert(NewItem).second) // new
+        ProcessingItems.push_back(std::move(NewItem));
----------------
As far as I can tell, this is the only place where we need to deduplicate/check 
for membership in the set.

WDYT about using a local DenseSet<Item> in `closure()` itself for this, and 
making the representation of ItemSet a (small)vector?

(IIUC the reason for std::set over DenseMap is determinism. Using vector should 
preserve that: if you really want sorting you can sort after computing closure, 
but if we just want a well-defined order to ensure stable state numbers then 
insertion order should work fine)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:109
+      auto Hash = hash_code(S);
+      auto It = StatesIndex.find(Hash);
+      if (It != StatesIndex.end())
----------------
using hashcode instead of the key itself should be explained.
(If it's just an issue with hash of std::set, I think using std::vector fixes 
it)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:157
+        continue;
+      auto Insert = Builder.insert(NextState);
+      if (Insert.second) // new state, insert to the processing.
----------------
sammccall wrote:
> Hmm, and advance() is computing closures from scratch every time, probably 
> just to find "oh, the state already existed". Seems like we could save work 
> by using the kernel set to represent the full set until we actually need to 
> iterate over its items.
> 
> I'd like to think about this some more.
Yeah, I think this should work. Builder would maintain a map from (sorted 
kernel items) => stateID, and Builder.insert() would call closure() only if 
actually inserting.


================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:46
     llvm::errs() << llvm::formatv("grammar file {0} is parsed successfully\n", 
CheckGrammar);
+    llvm::errs()
+        << clang::syntax::pseudo::LRTable::build(*RSpecs).dumpStatistics();
----------------
I think that we should also have flags to:
 - dump the grammar
 - dump the LR table


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to