[PATCH] D61637: [Syntax] Introduce syntax trees

Ilya Biryukov via Phabricator via cfe-commits Mon, 03 Jun 2019 10:00:55 -0700

ilya-biryukov added a comment.

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and 
leaf structural nodes.


For "language" nodes, I suggest we use  `CompoundStatement`, `Recovery`, 
`TopLevelDeclaration`, `TemplateArgumentList`, `TypeTemplateArgument`, etc. 
That is, we spell out the words in full, no shorthands like `Stmt` or `Expr`. 
That would make things a bit more verbose, but hopefully that helps distinguish 
from clang AST.

For structural nodes, see the relevant comment threads.



================
Comment at: clang/include/clang/Tooling/Syntax/Arena.h:1
+//===- Arena.h - memory arena and bookkeeping for syntax trees --- C++ -*-===//
+//
----------------
sammccall wrote:
> From a user's point of view, this looks a lot like part of the tree structure 
> in some sense.
> 
> If you expect that users will need to keep the arena rather than the 
> TokenBuffer around (e.g. so nodes can be allocated), then it might make sense 
> to declare it at the bottom of `Cascade.h`
Now part of `Tree.h`


================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:38
+  std::pair<FileID, llvm::ArrayRef<syntax::Token>>
+  tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Buffer);
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > ilya-biryukov wrote:
> > > sammccall wrote:
> > > > Are you planning to have a way to add tokens directly? Having to turn 
> > > > them into text and re-lex them seems like it might be inconvenient.
> > > The tokens have source locations and refer to a text in some buffer. 
> > > `tokenizeBuffer` makes is easier, not harder, to mock tokens.
> > Fair enough.
> > 
> > `lexBuffer` might be a slightly clearer name?
> > 
> > Who are the intended non-test users of this function? Are they better 
> > served by being able (and responsible) for constructing a MemoryBuffer with 
> > e.g. a sensible name and ownership, or would it be better to pass a 
> > StringRef and have the Arena come up with a sensible "anonymous" name?
> The only use-case in my prototype so far is creating token nodes for 
> punctuation nodes, e.g. say you want to create an expr of the form `<a>+<b>`, 
> where both `a` and `b` are existing expressions and you need to synthesize a 
> leaf node for `+`.
> We use this function to synthesize a buffer with the corresponding token.
> 
> All the use-cases I can imagine are around synthesizing syntax trees (as 
> opposed to constructing them from the AST).
Renamed to `lexBuffer`. This does not have usages (yet), we can also remove it 
from this patch if needed.


================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:40
+
+  /// Construct a new syntax node of a specified kind. The memory for a node is
+  /// owned by the corpus and will be freed when the corpus is destroyed.
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > Now there's two ways to do this: `new (C.allocator()) T(...)` or 
> > > `C.construct<T>(...)`. Do we need both?
> > > 
> > > (If we do, is the syntax `new (C) T(...)` more natural?)
> > I think `C.construct<T>()` read better than `new (C) T(...)`. Not a big fan 
> > of placement new exprs.
> They are fairly consistently used in llvm/clang for this sort of thing, 
> though.
> 
> I find it valuable because arena allocations look otherwise a lot like buggy 
> ownership patterns. The dedicated syntax calls out the unusual case: we're 
> creating a new thing, and someone else owns it, but won't do anything with it.
Removed in favour of placement new. I guess that also makes it a bit more 
natural to have separate storage for other nodes that have different lifetime 
(e.g. use a separate arena).


================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:11
+//
+// The code is divided in the following way:
+//   - 'Tree/Cascade.h' defines the basic structure of the syntax tree,
----------------
sammccall wrote:
> As discussed offline:
>  - I don't think (at this point) we need an umbrella header. Generic tree 
> structure, specific node semantics, and operations like "build a tree" are 
> distinct enough from a user POV that asking them to include headers is fine. 
>  - We may want an umbrella for the node types, if we end up splitting that, 
> but no need yet.
>  - Splitting generic tree stuff vs specific node stuff sounds good, but I 
> think having them be sibling headers in `Tooling/Syntax` is enough - not sure 
> about the `Tree/` subdirectory.
> 
> So I'd suggest something like:
>  - `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
>  - `Tree.h` -> `BuildTree.h`
>  - `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
> (have comments on some of these throughout)
Changed to the proposed header structure, it looks good.
Had to move `Leaf::classof` and `TreeNode::classof` into a `.cpp`, they need a 
definition of `NodeKind`. Keeping those in a header file was a 
micro-optimization anyway, that's probably not too important.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:74
+/// A composite tree node that has children.
+class TreeNode : public Node {
+public:
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > This use of "tree node" to mean specifically internal node seems 
> > > confusing - is it common?
> > I don't think it's common, can use `CompositeNode` - seems like a better 
> > alternative
> What about Subtree?
`Subtree` seems ok, although `Composite` conveys the meaning better to my taste.

`Composite` does not seem to work without the `node` suffix, though, and we 
probably don't want the suffix in other nodes, so I'm torn on this.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:59
+/// stream.
+class Leaf final : public Node {
+public:
----------------
sammccall wrote:
> `syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the 
> tree structure rather than the contents. (Unlike TreeNode/CompositeNode this 
> is likely to be used for its specific semantics).
> 
> Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?
`syntax::Tokens` actually looks good, but we should rename the `tokens()` 
accessor somehow in that case.
I have only super-generic variants in my head: `elements()`, `items()`. Any 
better ideas?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:64
+
+  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
+
----------------
sammccall wrote:
> you're going to want classof for each node type, a private copy constructor 
> (for cloning), a friend statement to whatever does the cloning (or the 
> clone() function itself, if it goes on the class...)
> 
> You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(
I'd avoid using macros. As long as the amount of boilerplate is small, it's not 
too annoying to write. 

And it is small for now, we only have a single constructor and a `classof` per 
node, so keeping it explicit in this patch seems ok.

We can revisit if more stuff pops up, but I think we do it without extra 
boilerplate per node for clone, and hopefully for other things too.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:1
+//===- NodeKind.h - an enum listing nodes of a syntax tree ----*- C++ 
-*-=====//
+//
----------------
sammccall wrote:
> Why is this separate from Nodes.h?
Not anymore. The original reason was that `Tree.h` need `NodeKind`  for 
implementing casts and the implementation was in a header.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:22
+/// For debugging purposes.
+llvm::StringRef toString(NodeKind K);
+
----------------
sammccall wrote:
> if you make this `operator<<`, then it's slightly more flexible I think 
> (llvm::to_string also works).
> It's not as fast, but I don't think that matters here?
For some reason I thought `gdb` does not show the enum value names, so I added 
a named method to simplify debugging.

Turns out it does show the enum names, not just int values, so I'm perfectly 
happy to the stream output operator.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeList.h:1
+//===- NodeList.h ---------------------------------------------*- C++ 
-*-=====//
+//
----------------
sammccall wrote:
> Implementing custom containers is a bit sad.
> 
> Alternatives we discussed:
>  - adapt bumpptrallocactor to std allocator and use std::vector: wastes a 
> pointer from vector->allocator
>  - use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be 
> reallocated when children change (but *replacing* children is fine)
>  - use a linked-list representation instead: Node* Node::NextSibling, 
> CompositeNode::FirstChild. This fits the allocation strategy more nicely. You 
> probably need only single links, and can add an iterator if needed.
No more custom containers, explicit tree structure seems to be better.


================
Comment at: clang/lib/Tooling/Syntax/BuildFromAST.cpp:1
+//===- BuildFromAST.cpp ---------------------------------------*- C++ 
-*-=====//
+//
----------------
sammccall wrote:
> I haven't reviewed this file yet :-)
Please do!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D61637: [Syntax] Introduce syntax trees

Reply via email to