[PATCH] D61637: [Syntax] Introduce syntax trees

Sam McCall via Phabricator via cfe-commits Tue, 21 May 2019 07:25:06 -0700

sammccall added a comment.

Definitely like the choice of CompositeNode owning the concrete storage!




================
Comment at: clang/include/clang/Tooling/Syntax/Arena.h:1
+//===- Arena.h - memory arena and bookkeeping for syntax trees --- C++ -*-===//
+//
----------------
From a user's point of view, this looks a lot like part of the tree structure 
in some sense.

If you expect that users will need to keep the arena rather than the 
TokenBuffer around (e.g. so nodes can be allocated), then it might make sense 
to declare it at the bottom of `Cascade.h`


================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:38
+  std::pair<FileID, llvm::ArrayRef<syntax::Token>>
+  tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Buffer);
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Are you planning to have a way to add tokens directly? Having to turn them 
> > into text and re-lex them seems like it might be inconvenient.
> The tokens have source locations and refer to a text in some buffer. 
> `tokenizeBuffer` makes is easier, not harder, to mock tokens.
Fair enough.

`lexBuffer` might be a slightly clearer name?

Who are the intended non-test users of this function? Are they better served by 
being able (and responsible) for constructing a MemoryBuffer with e.g. a 
sensible name and ownership, or would it be better to pass a StringRef and have 
the Arena come up with a sensible "anonymous" name?


================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:40
+
+  /// Construct a new syntax node of a specified kind. The memory for a node is
+  /// owned by the corpus and will be freed when the corpus is destroyed.
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Now there's two ways to do this: `new (C.allocator()) T(...)` or 
> > `C.construct<T>(...)`. Do we need both?
> > 
> > (If we do, is the syntax `new (C) T(...)` more natural?)
> I think `C.construct<T>()` read better than `new (C) T(...)`. Not a big fan 
> of placement new exprs.
They are fairly consistently used in llvm/clang for this sort of thing, though.

I find it valuable because arena allocations look otherwise a lot like buggy 
ownership patterns. The dedicated syntax calls out the unusual case: we're 
creating a new thing, and someone else owns it, but won't do anything with it.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:11
+//
+// The code is divided in the following way:
+//   - 'Tree/Cascade.h' defines the basic structure of the syntax tree,
----------------
As discussed offline:
 - I don't think (at this point) we need an umbrella header. Generic tree 
structure, specific node semantics, and operations like "build a tree" are 
distinct enough from a user POV that asking them to include headers is fine. 
 - We may want an umbrella for the node types, if we end up splitting that, but 
no need yet.
 - Splitting generic tree stuff vs specific node stuff sounds good, but I think 
having them be sibling headers in `Tooling/Syntax` is enough - not sure about 
the `Tree/` subdirectory.

So I'd suggest something like:
 - `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
 - `Tree.h` -> `BuildTree.h`
 - `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
(have comments on some of these throughout)


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:1
+//===- Tree.h - cascade of the syntax tree --------------------*- C++ 
-*-=====//
+//
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > sammccall wrote:
> > > sammccall wrote:
> > > > this is Cascade.h, not tree.h
> > > why "cascade"?
> > The Tree/ subdirectory seems superfluous - why are these separate from 
> > Syntax/?
> Cascade defines a few base nodes: a composite node (`TreeNode`) and a leaf 
> node that holds tokens.
> I'd really like to isolate them from language-specific nodes, so 
> language-specific nodes live in a separate file (`Nodes.h`).
> However, they need to see the definition of a composite node, hence the split.
> 
> Users are advised to use an umbrella header, `Tree.h`. The extra directory is 
> to minimize the number of headers in the top-level directory, having too many 
> is confusing.
I like the separation - my concern here was the specific word "Cascade".
I'd suggest "Tree" here as it really does define the structure.
The existing "Tree.h" defines operations, and can be named after them. As 
discussed, I think the design is clean and doesn't need to be hidden by an 
umbrella header.


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:74
+/// A composite tree node that has children.
+class TreeNode : public Node {
+public:
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > This use of "tree node" to mean specifically internal node seems confusing 
> > - is it common?
> I don't think it's common, can use `CompositeNode` - seems like a better 
> alternative
What about Subtree?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:39
+public:
+  Node(NodeKind Kind) : Parent(nullptr), Kind(Kind) {}
+  NodeKind kind() const { return Kind; }
----------------
maybe add a comment - newly created nodes have no parent until added to one


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:59
+/// stream.
+class Leaf final : public Node {
+public:
----------------
`syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the tree 
structure rather than the contents. (Unlike TreeNode/CompositeNode this is 
likely to be used for its specific semantics).

Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:64
+
+  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
+
----------------
you're going to want classof for each node type, a private copy constructor 
(for cloning), a friend statement to whatever does the cloning (or the clone() 
function itself, if it goes on the class...)

You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:1
+//===- NodeKind.h - an enum listing nodes of a syntax tree ----*- C++ 
-*-=====//
+//
----------------
Why is this separate from Nodes.h?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:22
+/// For debugging purposes.
+llvm::StringRef toString(NodeKind K);
+
----------------
if you make this `operator<<`, then it's slightly more flexible I think 
(llvm::to_string also works).
It's not as fast, but I don't think that matters here?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeList.h:1
+//===- NodeList.h ---------------------------------------------*- C++ 
-*-=====//
+//
----------------
Implementing custom containers is a bit sad.

Alternatives we discussed:
 - adapt bumpptrallocactor to std allocator and use std::vector: wastes a 
pointer from vector->allocator
 - use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be 
reallocated when children change (but *replacing* children is fine)
 - use a linked-list representation instead: Node* Node::NextSibling, 
CompositeNode::FirstChild. This fits the allocation strategy more nicely. You 
probably need only single links, and can add an iterator if needed.


================
Comment at: clang/lib/Tooling/Syntax/BuildFromAST.cpp:1
+//===- BuildFromAST.cpp ---------------------------------------*- C++ 
-*-=====//
+//
----------------
I haven't reviewed this file yet :-)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D61637: [Syntax] Introduce syntax trees

Reply via email to