ilya-biryukov added a comment.

The new version addresses most of the comments; there are just a few of Dmitri's comments left on the code doing the mapping, and I'll look into simplifying it and removing the possibly redundant checks. Please take a look at this version; it is very close to the final state.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > ilya-biryukov wrote:
> > > gribozavr wrote:
> > > > `invocationTokens` or `macroInvocationTokens`
> > > The plan is to eventually include the macro directive tokens, hence the name.
> > > `invocationTokens` would be somewhat inaccurate in that case; I can't come up with a better name.
> > I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
> >
> > +1 to invocationTokens here.
> Yeah, sorry, I mixed up these two functions with the same name. Will change to `invocationTokens`.
The Mapping is now internal and only stores the indices. The names for the two kinds of tokens are "raw" and "expanded"; happy to consider alternatives for both.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:8
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_TOOLING_SYNTAX_TOKEN_BUFFER_H
----------------
sammccall wrote:
> file comment?
> sometimes they're covered by the class comments, but I think there's a bit to say here.
> in particular, the logical model (how we model the preprocessor, the two token streams, and the types of mappings between them) might go here.
Added a file comment explaining the basic concepts inside the file.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:32
+public:
+  Token() = default;
+  Token(SourceLocation Location, unsigned Length, tok::TokenKind Kind)
----------------
sammccall wrote:
> what's the default state (and why have one?) do you need a way to query whether a token is "valid"?
> (I'd avoid just relying on length() == 0 or location().isInvalid() because it's not obvious to callers this can happen)
Got rid of the default state altogether; I haven't seen use cases where it's important so far. Using `Optional<Token>` for invalid tokens seems like a cleaner design anyway (and we have not needed it so far).

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:60
+
+/// A top-level macro invocation inside a file, e.g.
+/// #define FOO 1+2
----------------
sammccall wrote:
> If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)
The class is now internal to `TokenBuffer` and is called `Mapping`.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:66
+/// macroTokens = {'BAR', '(', '1', ')'}.
+class MacroInvocation {
+public:
----------------
sammccall wrote:
> I'd suggest a more natural order is token -> tokenbuffer -> macroinvocation. Forward declare?
>
> This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens)
Done. I forgot that you could have members of type `vector<T>` where `T` is incomplete.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:74
+  /// The range covering macroTokens().
+  std::pair<unsigned, unsigned> macroRange(const TokenBuffer &B,
+                                           const SourceManager &SM) const;
----------------
sammccall wrote:
> It seems weirdly non-orthogonal to be able to get the range but not the file ID.
>
> I'd suggest either adding another accessor for that, or returning a `struct BufferRange { FileID File; unsigned Begin, End; }` (with a better name.)
>
> I favor the latter, because `.first` and `.second` don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.
>
> Actually, I'd suggest adding such a struct accessor to Token too.
Done. I've used the name `FileRange` instead (the idea is that it points to a substring in a file). Let me know what you think about the name.
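For illustration, a minimal sketch of the struct shape under discussion; the field layout follows the comment above, while the `text()` helper is a hypothetical convenience added here, not code from the patch:

```
#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"
#include "llvm/ADT/StringRef.h"

// Sketch only: a half-open character range inside a single file.
struct FileRange {
  clang::FileID File;
  unsigned Begin = 0; // offset of the first character in the range
  unsigned End = 0;   // offset one past the last character (half-open)

  // Hypothetical helper: the substring of the file the range covers.
  llvm::StringRef text(const clang::SourceManager &SM) const {
    return SM.getBufferData(File).substr(Begin, End - Begin);
  }
};
```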
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:87
+  /// to the identifier token corresponding to a name of the expanded macro.
+  unsigned BeginMacroToken = 0;
+  /// End index in TokenBuffer::macroTokens().
----------------
sammccall wrote:
> please rename along with the macroTokens() function
These are now `BeginRawToken` and `EndRawToken`. As usual, open to suggestions wrt naming.
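To make the renamed indices concrete, a rough sketch of what the internal `Mapping` could store under the "raw"/"expanded" naming; the raw-token fields are named in the reply above, while the expanded-token fields are a guess based on the surrounding discussion:

```
// Sketch only: one contiguous range of raw (spelled) tokens mapped to one
// contiguous range of expanded tokens; both ranges are half-open.
struct Mapping {
  // Indices into the raw token stream of a particular file.
  unsigned BeginRawToken = 0;
  unsigned EndRawToken = 0;
  // Assumed counterpart: indices into the expanded token stream of the TU.
  unsigned BeginExpandedToken = 0;
  unsigned EndExpandedToken = 0;
};
```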
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > Warning, braindump follows - let's discuss further.
> > > We talked a bunch offline about the logical and concrete data model here.
> > >
> > > As things stand:
> > > - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> > > - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
> > > - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
> > >
> > > If we keep this model, I think we should spell both of these out in the comment.
> > > But there's another fairly natural model we could consider:
> > >
> > > - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than one per file
> > > - one FileID corresponds to a contiguous range of *transitive* tokens, and these ranges nest
> > > - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> > > - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. a pp directive or skipped region)
> > > - expanded -> spelled token mapping is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
> > > - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
> > >
> > > e.g. for this code
> > > ```
> > > // foo.cpp
> > > int a;
> > > #include "foo.h"
> > > int b = FOO(42);
> > > // foo.h
> > > #define FOO(X) X*X
> > > int c;
> > > ```
> > >
> > > we'd have this buffer:
> > > ```
> > > expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> > > spelledTokens = {
> > >   "foo.cpp" => [int a ; # include "foo.h" int b = FOO ( 42 ) ;],
> > >   "foo.h" => [# define FOO ( X ) X * X int c ;]
> > > }
> > >
> > > expandedToSpelling = {
> > >   int => int (foo.cpp), type = literal
> > >   a => a
> > >   ; => ;
> > >
> > >   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
> > >   int => int (foo.h)
> > >   c => c
> > >   ; => ;
> > >
> > >   int => int (foo.cpp)
> > >   b => b
> > >   = => =
> > >   [42 * 42] => [FOO ( 42 )](foo.cpp), type=macro invocation
> > >   ; => ; (foo.cpp)
> > > }
> > >
> > > spellingToExpanded = {
> > >   // foo.cpp
> > >   int => int, type=literal
> > >   a => a
> > >   ; => ;
> > >   [# include "foo.h"] => [int c ;], type=include
> > >   int => int
> > >   b => b
> > >   = => =
> > >   [FOO ( 42 )] => [42 * 42], type=macro invocation
> > >   ; => ;
> > >
> > >   // foo.h
> > >   [# define FOO ( X ) X * X] => [], type=preprocessor directive
> > >   int => int
> > >   c => c
> > >   ; => ;
> > > }
> > > ```
> > >
> > > Various implementation strategies are possible here; one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.
> > >
> > > ```
> > > struct ExpandedSpellingMapping {
> > >   unsigned ExpandedStart, ExpandedEnd;
> > >   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
> > >   unsigned SpellingStart, SpellingEnd;
> > >   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> > > };
> > > vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> > > vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> > > ```
> > >
> > > A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> > > (Kind of similar to the compressed posting lists in clangd Dex)
> > >
> > > But as-is, the mappings for the example file would be:
> > > ```
> > > Expanded = {
> > >   {0, 3, File0, 0, 3, LiteralSequence},     // int a;
> > >   {3, 3, File1, 0, 8, Preprocessor},        // #define FOO(X) X * X
> > >   {3, 6, File1, 8, 11, LiteralSequence},    // int c;
> > >   {6, 9, File0, 6, 9, LiteralSequence},     // int b =
> > >   {9, 12, File0, 9, 13, MacroInvocation},   // FOO(42)
> > >   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> > > }
> > > Inclusions = {
> > >   {File1, {3, 6, File0, 3, 6, Inclusion}},  // #include
> > > }
> > > ```
> > Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.
> >
> > Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a contiguous subrange of the expanded token stream).
> >
> > In short, the highlights of the proposed model that I find most notable are:
> > - store a **single** stream of expanded tokens for a TU.
> > - store raw tokens for each file (mapped either by FileID or FileEntry).
> > - store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch).
> > - support an operation mapping a subrange of expanded tokens into a subrange of raw tokens in a particular FileID. This operation might fail.
> >
> > Let me know if I missed something.
> Yes, I think we're on the same page.
>
> Some more details:
> > store a single stream of expanded tokens for a TU.
> BTW I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token stream and omit their tokens from the expanded stream. There's complexity if we want to reject a file and accept its includes, but I think the most important case is "tokenize the main file only", where that's not an issue.
>
> > store information about preprocessor directives and macro replacements in each FileID
> I do think it's useful to treat these as similar things - both for the implementation of mapping between streams, and for users of the API. My hope is that most callsites won't have to consider issues of macros, PP directives, ifdef'd code, etc as separate issues.
>
> > support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail
> Yes, though I don't have a great intuition for what the API should look like:
> - if you select part of an expansion, should the result include the macro invocation, exclude it, or fail?
> - if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
> - if the result spans files, is that a failure or do we represent it somehow?
> Some of these probably need to be configurable.
> (We also have the mapping in the other direction, which shouldn't fail)

> Yes, though I don't have a great intuition for what the API should look like
Agree, and I think there's room for all of these options. For the sake of simplicity in the initial implementation, I would simply pick the set of trade-offs that works for moving around macro calls that span full syntax subtrees, and document it. When the use-cases pop up, we could add more configuration, refactor the APIs, etc.
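As a concrete illustration of the "flat sorted list, bsearchable" strategy from the braindump above, a self-contained sketch of the expanded-to-spelling lookup; the names follow the example, but this is not the patch's actual code:

```
#include <algorithm>
#include <vector>

// Sketch only: mapping entries sorted by ExpandedStart, non-overlapping.
struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd; // half-open expanded-token range
  unsigned SpellingStart, SpellingEnd; // half-open spelled-token range
};

// Find the mapping that covers the given expanded-token index, or null.
const ExpandedSpellingMapping *
findSpelling(const std::vector<ExpandedSpellingMapping> &Mappings,
             unsigned ExpandedIndex) {
  // First entry whose expanded range does not end before the index.
  auto It = std::partition_point(
      Mappings.begin(), Mappings.end(),
      [&](const ExpandedSpellingMapping &M) {
        return M.ExpandedEnd <= ExpandedIndex;
      });
  if (It == Mappings.end() || It->ExpandedStart > ExpandedIndex)
    return nullptr; // index falls into no mapped range
  return &*It;
}
```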
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:117
+  TokenBuffer() = default;
+  // Assumes no macro replacements have taken place.
+  TokenBuffer(std::vector<syntax::Token> Tokens);
----------------
sammccall wrote:
> `, preprocessor directives are parsed but not interpreted`
>
> If using the model above, `this means that the spelled and expanded streams are identical`. (Or are we still going to strip comments?)
Removed this constructor altogether; it does not make much sense in the new model. Instead, `tokenize()` now returns the raw tokens directly in a `vector<Token>`.
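For reference, a sketch of what such a `tokenize()` could do internally using Clang's raw lexing mode; the signature and name are assumptions based on the reply above, not the patch's code:

```
#include "clang/Basic/LangOptions.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Lexer.h"
#include <vector>

// Sketch only: lex FID in raw mode (no preprocessing) and collect the
// tokens until eof. A real tokenize() would convert to syntax::Token.
std::vector<clang::Token> lexRawTokens(clang::FileID FID,
                                       const clang::SourceManager &SM,
                                       const clang::LangOptions &LO) {
  std::vector<clang::Token> Result;
  clang::Lexer L(FID, SM.getBufferOrFake(FID), SM, LO);
  clang::Token T;
  while (!L.LexFromRawLexer(T)) // returns true when T is the eof token
    Result.push_back(T);
  return Result;
}
```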
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
> No particular reason; this was driven by its usage in the function computing replacements (not in this patch, see the [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]] if interested, but keep in mind the prototype is not ready for review yet).
>
> Mapping to a range of "raw" tokens seems more consistent with our model; will update accordingly. Obtaining a character range is easy once one has the tokens from the file.
Added a method to map from an expanded token range to a raw token range. Kept the method that maps an expanded token range to a character range too.

Note that we don't currently have a way to map from a raw token range to an expanded one. That could be added later; we don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:61
+
+class TokenCollector::Callbacks : public PPCallbacks {
+public:
----------------
sammccall wrote:
> I'm afraid this really needs a class comment describing what this is supposed to do (easy!) and the implementation strategy (hard!)
>
> In particular, it looks like macros work like this:
> - we expect to see the tokens making up the macro invocation first (... FOO ( 42 ))
> - then we see MacroExpands, which allows us to recognize that the last N tokens are a macro invocation. We create a MacroInvocation object and remember where the invocation ends
> - then we see the tokens making up the macro expansion
> - finally, once we see the next token after the invocation, we finalize the MacroInvocation.
Added a comment. The model is now a bit simpler (more work is done by the preprocessor), but let me know if the comment is still unclear and could be improved.

================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:285
+    std::tie(File, Offset) = SM.getDecomposedLoc(L);
+    if (File != MacroInvocationFile || Offset <= ExpansionEndOffset)
+      return;
----------------
sammccall wrote:
> is there a problem if we never see another token in that file after the expansion?
> (or do we see the eof token in this case?)
Yeah, there should always be an eof token. Added a comment and a test for this.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887