ilya-biryukov added a comment.

I'll address the rest of the comments later; first, here are answers to a few 
questions that popped up.



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > gribozavr wrote:
> > > `invocationTokens` or `macroInvocationTokens`
> > The plan is to eventually include the macro directive tokens, hence the 
> > name. `invocationTokens` would be somewhat inaccurate in that case; I 
> > can't come up with a better name.
> I think your reply applies to TokenBuffer::macroTokens(), not 
> MacroInvocation::macroTokens().
> 
> +1 to invocationTokens here.
Yeah, sorry, I mixed up these two functions with the same name. I'll change to 
`invocationTokens`.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
sammccall wrote:
> are you sure TokenBuffer is the central concept in this file, rather than 
> just the thing with the most code? Token.h might end up being a better name 
> for users.
I don't mind changing this to `Token.h`, although I'd personally expect a file 
with that name to contain only the definition of the token class. `Tokens.h` 
would be a better fit from my POV. WDYT?


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> Warning, braindump follows - let's discuss further.
> We talked a bunch offline about the logical and concrete data model here.
> 
> As things stand:
>  - #includes are not expanded, but will refer to a file ID with its own 
> buffer: `map<FileID, TokenBuffer>` is the whole structure
>  - no information is captured about other PP interactions (PP directives that 
> generate no tokens, skipped sections)
>  - the spelled sequence of tokens is not directly available (but expanded + 
> macro invocations may be enough to reconstruct it)
> 
> If we keep this model, I think we should spell both of these out in the 
> comment.
> But there's another fairly natural model we could consider:
> 
>  - we have a single TokenBuffer with a flat list of all expanded tokens in 
> the TU, rather than for the file 
>  - one FileID corresponds to a contiguous range of *transitively* expanded 
> tokens; these ranges nest
>  - the TokenBuffer also stores the original tokens as `map<FileID, 
> vector<Token>>`
>  - spelled -> expanded token mapping: spelled tokens for a file are 
> partitioned into ranges (types: literal, include, macro invocation, other pp 
> directive, pp skipped region). each range maps onto a range of expanded 
> tokens (empty for e.g. pp directive or skipped region)
>  - expanded -> spelled token is similar. (It's almost the inverse of the 
> other mapping; we drop the "include" mapping ranges)
>  - This can naturally handle comments, preprocessor directives, pp skipped 
> sections, etc - these are in the spelled stream but not the expanded stream.
> 
> e.g. for this code
> ```
> // foo.cpp
> int a;
> #include "foo.h"
> int b =  FOO(42);
> // foo.h
> #define FOO(X) X*X
> int c;
> ```
> 
> we'd have this buffer:
> ```
> expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> spelledTokens = {
>   "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
>   "foo.h" => [# define FOO ( X ) X * X int c ;]
> }
> 
> expandedToSpelling = {
>   int => int (foo.cpp), type = literal
>   a => a
>   ; => ;
> 
>   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
>   int => int (foo.h)
>   c => c
>   ; => ;
> 
>   int => int (foo.cpp)
>   b => b
>   = => =
>   [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
>   ; => ; (foo.cpp)
> }
> 
> spellingToExpanded = {
>   // foo.cpp
>   int => int, type=literal
>   a => a
>   ; => ;
>   [# include "foo.h"] => [int c ;], type=include
>   int => int
>   b => b
>   = => =
>   [FOO ( 42 )] => [42 * 42], type=macro invocation
>   ; => ;
> 
>   // foo.h
>   [# define FOO ( X ) X * X] => [], type=preprocessor directive
>   int => int
>   c => c
>   ; => ;
> }
> ```
> 
> Various implementation strategies possible here, one obvious one is to use a 
> flat sorted list, and include a sequence of literal tokens as a single entry. 
> 
> ```
> struct ExpandedSpellingMapping {
>   unsigned ExpandedStart, ExpandedEnd;
>   // Redundant for illustration: can actually be derived from
>   // SpellingTokens[SpellingStart].location().
>   FileID SpellingFile;
>   unsigned SpellingStart, SpellingEnd;
>   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped,
>          Inclusion } Kind;
> };
> vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> // For spelling -> expanded mapping. Redundant: much can be taken from
> // SourceManager.
> vector<pair<FileID, ExpandedSpellingMapping>> Inclusions;
> ```
> 
> A delta-encoded representation with chunks for bsearch will likely be much 
> more compact, if this proves large.
> (Kind of similar to the compressed posting lists in clangd Dex)
> 
> But as-is, the mappings for the example file would be:
> ```
> Expanded = {
>   {0, 3, File0, 0, 3, LiteralSequence}, // int a;
>   {3, 3, File1, 0, 9, Preprocessor}, // #define FOO(X) X * X
>   {3, 6, File1, 9, 12, LiteralSequence}, // int c;
>   {6, 9, File0, 6, 9, LiteralSequence}, // int b =
>   {9, 12, File0, 9, 13, MacroExpansion}, // FOO(42)
>   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> }
> Inclusions = {
>   {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include "foo.h"
> }
> ```
Thanks for raising the multi-file issue earlier rather than later. My original 
intention was to model `TokenBuffer` as a stream of expanded tokens **for a 
single FileID**, i.e. we would have a separate `TokenBuffer` for each file.

Your suggestion of having a single set of expanded tokens and a map from FileID 
to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single 
expanded token stream is definitely a better model for the syntax trees (each 
node of a tree would cover a contiguous subrange of the expanded token stream).

In short, the highlights of the proposed model that I find most notable are:
- store a **single** stream of expanded tokens for the whole TU.
- store raw tokens for each file (mapped either by FileID or FileEntry).
- store information about preprocessor directives and macro replacements for 
each FileID (i.e. `MacroInvocation` in this patch).
- support an operation mapping a subrange of expanded tokens to the 
corresponding subrange of raw tokens in a particular FileID; this operation 
might fail.

Let me know if I missed something.
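
To make this concrete, here is a rough sketch of the interface I have in mind. 
All names and signatures below are placeholders for discussion, not a final 
API:

```
class TokenBuffer {
public:
  /// All tokens produced by the preprocessor, i.e. after all macro
  /// replacements and directives were handled. A single flat stream for the
  /// whole TU.
  llvm::ArrayRef<syntax::Token> expandedTokens() const;

  /// Tokens of a file as spelled in the source code, before preprocessing.
  llvm::ArrayRef<syntax::Token> rawTokens(FileID FID) const;

  /// Map a subrange of expanded tokens to the corresponding subrange of raw
  /// tokens inside a single FileID. Returns llvm::None when the mapping is
  /// impossible, e.g. when the range covers only part of a macro expansion.
  llvm::Optional<llvm::ArrayRef<syntax::Token>>
  rawTokensForRange(llvm::ArrayRef<syntax::Token> Expanded) const;

private:
  /// A single stream of expanded tokens for the whole TU.
  std::vector<syntax::Token> ExpandedTokens;
  /// Raw tokens of each file, before preprocessing.
  llvm::DenseMap<FileID, std::vector<syntax::Token>> RawTokens;
  /// Per-file information about PP directives and macro replacements
  /// (`MacroInvocation` in this patch) would also live here.
};
```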


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
sammccall wrote:
> I'd think this would map a character range onto a character range, or a token 
> range onto a token range - is there some reason otherwise?
No particular reason; this was driven by its usage in a function computing 
replacements (not in this patch, see the [[ 
https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp
 | github prototype ]] if interested, but keep in mind the prototype is not 
ready for review yet).

Mapping to a range of "raw" tokens seems more consistent with our model, so I 
will update accordingly; obtaining an offset range is easy once one has the 
tokens from the file.
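
For illustration, a hypothetical caller computing replacements could then 
recover the offsets itself. This assumes the `rawTokensForRange` mapping 
sketched above and that the token class exposes `location()` and 
`endLocation()`:

```
// Map the expanded tokens back to the raw tokens they were spelled as.
if (llvm::Optional<llvm::ArrayRef<syntax::Token>> Raw =
        Buffer.rawTokensForRange(ExpandedRange)) {
  // All raw tokens of a successfully mapped range come from one file, so
  // plain file offsets are enough to build a replacement.
  unsigned Begin = SM.getFileOffset(Raw->front().location());
  unsigned End = SM.getFileOffset(Raw->back().endLocation());
  // ... create a tooling::Replacement over [Begin, End) ...
}
```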


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:171
+  ///        #pragma, etc.
+  llvm::ArrayRef<syntax::Token> macroTokens() const { return MacroTokens; }
+
----------------
sammccall wrote:
> not sure quite what this is for, but "tokens not in expanded stream" might 
> include header-guarded #includes, comments, #ifdef'd sections, #defines...
> is this API intended to be a whitelist (macro invocations specifically) or a 
> blacklist (everything that's not literally in the expanded stream)?
This was supposed to be all **raw** tokens in a file that are not part of the 
expanded token stream, i.e. all PP directives, macro-replaced tokens (for both 
object-like and function-like macros), tokens of skipped `#if`/`#else` 
branches, etc.

In your proposed model, we would store the raw tokens of each file separately, 
so this API would not be necessary.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887


