ilya-biryukov added a comment.

Will address the rest of the comments later; answering a few questions that popped up first.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > gribozavr wrote:
> > > `invocationTokens` or `macroInvocationTokens`
> > The plan is to eventually include the macro directive tokens, hence the name.
> > `invocationTokens` is somewhat inaccurate in that case; I can't come up with a better name.
> I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
>
> +1 to invocationTokens here.
Yeah, sorry, I mixed up these two functions with the same name. Will change to `invocationTokens`.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
sammccall wrote:
> are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.
I don't mind changing this to `Token.h`, although I'd personally expect a file with this name to contain only the definition of the token class. `Tokens.h` would be a better fit from my POV. WDYT?


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> Warning, braindump follows - let's discuss further.
> We talked a bunch offline about the logical and concrete data model here.
>
> As things stand:
>
> - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
> - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
>
> If we keep this model, I think we should spell both of these out in the comment.
> But there's another fairly natural model we could consider:
>
> - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than one per file
> - one FileID corresponds to a contiguous range of *transitive* symbols, these ranges nest
> - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region); each range maps onto a range of expanded tokens (empty for e.g. a pp directive or skipped region)
> - expanded -> spelled is similar (it's almost the inverse of the other mapping; we drop the "include" mapping ranges)
> - this can naturally handle comments, preprocessor directives, pp skipped sections, etc. - these are in the spelled stream but not the expanded stream
>
> e.g.
> for this code
>
> ```
> // foo.cpp
> int a;
> #include "foo.h"
> int b = FOO(42);
> // foo.h
> #define FOO(X) X*X
> int c;
> ```
>
> we'd have this buffer:
>
> ```
> expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> spelledTokens = {
>   "foo.cpp" => [int a ; # include "foo.h" int b = FOO ( 42 ) ;],
>   "foo.h" => [# define FOO ( X ) X * X int c ;]
> }
>
> expandedToSpelling = {
>   int => int (foo.cpp), type = literal
>   a => a
>   ; => ;
>
>   [] => [# define FOO ( X ) X * X] (foo.h), type = preprocessor directive
>   int => int (foo.h)
>   c => c
>   ; => ;
>
>   int => int (foo.cpp)
>   b => b
>   = => =
>   [42 * 42] => [FOO ( 42 )] (foo.cpp), type = macro invocation
>   ; => ; (foo.cpp)
> }
>
> spellingToExpanded = {
>   // foo.cpp
>   int => int, type = literal
>   a => a
>   ; => ;
>   [# include "foo.h"] => [int c ;], type = include
>   int => int
>   b => b
>   = => =
>   [FOO ( 42 )] => [42 * 42], type = macro invocation
>   ; => ;
>
>   // foo.h
>   [# define FOO ( X ) X * X] => [], type = preprocessor directive
>   int => int
>   c => c
>   ; => ;
> }
> ```
>
> Various implementation strategies are possible here; one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.
>
> ```
> struct ExpandedSpellingMapping {
>   unsigned ExpandedStart, ExpandedEnd;
>   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
>   unsigned SpellingStart, SpellingEnd;
>   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> };
> vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> ```
>
> A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> (Kind of similar to the compressed posting lists in clangd Dex.)
>
> But as-is, the mappings for the example file would be:
>
> ```
> Expanded = {
>   {0, 3, File0, 0, 3, LiteralSequence},     // int a;
>   {3, 3, File1, 0, 8, Preprocessor},        // #define FOO(X) X * X
>   {3, 6, File1, 8, 11, LiteralSequence},    // int c;
>   {6, 9, File0, 6, 9, LiteralSequence},     // int b =
>   {9, 12, File0, 9, 13, MacroInvocation},   // FOO(42)
>   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> }
> Inclusions = {
>   {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include
> }
> ```
Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffer`s for multiple files. Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. the tokens before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees: each node of a tree would cover a contiguous subrange of the expanded token stream.

In short, the highlights of the proposed model that I find most notable are:

- store a **single** stream of expanded tokens for the whole TU,
- store the raw tokens of each file (mapped either by FileID or FileEntry),
- store information about preprocessor directives and macro replacements for each FileID (i.e. `MacroInvocation` in this patch),
- support an operation that maps a subrange of expanded tokens to the corresponding subrange of raw tokens in a particular FileID; this operation might fail.

Let me know if I missed something.
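To make sure we are talking about the same thing, here is a rough sketch of the interface I think this model implies. To be clear, this is only an illustration: the names and signatures (`expandedTokens`, `rawTokens`, `toRawTokenRange`, the stub `Token`) are placeholders made up for this comment, not the actual API of the patch.

```
// Hypothetical sketch of the proposed model, not part of the patch.
#include <map>
#include <vector>

#include "clang/Basic/SourceLocation.h" // clang::FileID
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/Optional.h"

namespace clang {
namespace syntax {

// Stand-in for the token class added by this patch.
class Token { /* kind, location, length, ... */ };

class TokenBuffer {
public:
  /// All tokens of the TU after preprocessing, in order.
  llvm::ArrayRef<Token> expandedTokens() const { return ExpandedTokens; }

  /// Raw (spelled) tokens of a single file, i.e. the tokens before
  /// preprocessing, including comments, PP directives and skipped branches.
  llvm::ArrayRef<Token> rawTokens(FileID FID) const;

  /// Maps a subrange of the expanded stream back to a contiguous subrange
  /// of the raw tokens of some file. Returns llvm::None when the mapping
  /// fails, e.g. when the range covers only part of a macro expansion.
  llvm::Optional<llvm::ArrayRef<Token>>
  toRawTokenRange(llvm::ArrayRef<Token> Expanded) const;

private:
  /// A single stream of expanded tokens for the whole TU.
  std::vector<Token> ExpandedTokens;
  /// Raw tokens of each file, before preprocessing.
  std::map<FileID, std::vector<Token>> RawTokens;
  // Plus per-file information about macro replacements and PP directives
  // (`MacroInvocation` in the current patch) to drive the mapping.
};

} // namespace syntax
} // namespace clang
```

Returning `llvm::Optional` from the mapping operation captures the "might fail" part: a range covering only some of the tokens produced by a macro expansion has no corresponding raw token range.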
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
sammccall wrote:
> I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
No particular reason; this was driven by its usage in a function computing replacements (not in this patch, see the [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]] if interested, but keep in mind the prototype is not ready for review yet).

Mapping to a range of "raw" tokens seems more consistent with our model, so I will update the function accordingly; obtaining an offset range is easy once one has the tokens from the file.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:171
+  /// #pragma, etc.
+  llvm::ArrayRef<syntax::Token> macroTokens() const { return MacroTokens; }
+
----------------
sammccall wrote:
> not sure quite what this is for, but "tokens not in expanded stream" might include header-guarded #includes, comments, ifdef'd sections, #defines... is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)?
This was supposed to be all **raw** tokens in a file that are not part of the expanded token stream, i.e. all PP directives, macro-replaced tokens (both object-like and function-like macros), tokens of skipped else branches, etc. In your proposed model the raw tokens of a file would be stored separately, so this function would not be necessary.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887