ilya-biryukov added a comment.

The new version addresses most of the comments; there are just a few of Dmitri's comments left on the code doing the mapping, and I'll look into simplifying it and removing the possibly redundant checks. Please take a look at this version; it is very close to the final state.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > ilya-biryukov wrote:
> > > gribozavr wrote:
> > > > `invocationTokens` or `macroInvocationTokens`
> > > The plan is to eventually include the macro directive tokens, hence the name.
> > > `invocationTokens` would be somewhat inaccurate in that case; I can't come up with a better name.
> > I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
> >
> > +1 to invocationTokens here.
> Yeah, sorry, I mixed up these two functions with the same name. Will change to `invocationTokens`.
The Mapping is now internal and only stores the indices. The names for the two kinds of tokens are "raw" and "expanded"; happy to consider alternatives for both.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:8
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_TOOLING_SYNTAX_TOKEN_BUFFER_H
----------------
sammccall wrote:
> file comment?
> sometimes they're covered by the class comments, but I think there's a bit to say here.
> in particular, the logical model (how we model the preprocessor, the two token streams, and the types of mappings between them) might go here.
Added a file comment explaining the basic concepts inside the file.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:32
+public:
+  Token() = default;
+  Token(SourceLocation Location, unsigned Length, tok::TokenKind Kind)
----------------
sammccall wrote:
> what's the default state (and why have one?) do you need a way to query whether a token is "valid"?
> (I'd avoid just relying on length() == 0 or location().isInvalid() because it's not obvious to callers this can happen)
Got rid of the default state altogether; I haven't seen use cases where it's important so far. Using `Optional<Token>` for invalid tokens seems like a cleaner design anyway (and we have not needed it so far).

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:60
+
+/// A top-level macro invocation inside a file, e.g.
+/// #define FOO 1+2
----------------
sammccall wrote:
> If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)
The class is now internal to `TokenBuffer` and is called `Mapping`.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:66
+/// macroTokens = {'BAR', '(', '1', ')'}.
+class MacroInvocation {
+public:
----------------
sammccall wrote:
> I'd suggest a more natural order is token -> tokenbuffer -> macroinvocation. Forward declare?
>
> This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens)
Done. I forgot that you could have members of type `vector<T>` where `T` is incomplete.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:74
+  /// The range covering macroTokens().
+  std::pair<unsigned, unsigned> macroRange(const TokenBuffer &B,
+                                           const SourceManager &SM) const;
----------------
sammccall wrote:
> It seems weirdly non-orthogonal to be able to get the range but not the file ID.
>
> I'd suggest either adding another accessor for that, or returning a `struct BufferRange { FileID File; unsigned Begin, End; }` (with a better name.)
>
> I favor the latter, because `.first` and `.second` don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.
>
> Actually, I'd suggest adding such a struct accessor to Token too.
Done. I've used the name `FileRange` instead (the idea is that it points to a substring in a file). Let me know what you think about the name.
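For illustration, a minimal sketch of the struct shape under discussion; the field layout follows the comment above, while the `text()` helper is a hypothetical convenience added here, not code from the patch:

```
#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"
#include "llvm/ADT/StringRef.h"

// Sketch only: a half-open character range inside a single file.
struct FileRange {
  clang::FileID File;
  unsigned Begin = 0; // offset of the first character in the range
  unsigned End = 0;   // offset one past the last character (half-open)

  // Hypothetical helper: the substring of the file the range covers.
  llvm::StringRef text(const clang::SourceManager &SM) const {
    return SM.getBufferData(File).substr(Begin, End - Begin);
  }
};
```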
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:87
+  /// to the identifier token corresponding to a name of the expanded macro.
+  unsigned BeginMacroToken = 0;
+  /// End index in TokenBuffer::macroTokens().
----------------
sammccall wrote:
> please rename along with the macroTokens() function
These are now `BeginRawToken` and `EndRawToken`. As usual, open to suggestions wrt naming.
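To make the renamed indices concrete, a rough sketch of what the internal `Mapping` could store under the "raw"/"expanded" naming; the raw-token fields are named in the reply above, while the expanded-token fields are a guess based on the surrounding discussion:

```
// Sketch only: one contiguous range of raw (spelled) tokens mapped to one
// contiguous range of expanded tokens; both ranges are half-open.
struct Mapping {
  // Indices into the raw token stream of a particular file.
  unsigned BeginRawToken = 0;
  unsigned EndRawToken = 0;
  // Assumed counterpart: indices into the expanded token stream of the TU.
  unsigned BeginExpandedToken = 0;
  unsigned EndExpandedToken = 0;
};
```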
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > Warning, braindump follows - let's discuss further.
> > > We talked a bunch offline about the logical and concrete data model here.
> > >
> > > As things stand:
> > > - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> > > - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
> > > - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
> > >
> > > If we keep this model, I think we should spell both of these out in the comment.
> > > But there's another fairly natural model we could consider:
> > >
> > > - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than one per file
> > > - one FileID corresponds to a contiguous range of *transitive* tokens, and these ranges nest
> > > - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> > > - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. a pp directive or skipped region)
> > > - expanded -> spelled token mapping is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
> > > - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
> > >
> > > e.g. for this code
> > > ```
> > > // foo.cpp
> > > int a;
> > > #include "foo.h"
> > > int b = FOO(42);
> > > // foo.h
> > > #define FOO(X) X*X
> > > int c;
> > > ```
> > >
> > > we'd have this buffer:
> > > ```
> > > expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> > > spelledTokens = {
> > >   "foo.cpp" => [int a ; # include "foo.h" int b = FOO ( 42 ) ;],
> > >   "foo.h" => [# define FOO ( X ) X * X int c ;]
> > > }
> > >
> > > expandedToSpelling = {
> > >   int => int (foo.cpp), type = literal
> > >   a => a
> > >   ; => ;
> > >
> > >   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
> > >   int => int (foo.h)
> > >   c => c
> > >   ; => ;
> > >
> > >   int => int (foo.cpp)
> > >   b => b
> > >   = => =
> > >   [42 * 42] => [FOO ( 42 )](foo.cpp), type=macro invocation
> > >   ; => ; (foo.cpp)
> > > }
> > >
> > > spellingToExpanded = {
> > >   // foo.cpp
> > >   int => int, type=literal
> > >   a => a
> > >   ; => ;
> > >   [# include "foo.h"] => [int c ;], type=include
> > >   int => int
> > >   b => b
> > >   = => =
> > >   [FOO ( 42 )] => [42 * 42], type=macro invocation
> > >   ; => ;
> > >
> > >   // foo.h
> > >   [# define FOO ( X ) X * X] => [], type=preprocessor directive
> > >   int => int
> > >   c => c
> > >   ; => ;
> > > }
> > > ```
> > >
> > > Various implementation strategies are possible here; one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.
> > >
> > > ```
> > > struct ExpandedSpellingMapping {
> > >   unsigned ExpandedStart, ExpandedEnd;
> > >   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
> > >   unsigned SpellingStart, SpellingEnd;
> > >   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> > > };
> > > vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> > > vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> > > ```
> > >
> > > A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> > > (Kind of similar to the compressed posting lists in clangd Dex)
> > >
> > > But as-is, the mappings for the example file would be:
> > > ```
> > > Expanded = {
> > >   {0, 3, File0, 0, 3, LiteralSequence},     // int a;
> > >   {3, 3, File1, 0, 8, Preprocessor},        // #define FOO(X) X * X
> > >   {3, 6, File1, 8, 11, LiteralSequence},    // int c;
> > >   {6, 9, File0, 6, 9, LiteralSequence},     // int b =
> > >   {9, 12, File0, 9, 13, MacroInvocation},   // FOO(42)
> > >   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> > > }
> > > Inclusions = {
> > >   {File1, {3, 6, File0, 3, 6, Inclusion}},  // #include
> > > }
> > > ```
> > Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.
> >
> > Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a contiguous subrange of the expanded token stream).
> >
> > In short, the highlights of the proposed model that I find most notable are:
> > - store a **single** stream of expanded tokens for a TU.
> > - store raw tokens for each file (mapped either by FileID or FileEntry).
> > - store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch).
> > - support an operation mapping a subrange of expanded tokens into a subrange of raw tokens in a particular FileID. This operation might fail.
> >
> > Let me know if I missed something.
> Yes, I think we're on the same page.
>
> Some more details:
> > store a single stream of expanded tokens for a TU.
> BTW I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token stream and omit their tokens from the expanded stream. There's complexity if we want to reject a file and accept its includes, but I think the most important case is "tokenize the main file only", where that's not an issue.
>
> > store information about preprocessor directives and macro replacements in each FileID
> I do think it's useful to treat these as similar things - both for the implementation of mapping between streams, and for users of the API. My hope is that most callsites won't have to consider issues of macros, PP directives, ifdef'd code, etc as separate issues.
>
> > support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail
> Yes, though I don't have a great intuition for what the API should look like:
> - if you select part of an expansion, should the result include the macro invocation, exclude it, or fail?
> - if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
> - if the result spans files, is that a failure or do we represent it somehow?
> Some of these probably need to be configurable.
> (We also have the mapping in the other direction, which shouldn't fail)

> Yes, though I don't have a great intuition for what the API should look like
Agree, and I think there's room for all of these options. For the sake of simplicity in the initial implementation, I would simply pick the set of trade-offs that works for moving around macro calls that span full syntax subtrees, and document it. When the use-cases pop up, we could add more configuration, refactor the APIs, etc.
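As a concrete illustration of the "flat sorted list, bsearchable" strategy from the braindump above, a self-contained sketch of the expanded-to-spelling lookup; the names follow the example, but this is not the patch's actual code:

```
#include <algorithm>
#include <vector>

// Sketch only: mapping entries sorted by ExpandedStart, non-overlapping.
struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd; // half-open expanded-token range
  unsigned SpellingStart, SpellingEnd; // half-open spelled-token range
};

// Find the mapping that covers the given expanded-token index, or null.
const ExpandedSpellingMapping *
findSpelling(const std::vector<ExpandedSpellingMapping> &Mappings,
             unsigned ExpandedIndex) {
  // First entry whose expanded range does not end before the index.
  auto It = std::partition_point(
      Mappings.begin(), Mappings.end(),
      [&](const ExpandedSpellingMapping &M) {
        return M.ExpandedEnd <= ExpandedIndex;
      });
  if (It == Mappings.end() || It->ExpandedStart > ExpandedIndex)
    return nullptr; // index falls into no mapped range
  return &*It;
}
```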
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:117
+  TokenBuffer() = default;
+  // Assumes no macro replacements have taken place.
+  TokenBuffer(std::vector<syntax::Token> Tokens);
----------------
sammccall wrote:
> `, preprocessor directives are parsed but not interpreted`
>
> If using the model above, `this means that the spelled and expanded streams are identical`. (Or are we still going to strip comments?)
Removed this constructor altogether; it does not make much sense in the new model. Instead, `tokenize()` now returns the raw tokens directly in a `vector<Token>`.
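For reference, a sketch of what such a `tokenize()` could do internally using Clang's raw lexing mode; the signature and name are assumptions based on the reply above, not the patch's code:

```
#include "clang/Basic/LangOptions.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Lexer.h"
#include <vector>

// Sketch only: lex FID in raw mode (no preprocessing) and collect the
// tokens until eof. A real tokenize() would convert to syntax::Token.
std::vector<clang::Token> lexRawTokens(clang::FileID FID,
                                       const clang::SourceManager &SM,
                                       const clang::LangOptions &LO) {
  std::vector<clang::Token> Result;
  clang::Lexer L(FID, SM.getBufferOrFake(FID), SM, LO);
  clang::Token T;
  while (!L.LexFromRawLexer(T)) // returns true when T is the eof token
    Result.push_back(T);
  return Result;
}
```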
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
> No particular reason; this was driven by its usage in the function computing replacements (not in this patch, see the [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]] if interested, but keep in mind the prototype is not ready for review yet).
>
> Mapping to a range of "raw" tokens seems more consistent with our model; will update accordingly. Obtaining a character range is easy once one has the tokens from the file.
Added a method to map from an expanded token range to a raw token range. Kept the method that maps an expanded token range to a character range too.

Note that we don't currently have a way to map from a raw token range to an expanded one. That could be added later; we don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.

================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:61
+
+class TokenCollector::Callbacks : public PPCallbacks {
+public:
----------------
sammccall wrote:
> I'm afraid this really needs a class comment describing what this is supposed to do (easy!) and the implementation strategy (hard!)
>
> In particular, it looks like macros work like this:
> - we expect to see the tokens making up the macro invocation first (... FOO ( 42 ))
> - then we see MacroExpands, which allows us to recognize that the last N tokens are a macro invocation. We create a MacroInvocation object and remember where the invocation ends
> - then we see the tokens making up the macro expansion
> - finally, once we see the next token after the invocation, we finalize the MacroInvocation.
Added a comment. The model is now a bit simpler (more work is done by the preprocessor), but let me know if the comment is still unclear and could be improved.

================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:285
+    std::tie(File, Offset) = SM.getDecomposedLoc(L);
+    if (File != MacroInvocationFile || Offset <= ExpansionEndOffset)
+      return;
----------------
sammccall wrote:
> is there a problem if we never see another token in that file after the expansion?
> (or do we see the eof token in this case?)
Yeah, there should always be an eof token. Added a comment and a test for this.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887