[PHP-DEV] token_get_all(): additional location information, and raw tokens

Fred Emmott Tue, 05 Jan 2016 15:58:44 -0800

I’m planning on adding this functionality in some form to HHVM, however if it’s 
also wanted in PHP, I’d rather not add something HHVM-specific and will be 
happy to put up RFCs :)


Location Information
————

token_get_all() currently returns only a starting line number, and only for 
tokens that are returned as arrays. I propose adding a 
TOKEN_EXTENDED_LOCATION flag that would additionally include:

 - starting line and character number within that line
 - ending line and character number within that line

T_ENCAPSED_AND_WHITESPACE and T_INLINE_HTML seem to be the most common tokens 
where start line !== end line.
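
As a rough illustration of the information such a flag could expose, here is a userland sketch that derives start and end positions from today's token_get_all() output. The function name and the returned array shape are hypothetical, and it assumes 1-based byte columns with an inclusive end position, none of which the proposal above specifies. It also assumes the token texts concatenate back to the original source.

```php
<?php
/**
 * Hypothetical helper: annotate each token from token_get_all() with
 * start/end line and column. Columns are 1-based byte offsets within the
 * line and the end position is inclusive (both are assumptions).
 */
function sketch_tokens_with_location(string $source): array
{
    $line = 1;
    $col = 1;
    $out = [];

    foreach (token_get_all($source) as $token) {
        // Single-character tokens come back as plain strings without an id.
        $id = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        $startLine = $endLine = $line;
        $startCol = $endCol = $col;

        // Walk the token text byte by byte, tracking the running position.
        $length = strlen($text);
        for ($i = 0; $i < $length; $i++) {
            $endLine = $line;
            $endCol = $col;
            if ($text[$i] === "\n") {
                $line++;
                $col = 1;
            } else {
                $col++;
            }
        }

        $out[] = [
            'id' => $id,
            'text' => $text,
            'start' => [$startLine, $startCol],
            'end' => [$endLine, $endCol],
        ];
    }

    return $out;
}
```

Walking every byte in userland is slow and duplicates bookkeeping the lexer already does, which is part of the argument for exposing this from token_get_all() directly.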

Raw Tokens
————

While token_get_all() is documented as returning whatever the lexer sees, in 
practice third-party software frequently depends on specific output. This 
leaves three options whenever the lexer needs to change:

1. limit changes you make to the lexer to preserve BC
2. lie about the tokens to preserve BC
3. break BC

In our experience, #3 is not practical and #1 can lead to much more complicated 
solutions for problems that would be easily fixable in the lexer - so we went 
for #2. For example, HHVM converts:

 - T_HASHBANG to T_INLINE_HTML
 - T_ELSEIF to T_ELSE T_WHITESPACE T_IF

However, this means that there’s currently no way to get the real lexer 
tokens. I propose adding a TOKEN_RAW flag, which would explicitly allow 
implementation-specific tokens and make no guarantees about output stability.
For now, this would be a no-op in PHP; however, it would give you more freedom 
to modify the lexer in the future (falling back to option #2 whenever the flag 
isn’t specified).
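
To make the interaction between option #2 and the proposed flag concrete, here is a minimal sketch of a compatibility layer that rewrites raw tokens unless the caller opts out. Everything named sketch_* or SKETCH_* is hypothetical: TOKEN_RAW is only proposed here, so the constant below is a stand-in value. The sketch also assumes, purely for illustration, a raw lexer that emits a single T_ELSEIF token for the text "else if"; it is not a description of HHVM's actual implementation.

```php
<?php
// Stand-in for the proposed TOKEN_RAW flag (hypothetical value).
const SKETCH_TOKEN_RAW = 1 << 10;

/**
 * Hypothetical compatibility layer: return the raw tokens untouched when
 * the raw flag is set, otherwise rewrite them into the shape that existing
 * consumers expect (option #2 above).
 */
function sketch_compat_tokens(array $rawTokens, int $flags = 0): array
{
    if ($flags & SKETCH_TOKEN_RAW) {
        return $rawTokens; // caller asked for whatever the lexer saw
    }

    $out = [];
    foreach ($rawTokens as $token) {
        // Assumed raw form: one T_ELSEIF token whose text is "else if".
        if (is_array($token)
            && $token[0] === T_ELSEIF
            && preg_match('/^(else)(\s+)(if)$/i', $token[1], $m)
        ) {
            $line = $token[2];
            $out[] = [T_ELSE, $m[1], $line];
            $out[] = [T_WHITESPACE, $m[2], $line];
            // The "if" starts after any newlines inside the whitespace.
            $out[] = [T_IF, $m[3], $line + substr_count($m[2], "\n")];
            continue;
        }
        $out[] = $token;
    }

    return $out;
}
```

With this split, the lexer is free to change what it emits, and only the rewrite step has to track what older consumers expect.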

With thanks,
 - Fred
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
