Hi all,

I think I first realized that PHP's scanner splits non-constant strings into many "pieces" after reading Sara's "How long is a piece of string?" blog entry[1] last summer. At the time I didn't know much about the internals and didn't know if anything could be done to change it. Then in the fall I finally took a look at the scanner ;-) and thought it would be possible to only "split" strings at variables. Finally a few months ago, I began working out the changes -- it was working almost 2 months ago, but then I got sidetracked :-/ from doing some more testing and making a few semantic token changes till now.
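
To make the "split only at variables" idea concrete, here is a minimal sketch in C. It is not taken from the patch or from the Zend scanner: count_pieces() and its two helpers are hypothetical names, and the sketch only recognizes simple ASCII $name variables (no {$...}, ${...}, array or property access). It counts how many pieces a double-quoted string body would yield if literal text is kept together and only variables start a new piece.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helpers (not Zend code): ASCII-only variable name rules. */
static int is_label_start(unsigned char c) {
    return isalpha(c) || c == '_';
}

static int is_label_char(unsigned char c) {
    return isalnum(c) || c == '_';
}

/* Count the pieces produced when a double-quoted string body is split
 * only at simple $name variables. Each run of literal text is one piece
 * and each variable is one piece. */
size_t count_pieces(const char *s) {
    size_t pieces = 0;
    int in_literal = 0;   /* nonzero while inside a run of literal text */

    while (*s) {
        if (s[0] == '\\' && s[1] != '\0') {
            /* A backslash escape stays part of the literal text. */
            if (!in_literal) { pieces++; in_literal = 1; }
            s += 2;
        } else if (s[0] == '$' && is_label_start((unsigned char)s[1])) {
            /* A variable ends the current literal run and is its own piece. */
            pieces++;
            in_literal = 0;
            s += 2;
            while (is_label_char((unsigned char)*s)) s++;
        } else {
            /* Any other byte, including a lone '$', is literal text. */
            if (!in_literal) { pieces++; in_literal = 1; }
            s++;
        }
    }
    return pieces;
}
```

Under these assumptions, "This is $var string" comes out as three pieces (literal, variable, literal), and a string with no variables at all is a single piece.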
So anyway, now heredocs and interpolated strings should be pretty much just like constant strings and concatenation (except for the extra INIT_STRING opcode). They scan/parse/compile faster (with less memory), run faster, and there's less to free when destroying opcodes. With a simple string like "This is $var string" (say $var = 'some'), I found the compile/cleanup time to be up to 50% faster, and runtime 55% faster! (Note: To test compile time, I eval()'d about 50 of them in an if (0) {...} block.) The difference will be *much more* depending on how many "pieces" there would've been before (e.g. longer).

The more complex rules increased the size of Flex's tables about 40%. However, removing the old heredoc end rule, which used the ^ beginning-of-line operator, made the YY_RULE_SETUP macro be empty, saving some space. The net result was an 8K/12K larger binary in 5.2/HEAD. I was surprised at the overall performance increase without the ^ rule. Its saving a few operations per match made just about as much difference as Flex's -Cfe table compression (was playing with that first :^)) when compiling the code from Zend/bench.php (5% I think). This was with a Windows ZTS build. Running ApacheBench on a few different scripts showed pretty nice overall improvements -- 10-15% was common in my quick tests.

BTW, removing that ^ rule lifts the requirement that the character before the closing heredoc label "must be a newline as defined by your operating system," to quote the manual.

Now some of the other changes:

The ST_SINGLE_QUOTE state was removed from 5.2, like in HEAD.

A string like "$$$" is considered constant now, since that's really what it is, right?

CG(zend_lineno) wasn't incremented before if a \n or \r newline (not \r\n) followed a backslash in a non-constant string. \{ returned T_STRING instead of T_BAD_CHARACTER like any other invalid escape sequence. (Note: Of course these won't usually match now anyway, but will be part of a longer string.)
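
As an illustration of the line-number fix above, here is a rough sketch in C of counting newlines inside the same loop that walks backslash escapes, so that a \n or \r directly after a backslash is still counted. This is not the patch's code: count_lines_in_string() is a hypothetical name, and the sketch takes an explicit length instead of relying on a terminating '\0'.

```c
#include <stddef.h>

/* Hypothetical stand-in for an escape-checking loop (not Zend code).
 * Returns the number of newlines in s[0..len). \n, \r and \r\n each
 * count as one newline, including when they directly follow a backslash. */
size_t count_lines_in_string(const char *s, size_t len) {
    size_t lines = 0;
    size_t i = 0;

    while (i < len) {
        char c = s[i++];

        if (c == '\\' && i < len) {
            /* Consume the escaped character, then fall through so that an
             * escaped newline is still counted. */
            c = s[i++];
        }

        if (c == '\n') {
            lines++;
        } else if (c == '\r') {
            lines++;
            if (i < len && s[i] == '\n') {
                i++;   /* treat \r\n as a single newline */
            }
        }
    }
    return lines;
}
```

The point of the single loop is that the string is scanned once: the escape check and the newline check look at the same character, so no second pass over the text is needed just to update the line counter.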
I removed HANDLE_NEWLINES() from the code that scans a string's text, instead doing the newline check in the escape-checking loop, to prevent scanning twice. And I removed the additional boundary check in HANDLE_NEWLINES() and elsewhere since I didn't see the need -- AFAIK in all cases you'll only hit '\0'.

I removed the one <<EOF>> rule since it was missing some states and it wasn't doing anything that the default EOF rule doesn't by calling yyterminate().

In zendlex(), the goto target doesn't need to recheck CG(increment_lineno) since it hasn't changed, and I simplified the closing tag newline check (also looked like it would miss \r ones).

Sorry for the long message! I'll send another if I think of something I forgot to mention.

Here are the patches:
http://realplain.com/php/scanner_optimizations.diff
http://realplain.com/php/scanner_optimizations_5_2.diff

Appreciate any feedback, or questions about any of it. :-)

Thanks,
Matt

[1] http://blog.libssh2.org/index.php?/archives/28-How-long-is-a-piece-of-string.html

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php