Hi all,

I think I first realized that PHP's scanner splits non-constant strings into
many "pieces" after reading Sara's "How long is a piece of string?" blog
entry[1] last summer.  At the time I didn't know much about the internals
and didn't know if anything could be done to change it.  Then in the fall I
finally took a look at the scanner ;-) and thought it would be possible to
only "split" strings at variables.  A few months ago I began working out
the changes -- it was working almost 2 months ago, but then I got
sidetracked :-/ and only now got around to some more testing and a few
semantic token changes.

So anyway, now heredocs and interpolated strings should be pretty much just
like constant strings and concatenation (except for the extra INIT_STRING
opcode).  They scan/parse/compile faster (with less memory), run faster, and
there's less to free when destroying opcodes.
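To illustrate the idea of splitting only at variables, here is a minimal
Python sketch.  It is not the actual Flex rules from the patch: the function
name and the "text"/"variable" labels are made up for illustration, and it
only recognizes simple $name variables (no {$...}, ${...}, array/property
access, or escape handling).

```python
import re

# Assumption: only simple "$name" variables are recognized here; the real
# scanner handles more syntax than this.
VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def split_at_variables(body):
    """Split the body of an interpolated string into pieces.

    A new piece starts only where a variable begins or ends, so the text
    between two variables always stays in one piece.
    """
    pieces = []
    pos = 0
    for match in VARIABLE.finditer(body):
        if match.start() > pos:
            # Everything since the previous variable is a single text piece.
            pieces.append(("text", body[pos:match.start()]))
        pieces.append(("variable", match.group()))
        pos = match.end()
    if pos < len(body):
        # Trailing text after the last variable.
        pieces.append(("text", body[pos:]))
    return pieces


if __name__ == "__main__":
    for piece in split_at_variables("This is $var string"):
        print(piece)
```

With this approach the number of pieces depends on the number of variables,
not on the length of the surrounding text.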

With a simple string like "This is $var string" (say $var = 'some'), I found
the compile/cleanup time to be up to 50% faster, and runtime 55% faster!
(Note: To test compile time, I eval()'d about 50 of them in an if (0) {...}
block.)  The difference will be *much bigger* the more "pieces" there
would've been before (e.g. with longer strings).

The more complex rules increased the size of Flex's tables about 40%.
However, removing the old heredoc end rule, which used the ^
beginning-of-line operator, made the YY_RULE_SETUP macro be empty, saving
some space.  The net result was an 8K/12K larger binary in 5.2/HEAD.  I was
surprised at the overall performance increase without the ^ rule.  Saving
a few operations per match made just about as much difference as Flex's
-Cfe table compression (I was playing with that first :^)) when compiling
the code from Zend/bench.php (5%, I think).

This was with a Windows ZTS build.  Running ApacheBench on a few different
scripts showed pretty nice overall improvements -- 10-15% was common in my
quick tests.

BTW, removing that ^ rule lifts the requirement that the character before
the closing heredoc label "must be a newline as defined by your operating
system," to quote the manual.

Now some of the other changes:

The ST_SINGLE_QUOTE state was removed from 5.2, like in HEAD.

A string like "$$$" is considered constant now, since that's really what it
is, right?
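As a rough sketch of what "constant" means here, the Python check below
treats a double-quoted string body as constant when no character sequence
in it could start a variable.  The function name is hypothetical, and the
rules are a simplification: a "$" followed by a letter, underscore or "{",
or a "{" followed by "$", starts interpolation, and a backslash hides the
character after it.

```python
def is_constant_string(body):
    """Return True if nothing in `body` would start variable interpolation."""
    i = 0
    n = len(body)
    while i < n:
        ch = body[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        nxt = body[i + 1] if i + 1 < n else ""
        # "$name" or "${" starts a variable.
        if ch == "$" and (nxt == "{" or nxt == "_" or nxt.isalpha()):
            return False
        # "{$" starts the curly-brace variable syntax.
        if ch == "{" and nxt == "$":
            return False
        i += 1
    return True


if __name__ == "__main__":
    for sample in ["$$$", "This is $var string", "costs $5", "$$var"]:
        print(repr(sample), is_constant_string(sample))
```

A lone "$" (or a run of them) that is not followed by a valid variable start
is just literal text, which is why "$$$" counts as constant.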

Previously, CG(zend_lineno) wasn't incremented if a \n or \r newline (not
\r\n) followed a backslash in a non-constant string.  Also, \{ returned
T_STRING instead of T_BAD_CHARACTER like any other invalid escape sequence.
(Note: Of course these won't usually match on their own now anyway, but
will be part of a longer string.)

I removed HANDLE_NEWLINES() from the code that scans a string's text,
instead doing the newline check in the escape-checking loop, to prevent
scanning twice.  And I removed the additional boundary check in
HANDLE_NEWLINES() and elsewhere since I didn't see the need -- AFAIK in all
cases you'll only hit '\0'.
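The single-pass idea can be sketched in Python as follows.  This is only an
illustration of folding the newline count into the escape-processing loop,
not the C code from the patch; the function name and the small escape table
are assumptions made for the example.

```python
# Assumption: a small, incomplete table of escapes, enough for the sketch.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "$": "$", '"': '"'}


def unescape_and_count_lines(text):
    """Process escapes and count source newlines in one pass over `text`.

    Returns (unescaped_text, newline_count).  "\r\n" counts as one newline.
    """
    out = []
    newlines = 0
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\\" and i + 1 < n:
            nxt = text[i + 1]
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
                i += 2
                continue
            # Unknown escape: keep the backslash and let the loop look at the
            # next character normally, so a newline right after a backslash
            # is still counted.
            out.append(ch)
            i += 1
            continue
        if ch == "\r":
            newlines += 1
            if i + 1 < n and text[i + 1] == "\n":
                # Treat "\r\n" as a single newline.
                out.append("\r\n")
                i += 2
                continue
        elif ch == "\n":
            newlines += 1
        out.append(ch)
        i += 1
    return "".join(out), newlines


if __name__ == "__main__":
    print(unescape_and_count_lines("a\\nb\nc\r\nd\re"))
```

Because the loop already visits every character to handle escapes, counting
newlines there avoids a second scan over the same text.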

I removed the one <<EOF>> rule since it was missing some states and wasn't
doing anything beyond what the default EOF rule already does by calling
yyterminate().

In zendlex(), the goto target doesn't need to recheck CG(increment_lineno)
since it hasn't changed, and I simplified the closing tag newline check
(the old one also looked like it would miss \r-only newlines).
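For illustration only, a closing-tag newline check that covers all three
newline styles could look like this in Python.  The function name is
hypothetical and this is not the code from zendlex(); it just shows \r\n,
\n and a lone \r each being recognized as one newline.

```python
def newline_length_after_close_tag(rest):
    """Return how many characters of `rest` form a single leading newline.

    `rest` is the text immediately following a closing tag.
    """
    if rest.startswith("\r\n"):
        return 2
    if rest.startswith("\n") or rest.startswith("\r"):
        return 1
    return 0


if __name__ == "__main__":
    for sample in ["\r\nfoo", "\nfoo", "\rfoo", "foo"]:
        print(repr(sample), newline_length_after_close_tag(sample))
```

Checking "\r\n" first matters: otherwise the "\r" branch would match and
leave the "\n" behind.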

Sorry for the long message!  I'll send another if I think of something I
forgot to mention.  Here are the patches:

http://realplain.com/php/scanner_optimizations.diff
http://realplain.com/php/scanner_optimizations_5_2.diff

Appreciate any feedback, or questions about any of it. :-)


Thanks,
Matt

[1] http://blog.libssh2.org/index.php?/archives/28-How-long-is-a-piece-of-string.html
