I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made.  You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.
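
To make the copying concrete, here is a minimal sketch of the offset-based style I have in mind. The input string and the pattern are made up for illustration and are not taken from remex-html.

```php
<?php
// Illustrative input; $source stands in for the "master" string being lexed.
$source = '<p>hello</p>';
$pos = 3; // byte offset at which to resume lexing

// \G anchors the match at $pos; PREG_OFFSET_CAPTURE reports byte offsets.
preg_match('/\G[^<]+/', $source, $m, PREG_OFFSET_CAPTURE, $pos);

// $m[0] is [matched string, offset].  The string is a fresh copy of the
// matched bytes, even though the lexer only needs the offset and length.
[$text, $offset] = $m[0];
$length = strlen($text);
```

All the lexer wants here is the pair ($offset, $length); the copy in $text is made only so that strlen() can measure it.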

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length of each match
instead of the matched/captured string.  In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return, for each entry of
$matches, the numeric length in element 0 and the numeric offset in
element 1, and so avoid copying the matched substring at all.  This would
greatly reduce the number of substring copies made during lexing.
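
As a rough sketch of the result shape I am proposing, the following userland helper emulates it on top of the existing API. The function name preg_match_spans is hypothetical, PREG_LENGTH_CAPTURE does not exist in PHP, and this emulation still pays for every substring copy; it only shows what the caller would see.

```php
<?php
// Hypothetical helper: emulates the result shape proposed for
// PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE.  The flag itself does not
// exist; this version still copies each substring in order to measure it.
function preg_match_spans(string $pattern, string $subject, int $pos = 0): ?array
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return null; // no match, or a PCRE error
    }
    $spans = [];
    foreach ($m as $key => [$text, $offset]) {
        $spans[$key] = [strlen($text), $offset]; // [length, offset]
    }
    return $spans;
}

$source = 'a href="x.html"';
$spans = preg_match_spans('/(\w+)="([^"]*)"/', $source);

// Only materialize a substring when it is actually needed.
$value = substr($source, $spans[2][1], $spans[2][0]);
```

With a native flag, the substr() call at the end would be the only copy made, and only for the captures the lexer really needs as strings.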

Thoughts?
 --scott

ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields.  That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger project.

pps. while I'm wishing -- preg_replace would benefit from some way to pass
options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
the offset for each replacement match.  Knowing the offset of the match
allows you to do (for example) error reporting from the callback function.
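
For comparison, one way to get at the offsets with the existing functions is to collect the matches with preg_match_all and splice the replacements in by hand. This is only a sketch: the helper name replace_with_offsets and the entity example are hypothetical, and it copies far more than a native option would.

```php
<?php
// Hypothetical helper: behaves like preg_replace_callback, except that the
// callback also receives the byte offset of each match in $subject.
function replace_with_offsets(string $pattern, callable $callback, string $subject): string
{
    preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    $out = '';
    $last = 0; // end of the previous match
    foreach ($sets as $set) {
        [$text, $offset] = $set[0];
        $out .= substr($subject, $last, $offset - $last); // unmatched gap
        $out .= $callback($set, $offset);
        $last = $offset + strlen($text);
    }
    return $out . substr($subject, $last); // trailing remainder
}

$result = replace_with_offsets(
    '/&(\w+);/',
    function (array $set, int $offset): string {
        $name = $set[1][0];
        return $name === 'amp' ? '&' : "[unknown entity at byte $offset]";
    },
    'a &amp; b &bogus; c'
);
```

The callback can now report where in the input the bad entity was found, which is the error-reporting case I mentioned.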

-- 
(http://cscott.net)
