On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org>
wrote:

> I'm floating an idea for an RFC here.
>
> I'm working on the wikimedia/remex-html library for high-performance
> PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
> worthwhile to try to reduce the number of string copies made.  You can
> generally perform matches using offsets into your master source string.
> However, preg_match* will copy a substring for the entire matched region
> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
> These substring copies can get expensive if the matched region/captured
> patterns are very large.
>
> It would be helpful if PHP's preg_match* functions offered a flag, say
> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
> matched/captured string.  In combination,
> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
> element 0 and the numeric offset in element 1, and avoid the need to copy
> the matched substring unnecessarily.  This would allow greatly reducing the
> number of substring copies made during lexing.
>

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?
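
For reference, a minimal sketch of the copying behaviour described above,
using only what preg_match offers today. The input string and pattern are
made up for illustration and are not taken from remex-html; the proposed
PREG_LENGTH_CAPTURE flag does not exist, so the length is derived from the
copied substring here.

```php
<?php
// Illustrative input only; a real lexer would run over a large document.
$source = '<p>hello</p>';
$pos = 3; // byte offset where the lexer currently stands

// \G anchors the match at $pos, so we never slice $source ourselves.
// PREG_OFFSET_CAPTURE still returns [substring, offset] pairs, which
// means the matched text is copied out of $source.
if (preg_match('/\G[^<]+/', $source, $m, PREG_OFFSET_CAPTURE, $pos)) {
    [$text, $offset] = $m[0];
    // Today the length is only available via the copied substring.
    $length = strlen($text);
}
```

With a length-only flag, the lexer could keep just the offset and length
and skip the copy into $text.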


> Thoughts?
>  --scott
>
> ps. more ambitious would be to introduce a new "substring" type, which
> would share the allocation of a parent string with its own offset and
> length fields.  That would probably be as invasive as the ZVAL_INTERNED_STR
> type, though -- a much much bigger project.
>
> pps. while I'm wishing -- preg_replace would benefit from some way to pass
> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
> the offset for each replacement match.  Knowing the offset of the match
> allows you to do (for example) error reporting from the callback function.
>

I've implemented this bit in https://github.com/php/php-src/pull/3958.
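
A rough sketch of how this could be used for error reporting from the
callback, assuming the flags argument takes the form that later shipped
in PHP 7.4 (a sixth parameter to preg_replace_callback). The subject
string and error message format are made up for illustration.

```php
<?php
// Hypothetical input with one invalid token ("?").
$subject = 'a=1; b=?; c=3';
$errors = [];

$result = preg_replace_callback(
    '/\?/',
    function (array $m) use (&$errors) {
        // With PREG_OFFSET_CAPTURE each group is a [text, offset] pair.
        [$text, $offset] = $m[0];
        $errors[] = "unexpected '$text' at byte offset $offset";
        return '0'; // substitute a placeholder value
    },
    $subject,
    -1,      // no limit on replacements
    $count,  // receives the number of replacements made
    PREG_OFFSET_CAPTURE
);
```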

Nikita
