On Mon, Mar 18, 2019 at 2:43 PM Nikita Popov <nikita....@gmail.com> wrote:

> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org>
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made.  You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string.  In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily.  This would allow greatly reducing
>> the number of substring copies made during lexing.
>>
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?
>

After thinking about this some more: while this may be a minor performance
improvement, it still does more work than necessary. In particular, the use
of PREG_OFFSET_CAPTURE (which would be pretty much required here) allocates
one new two-element array for each subpattern. If the captured strings are
short, this is where the main cost is going to be.
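
To make the per-subpattern cost concrete, here is a minimal sketch of what
the existing API returns with PREG_OFFSET_CAPTURE. The input string and
pattern are made up for illustration only:

```php
<?php
// Hypothetical input and pattern, chosen only to show the result shape.
$source = '<div class="note">';
preg_match('/<(\w+)\s+(\w+)=/', $source, $matches, PREG_OFFSET_CAPTURE);

// Every entry is its own two-element array: [matched string, byte offset].
//   $matches[0] is ['<div class=', 0]
//   $matches[1] is ['div', 1]
//   $matches[2] is ['class', 5]
print_r($matches);
```

So even with a length flag in place of the string, each group would still
cost one small array.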

I'm wondering if we shouldn't consider a new object-oriented API for PCRE
that returns a match object on which subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.
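
A rough userland sketch of the shape such an API might take, assuming
PHP 8.0+. The names LazyMatch and lazy_match() are hypothetical and not part
of PHP or ext/pcre. Because this stand-in is built on preg_match(), it still
pays for today's copies internally; a native implementation would presumably
fill the spans straight from PCRE's offset vector. It only illustrates the
calling convention:

```php
<?php
// Hypothetical sketch; LazyMatch is not an existing PHP class.
final class LazyMatch
{
    /** @param array $spans group index => [byte offset, byte length] */
    public function __construct(private string $subject, private array $spans) {}

    public function offset(int $group = 0): int { return $this->spans[$group][0]; }
    public function length(int $group = 0): int { return $this->spans[$group][1]; }

    // The substring is only copied when the caller explicitly asks for it.
    public function text(int $group = 0): string
    {
        [$offset, $length] = $this->spans[$group];
        return substr($this->subject, $offset, $length);
    }
}

// Hypothetical helper: userland stand-in for a native matcher.
function lazy_match(string $pattern, string $subject, int $start = 0): ?LazyMatch
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $start) !== 1) {
        return null; // no match (or a PCRE error)
    }
    $spans = [];
    foreach ($m as $i => [$text, $offset]) {
        $spans[$i] = [$offset, strlen($text)];
    }
    return new LazyMatch($subject, $spans);
}

$match = lazy_match('/<(\w+)/', '<div class="note">');
echo $match->offset(1), ' ', $match->length(1), ' ', $match->text(1), "\n";
```

A lexer that only needs positions would call offset() and length() and
never touch text().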

Nikita
