On Mon, Mar 18, 2019 at 9:44 AM Nikita Popov <nikita....@gmail.com> wrote:

> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org>
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made.  You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string.  In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily.  This would allow greatly reducing the
>> number of substring copies made during lexing.
>>
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?
>

I'm going to work on emulating this today by changing as many of the
captures in remex-html to zero-length captures at the start/end of the
region; that should give me a reasonable idea of performance gain.
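
For illustration, here is a minimal sketch of the kind of emulation I have in
mind. The pattern and the helper name commentBodySpan are made up for this
example and are not taken from remex-html. The capturing group around the
region is replaced by two empty groups, and PREG_OFFSET_CAPTURE supplies
their offsets:

```php
<?php
// Hypothetical helper (not remex-html code): locate the body of an HTML
// comment starting at $pos and return [offset, length] instead of a copy.
function commentBodySpan(string $src, int $pos): ?array {
    // Instead of '<!--(.*?)-->', use two empty groups. Each captures a
    // zero-length string, so only their offsets carry information.
    // \G anchors the match at $pos.
    $re = '/\G<!--().*?()-->/s';
    if (preg_match($re, $src, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return null;
    }
    $start = $m[1][1];            // offset where the body begins
    $length = $m[2][1] - $start;  // offset where it ends, minus the start
    // Note: $m[0] still holds a copy of the whole match; this sketch only
    // avoids the copy for the captured region.
    return [$start, $length];
}

$src = "a<!-- hi -->b";
$span = commentBodySpan($src, 1);
```

This only removes the per-capture copies, which is why it is an approximation
of what a PREG_LENGTH_CAPTURE flag would save rather than an exact measure.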


>> pps. while I'm wishing -- preg_replace would benefit from some way to pass
>> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
>> the offset for each replacement match.  Knowing the offset of the match
>> allows you to do (for example) error reporting from the callback function.
>>
>
> I've implemented this bit in https://github.com/php/php-src/pull/3958.
>

I notice that this has been merged already.  Looks great, especially
breaking out the creation of the matches array into a reusable function.
That would make future additions easier/more consistent.
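
A rough sketch of the error-reporting use case, assuming a PHP build that
includes the new flags argument to preg_replace_callback (passed as the sixth
parameter). The input string and pattern are invented for illustration:

```php
<?php
// Double every numeric value; record the offset of anything non-numeric.
$src = "x = 1; y = oops; z = 3;";
$errors = [];

$out = preg_replace_callback(
    '/= (\w+);/',
    function (array $m) use (&$errors) {
        // With PREG_OFFSET_CAPTURE each entry is [text, byte offset].
        [$text, $offset] = $m[1];
        if (!is_numeric($text)) {
            $errors[] = "non-numeric value '$text' at byte offset $offset";
            return $m[0][0]; // leave the original text untouched
        }
        return '= ' . ((int)$text * 2) . ';';
    },
    $src,
    -1,      // no limit on the number of replacements
    $count,  // receives the number of replacements performed
    PREG_OFFSET_CAPTURE
);
```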
 --scott

-- 
(http://cscott.net)
