On Mon, Mar 18, 2019 at 9:44 AM Nikita Popov <nikita....@gmail.com> wrote:
> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org>
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made. You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string. In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily. This would allow greatly reducing the
>> number of substring copies made during lexing.
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?

I'm going to work on emulating this today by changing as many of the
captures in remex-html as possible to zero-length captures at the start/end
of the region; that should give me a reasonable idea of the performance gain.

>> pps. while I'm wishing -- preg_replace would benefit from some way to pass
>> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
>> the offset for each replacement match. Knowing the offset of the match
>> allows you to do (for example) error reporting from the callback function.
>
> I've implemented this bit in https://github.com/php/php-src/pull/3958.
>

I notice that this has been merged already. Looks great, especially breaking
out the creation of the matches array into a reusable function. That would
make future additions easier/more consistent.

--scott

--
(http://cscott.net)
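
To make the zero-length-capture emulation concrete, here is a minimal sketch. The helper name findCommentSpan and the simplified comment pattern are hypothetical and are not taken from remex-html. It only relies on PREG_OFFSET_CAPTURE as it exists today; note that preg_match still copies the whole match into $m[0], so this avoids only the copy of the captured body.

```php
<?php
// Hypothetical helper (not part of remex-html): locate the body of the first
// HTML comment at or after $start and return [offset, length], or null.
function findCommentSpan(string $source, int $start = 0): ?array
{
    // The two empty groups "()" capture positions only, so no body text is
    // copied into $m[1] or $m[2]. $m[0] (the whole match) is still copied.
    $pattern = '/<!--().*?()-->/s';

    if (preg_match($pattern, $source, $m, PREG_OFFSET_CAPTURE, $start) !== 1) {
        return null;
    }

    $bodyStart = $m[1][1]; // byte offset just after "<!--"
    $bodyEnd   = $m[2][1]; // byte offset just before "-->"

    return [$bodyStart, $bodyEnd - $bodyStart];
}

$html = 'a<!-- hello -->b';
[$offset, $length] = findCommentSpan($html);
echo $offset, ' ', $length, "\n"; // offset 5, length 7
```

The lexer can then keep working with ($offset, $length) pairs into the master source string and call substr() only when the text itself is actually needed.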