On Mon, Mar 18, 2019 at 2:43 PM Nikita Popov <nikita....@gmail.com> wrote:
> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org>
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
>> worthwhile to try to reduce the number of string copies made.  You can
>> generally perform matches using offsets into your master source string.
>> However, preg_match* will copy a substring for the entire matched region
>> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
>> These substring copies can get expensive if the matched region/captured
>> patterns are very large.
>>
>> It would be helpful if PHP's preg_match* functions offered a flag, say
>> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
>> matched/captured string.  In combination,
>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
>> element 0 and the numeric offset in element 1, and avoid the need to copy
>> the matched substring unnecessarily.  This would allow greatly reducing
>> the number of substring copies made during lexing.
>>
>
> Generally sounds reasonable to me. Do you maybe have a sample input and
> regular expression where you suspect this is a particularly large problem,
> so we can test how much of a difference this makes?
>

After thinking about this some more, while this may be a minor performance
improvement, it still does more work than necessary. In particular the use
of OFFSET_CAPTURE (which would be pretty much required here) needs one new
two-element array for each subpattern. If the captured strings are short,
this is where the main cost is going to be.

I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.

Nikita
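
P.S. To make the two points above concrete, here is a rough userland sketch.
The first part shows what preg_match() returns today with
PREG_OFFSET_CAPTURE: one two-element array per group. The second part is
purely hypothetical: LazyMatch and lazy_match() are names made up for
illustration, not an existing or proposed PHP API. It only shows the shape
of a match object whose positions and contents are queried on demand. Since
it is built on top of preg_match(), it saves nothing by itself; a native
implementation would presumably read the offsets from PCRE directly.

```php
<?php
// Today: PREG_OFFSET_CAPTURE yields one [substring, byte offset] array per group.
$subject = 'key=value';
preg_match('/(\w+)=(\w+)/', $subject, $m, PREG_OFFSET_CAPTURE);
// $m is [['key=value', 0], ['key', 0], ['value', 4]]

// Hypothetical match object (NOT an existing PHP class).
final class LazyMatch
{
    private $subject; // string: the original input, kept by reference count
    private $spans;   // array of [offset, length] pairs, keyed like $matches

    public function __construct(string $subject, array $spans)
    {
        $this->subject = $subject;
        $this->spans = $spans;
    }

    // Byte offset of the group within the subject.
    public function start(int $group = 0): int
    {
        return $this->spans[$group][0];
    }

    // Byte length of the group.
    public function length(int $group = 0): int
    {
        return $this->spans[$group][1];
    }

    // The substring is only built when the caller asks for it.
    public function text(int $group = 0): string
    {
        [$offset, $length] = $this->spans[$group];
        return substr($this->subject, $offset, $length);
    }
}

// Hypothetical helper: returns null when the pattern does not match.
function lazy_match(string $pattern, string $subject): ?LazyMatch
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }
    $spans = [];
    foreach ($m as $key => $group) {
        // $group is [substring, offset]; keep only offset and length.
        $spans[$key] = [$group[1], strlen($group[0])];
    }
    return new LazyMatch($subject, $spans);
}

$match = lazy_match('/(\w+)=(\w+)/', 'key=value');
echo $match->start(2), ' ', $match->length(2), ' ', $match->text(2), "\n"; // prints "4 5 value"
```

A lexer that only needs to advance its position would call start() and
length() and never touch text(), which is the "only pay for what you access"
part.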