On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org> wrote:
> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com>
> wrote:
>
>> After thinking about this some more, while this may be a minor
>> performance improvement, it still does more work than necessary. In
>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>> here) needs one new two-element array for each subpattern. If the captured
>> strings are short, this is where the main cost is going to be.
>
> The primary use of this feature is when the captured strings are *long*,
> as that's when we most want to avoid copying a substring.
>
>> I'm wondering if we shouldn't consider a new object oriented API for PCRE
>> which can return a match object where subpattern positions and contents can
>> be queried via method calls, so you only pay for the parts that you do
>> access.
>
> Seems like this is letting the perfect be the enemy of the good. The
> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
> just stores an integer where there would otherwise be an expensive
> substring. Furthermore, since the array structure is left mostly alone, it
> would be not-too-hard to support earlier-PHP versions, with something like:
>
>     $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE : 0;
>     $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>     $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>     $matchOneOffset = $m[1][1];
>
> If you introduce a whole new OO accessor object, it starts becoming very
> hard to write backward-compatible code.
> --scott

Fair enough. I've created https://github.com/php/php-src/pull/3971 to
implement this feature. It would be good to have some confirmation that
this is really a significant performance improvement before we land it
though.

Nikita