On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <canan...@wikimedia.org> wrote:
> I'm floating an idea for an RFC here.
>
> I'm working on the wikimedia/remex-html library for high-performance
> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
> worthwhile to try to reduce the number of string copies made. You can
> generally perform matches using offsets into your master source string.
> However, preg_match* will copy a substring for the entire matched region
> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
> These substring copies can get expensive if the matched region/captured
> patterns are very large.
>
> It would be helpful if PHP's preg_match* functions offered a flag, say
> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
> matched/captured string. In combination,
> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
> element 0 and the numeric offset in element 1, and avoid the need to copy
> the matched substring unnecessarily. This would allow greatly reducing the
> number of substring copies made during lexing.
>

Generally sounds reasonable to me. Do you maybe have a sample input and
regular expression where you suspect this is a particularly large problem,
so we can test how much of a difference this makes?

> Thoughts?
> --scott
>
> ps. more ambitious would be to introduce a new "substring" type, which
> would share the allocation of a parent string with its own offset and
> length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
> type, though -- a much much bigger project.
>
> pps. while I'm wishing -- preg_replace would benefit from some way to pass
> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
> the offset for each replacement match. Knowing the offset of the match
> allows you to do (for example) error reporting from the callback function.
>

I've implemented this bit in https://github.com/php/php-src/pull/3958.

Nikita
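
For reference, a minimal sketch of the offset-based lexing pattern discussed
above, using only what preg_match() offers today. The tokenize() helper and
its toy pattern are hypothetical and not taken from remex-html. Note that the
length of each token still has to be derived from the copied match string,
which is the copy the proposed PREG_LENGTH_CAPTURE flag is meant to avoid.

```php
<?php
// Hypothetical helper: splits $src into [offset, length] pairs without
// calling substr() on the remaining input.
function tokenize(string $src): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($src);
    while ($pos < $len) {
        // \G anchors the match at $pos; the fifth argument is the start offset.
        // Toy grammar: either a tag-like run "<...>" or a run of non-"<" bytes.
        if (!preg_match('/\G(?:<[^>]*>|[^<]+)/', $src, $m, PREG_OFFSET_CAPTURE, $pos)) {
            break; // no token at this position (or a PCRE error): stop lexing
        }
        // With PREG_OFFSET_CAPTURE each entry is [matched string, byte offset].
        [$text, $offset] = $m[0];
        // The length is only available via the copied substring today.
        $length = strlen($text);
        $tokens[] = [$offset, $length];
        $pos = $offset + $length;
    }
    return $tokens;
}
```

Only the [offset, length] pairs are kept here; the matched string in $m[0][0]
is allocated on every iteration and then discarded, so a benchmark on a large
input with long tokens would be one way to measure the cost in question.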