Re: [PHP-DEV] Offset-only results from preg_match

2019-03-28 Thread C. Scott Ananian
On Wed, Mar 27, 2019 at 5:41 PM C. Scott Ananian wrote: > I've created https://github.com/php/php-src/pull/3994 implementing this > fix, and confirmed that it is sufficient to get my large regexp interned > when it is rewritten as a class constant referencing > HTMLData::NAMED_ENTITY_REGEX. > Af

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-27 Thread C. Scott Ananian
On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian wrote: > Continuing this saga: I'm still having performance problems on character > entity expansion. Here's the baseline code: > https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881 > Of note: the regular exp

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-27 Thread C. Scott Ananian
Continuing this saga: I'm still having performance problems on character entity expansion. Here's the baseline code: https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881 Of note: the regular expression is quite large -- around 26kB -- because it needs to inclu

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-26 Thread Nikita Popov
On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian wrote: > Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x > with opcache ( https://phabricator.wikimedia.org/T176370 / > https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the > CLI to behave more like the

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-23 Thread C. Scott Ananian
Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x with opcache ( https://phabricator.wikimedia.org/T176370 / https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the CLI to behave more like the server wrt interned strings. It certainly would make benchmarking

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-23 Thread Nikita Popov
On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian wrote: > So... > > In microbenchmarks you can clearly see the improvement: > ``` > >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, > PREG_OFFSET_CAPTURE); > => 39 > Command took 0.001709 seconds on average (0.001654 median; 0.854503

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-22 Thread C. Scott Ananian
So... In microbenchmarks you can clearly see the improvement: ``` >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, PREG_OFFSET_CAPTURE); => 39 Command took 0.001709 seconds on average (0.001654 median; 0.854503 total) to complete. >>> timeit -n500 preg_match_all('/(.{65535})/s', $htm

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-21 Thread C. Scott Ananian
ps. Just to put some numbers to it, using `psysh` on $html100 which contains the (Parsoid format) HTML for the [[en:Barack Obama]] article on Wikipedia. ``` >>> strlen($html100) => 2592386 >>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m, PREG_OFFSET_CAPTURE ); => 22062 Command took 0.00864
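
For reference, a small sketch of the result shape that `PREG_OFFSET_CAPTURE` produces. The short subject here is a stand-in chosen for illustration, not the `$html100` document from the measurement above.

```php
<?php
declare(strict_types=1);

// Stand-in subject (assumption); the thread measured a ~2.5 MB HTML string.
$subject = 'abcabc';

// Same call shape as in the measurement: one capturing group, offsets requested.
$count = preg_match_all('/(b)/', $subject, $m, PREG_OFFSET_CAPTURE);

// With the default PREG_PATTERN_ORDER, $m[0] lists the full matches and
// $m[1] lists group 1. Each entry is a two-element array: the matched
// text (a copied string) followed by its byte offset in $subject.
var_dump($count);  // int(2)
var_dump($m[1]);   // [['b', 1], ['b', 4]]
```

Every match therefore costs one small array plus one string, even when the caller only wants the offset.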

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-21 Thread C. Scott Ananian
Quick status update. I tried to prototype this in pure PHP in the wikimedia/remex-html library using (?= .. ) around each regexp and ()...() around each captured expression (replacing the capture parens) to effectively bypass the string copy and return a bunch of zero-length strings. That didn't
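
A minimal sketch of the wrapping trick described above, assuming a toy `key=value` pattern rather than the actual remex-html regexes. The whole pattern goes inside a lookahead, and each capturing group `(X)` is rewritten as `()X()`, so every reported capture is an empty string whose offset marks a boundary.

```php
<?php
declare(strict_types=1);

// Toy subject (assumption for illustration only).
$subject = 'key=value';

// Conventional form would be '/(\w+)=(\w+)/'. In the rewritten form the
// overall match is zero-length (lookahead), and the empty groups act as
// start/end markers around what used to be each capture.
$wrapped = '/(?=()\w+()=()\w+())/';

preg_match($wrapped, $subject, $m, PREG_OFFSET_CAPTURE);

// Drop index 0 (the empty overall match) and keep only the offsets.
$offsets = array_column(array_slice($m, 1), 1);
// $offsets is [0, 3, 4, 9]: start/end of "key", then start/end of "value".

// Substrings are only materialised when the caller asks for them.
$key   = substr($subject, $offsets[0], $offsets[1] - $offsets[0]);
$value = substr($subject, $offsets[2], $offsets[3] - $offsets[2]);
```

This doubles the number of groups and still allocates one `['', offset]` pair per marker, so it only removes the substring copies, not the per-capture arrays.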

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-21 Thread Nikita Popov
On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian wrote: > On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov > wrote: > >> After thinking about this some more, while this may be a minor >> performance improvement, it still does more work than necessary. In >> particular the use of OFFSET_CAPTURE (whi

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-20 Thread C. Scott Ananian
On Mon, Mar 18, 2019 at 9:44 AM Nikita Popov wrote: > On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian > wrote: > >> I'm floating an idea for an RFC here. >> >> I'm working on the wikimedia/remex-html library for high-performance >> PHP-native HTML5 parsing. When creating a high-performance lex

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-20 Thread C. Scott Ananian
On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov wrote: > After thinking about this some more, while this may be a minor performance > improvement, it still does more work than necessary. In particular the use > of OFFSET_CAPTURE (which would be pretty much required here) needs one new > two-element

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-20 Thread Markus Fischer
On 19.03.19 15:58, Nikita Popov wrote: I'm wondering if we shouldn't consider a new object oriented API for PCRE which can return a match object where subpattern positions and contents can be queried via method calls, so you only pay for the parts that you do access. Or also a literal syntax wo
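
One way to picture the match-object idea is the userland sketch below. The class name `LazyMatch` and its methods are hypothetical and not an existing or proposed PHP API. It also still calls `preg_match()` with `PREG_OFFSET_CAPTURE` internally, so it illustrates only the shape of the interface, not the performance benefit a native implementation could have.

```php
<?php
declare(strict_types=1);

// Hypothetical match object: stores the subject and group spans, and
// extracts substrings only when group() is called.
final class LazyMatch
{
    /** @var string */
    private $subject;
    /** @var array<int|string, array{0:int,1:int}> start offset and length per group */
    private $spans;

    private function __construct(string $subject, array $spans)
    {
        $this->subject = $subject;
        $this->spans = $spans;
    }

    public static function match(string $pattern, string $subject, int $offset = 0): ?self
    {
        if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
            return null;
        }
        $spans = [];
        foreach ($m as $key => [$text, $start]) {
            $spans[$key] = [$start, strlen($text)];
        }
        return new self($subject, $spans);
    }

    public function start($group = 0): int
    {
        return $this->spans[$group][0];
    }

    public function end($group = 0): int
    {
        return $this->spans[$group][0] + $this->spans[$group][1];
    }

    public function group($group = 0): string
    {
        [$start, $length] = $this->spans[$group];
        return substr($this->subject, $start, $length);
    }
}

$match = LazyMatch::match('/(\w+)=(\w+)/', 'x: key=value');
var_dump($match->start(1), $match->end(1)); // int(3), int(6)
var_dump($match->group(2));                 // string(5) "value"
```

A lexer that only needs positions would call `start()` and `end()` and never pay for `group()`.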

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-19 Thread Christoph M. Becker
On 19.03.2019 at 15:58, Nikita Popov wrote: > I'm wondering if we shouldn't consider a new object oriented API for PCRE > which can return a match object where subpattern positions and contents can > be queried via method calls, so you only pay for the parts that you do > access. +1 -- Christoph

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-19 Thread Nikita Popov
On Mon, Mar 18, 2019 at 2:43 PM Nikita Popov wrote: > On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian > wrote: > >> I'm floating an idea for an RFC here. >> >> I'm working on the wikimedia/remex-html library for high-performance >> PHP-native HTML5 parsing. When creating a high-performance lex

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-18 Thread Nikita Popov
On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian wrote: > I'm floating an idea for an RFC here. > > I'm working on the wikimedia/remex-html library for high-performance > PHP-native HTML5 parsing. When creating a high-performance lexer, it is > worthwhile to try to reduce the number of string co

Re: [PHP-DEV] Offset-only results from preg_match

2019-03-16 Thread Markus Fischer
On 14.03.19 20:33, C. Scott Ananian wrote: ps. more ambitious would be to introduce a new "substring" type, which would share the allocation of a parent string with its own offset and length fields. That would probably be as invasive as the ZVAL_INTERNED_STR type, though -- a much much bigger pr