On Wed, Mar 27, 2019 at 5:41 PM C. Scott Ananian
wrote:
> I've created https://github.com/php/php-src/pull/3994 implementing this
> fix, and confirmed that it is sufficient to get my large regexp interned
> when it is rewritten as a class constant referencing
> HTMLData::NAMED_ENTITY_REGEX.
>
Af
On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian
wrote:
> Continuing this saga: I'm still having performance problems on character
> entity expansion. Here's the baseline code:
> https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
> Of note: the regular exp
Continuing this saga: I'm still having performance problems on character
entity expansion. Here's the baseline code:
https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
Of note: the regular expression is quite large -- around 26kB -- because it
needs to inclu
On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian
wrote:
> Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
> with opcache ( https://phabricator.wikimedia.org/T176370 /
> https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the
> CLI to behave more like the
Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
with opcache ( https://phabricator.wikimedia.org/T176370 /
https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the
CLI to behave more like the server wrt interned strings. It certainly
would make benchmarking
On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian
wrote:
> So...
>
> In microbenchmarks you can clearly see the improvement:
> ```
> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
> PREG_OFFSET_CAPTURE);
> => 39
> Command took 0.001709 seconds on average (0.001654 median; 0.854503
So...
In microbenchmarks you can clearly see the improvement:
```
>>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
PREG_OFFSET_CAPTURE);
=> 39
Command took 0.001709 seconds on average (0.001654 median; 0.854503 total)
to complete.
>>> timeit -n500 preg_match_all('/(.{65535})/s', $htm
ps. Just to put some numbers to it, using `psysh` on $html100 which
contains the (Parsoid format) HTML for the [[en:Barack Obama]] article on
Wikipedia.
```
>>> strlen($html100)
=> 2592386
>>> timeit -n1000 preg_match_all( '/(b)/', $html100, $m,
PREG_OFFSET_CAPTURE );
=> 22062
Command took 0.00864
Quick status update. I tried to prototype this in pure PHP in the
wikimedia/remex-html library using (?= .. ) around each regexp and ()...()
around each captured expression (replacing the capture parens) to
effectively bypass the string copy and return a bunch of zero-length
strings. That didn't
On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian
wrote:
> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov
> wrote:
>
>> After thinking about this some more, while this may be a minor
>> performance improvement, it still does more work than necessary. In
>> particular the use of OFFSET_CAPTURE (whi
On Mon, Mar 18, 2019 at 9:44 AM Nikita Popov wrote:
> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing. When creating a high-performance lex
On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov wrote:
> After thinking about this some more, while this may be a minor performance
> improvement, it still does more work than necessary. In particular the use
> of OFFSET_CAPTURE (which would be pretty much required here) needs one new
> two-element
On 19.03.19 15:58, Nikita Popov wrote:
I'm wondering if we shouldn't consider a new object oriented API for PCRE
which can return a match object where subpattern positions and contents can
be queried via method calls, so you only pay for the parts that you do
access.
Or also a literal syntax wo
On 19.03.2019 at 15:58, Nikita Popov wrote:
> I'm wondering if we shouldn't consider a new object oriented API for PCRE
> which can return a match object where subpattern positions and contents can
> be queried via method calls, so you only pay for the parts that you do
> access.
+1
--
Christoph
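
For illustration only, the "pay only for what you access" idea could look roughly like the userland sketch below. `LazyMatch` and its methods `find()`, `start()`, `length()` and `text()` are hypothetical names, not an existing or proposed PCRE API. Because it is built on `preg_match()` with `PREG_OFFSET_CAPTURE`, it still pays the full cost of building the match array; it only shows the shape of an interface where substrings are materialised on demand.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: LazyMatch is NOT part of PHP or ext/pcre.
final class LazyMatch
{
    /** @var string The subject string the match refers to. */
    private $subject;

    /** @var array Map of group (index or name) => [byte offset, byte length]. */
    private $spans;

    private function __construct(string $subject, array $spans)
    {
        $this->subject = $subject;
        $this->spans = $spans;
    }

    /** Returns a match object, or null if the pattern does not match. */
    public static function find(string $pattern, string $subject, int $offset = 0): ?self
    {
        if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
            return null;
        }
        $spans = [];
        foreach ($m as $group => [$text, $start]) {
            // Keep only positions; the captured text is dropped here.
            $spans[$group] = [$start, strlen($text)];
        }
        return new self($subject, $spans);
    }

    /** Byte offset of a group, or -1 if the group did not participate. */
    public function start($group = 0): int
    {
        return $this->spans[$group][0] ?? -1;
    }

    /** Byte length of a group, or 0 if the group did not participate. */
    public function length($group = 0): int
    {
        return $this->spans[$group][1] ?? 0;
    }

    /** The captured text is only copied out of the subject when asked for. */
    public function text($group = 0): ?string
    {
        $start = $this->start($group);
        return $start < 0 ? null : substr($this->subject, $start, $this->length($group));
    }
}

// Usage: a lexer that only needs positions never calls text().
$m = LazyMatch::find('/(?<name>[a-z]+)=(\d+)/', 'x; width=42;');
if ($m !== null) {
    printf("match at %d, name=%s, value=%s\n", $m->start(), $m->text('name'), $m->text(2));
}
```

A native implementation would presumably keep PCRE's offset vector and never build the intermediate array at all; that part cannot be reproduced from userland.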
On Mon, Mar 18, 2019 at 2:43 PM Nikita Popov wrote:
> On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian
> wrote:
>
>> I'm floating an idea for an RFC here.
>>
>> I'm working on the wikimedia/remex-html library for high-performance
>> PHP-native HTML5 parsing. When creating a high-performance lex
On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian
wrote:
> I'm floating an idea for an RFC here.
>
> I'm working on the wikimedia/remex-html library for high-performance
> PHP-native HTML5 parsing. When creating a high-performance lexer, it is
> worthwhile to try to reduce the number of string co
On 14.03.19 20:33, C. Scott Ananian wrote:
ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields. That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger pr