Re: [PHP-DEV] Offset-only results from preg_match

Nikita Popov Tue, 26 Mar 2019 02:31:26 -0700

On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian <canan...@wikimedia.org>
wrote:


> Yup, testing via CLI but Wikimedia will (eventually) be running PHP 7.x
> with opcache ( https://phabricator.wikimedia.org/T176370 /
> https://phabricator.wikimedia.org/T211964 ).  It would be nice to fix the
> CLI to behave more like the server wrt interned strings.  It certainly
> would make benchmarking easier/more representative, but we also have a
> bunch of testing/CI stuff running in CLI mode so speedups there are
> definitely useful, even if they don't "speed up Wikipedia" per se.
>
> Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl
> costs 492,077,935 cycles in this benchmark, of which only 211,626,095 are
> spent doing the actual php_pcre2_jit_match.  140,000,069 cycles are spent
> in populate_subpat_array and 108,742,309 are spent in the zval_ptr_dtor
> call on the penultimate line -- basically freeing the $matches array from
> the previous call to pcre_match (and this is with PREG_LENGTH_CAPTURE).  So
> there's still (in theory) a >2x speedup in preg matching available by
> tuning how the subpat_array works and making it less costly to
> allocate/free $matches.
>   --scott
>

The aforementioned optimization to the cache lookup on CLI is now
implemented via https://github.com/php/php-src/pull/3985.

There is more that can be improved here, but it falls more into the realm
of micro-optimization... For example, we can specialize the creation of the
offset pairs to reduce the overhead (
https://github.com/php/php-src/pull/3990), but this still remains the most
expensive part of the subpat population...

Nikita


> On Sat, Mar 23, 2019 at 6:13 AM Nikita Popov <nikita....@gmail.com> wrote:
>
>> On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian <canan...@wikimedia.org>
>> wrote:
>>
>>> So...
>>>
>>> In microbenchmarks you can clearly see the improvement:
>>> ```
>>> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
>>> PREG_OFFSET_CAPTURE);
>>> => 39
>>> Command took 0.001709 seconds on average (0.001654 median; 0.854503
>>> total) to complete.
>>> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m,
>>> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
>>> => 39
>>> Command took 0.000991 seconds on average (0.000965 median; 0.495287
>>> total) to complete.
>>> ```
>>> $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]].
>>>
>>> But unless you're matching 64k strings like this, it's hard to see a
>>> practical improvement.  In my remex-html html parsing benchmark, although
>>> LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant
>>> performance improvement.  I built php statically and ran it through
>>> cachegrind to try to figure out why, and found:
>>>
>>> 2,567,559,670 cycles: total spent executing the tokenizer benchmark
>>> (including reading the file from disk)
>>> 1,018,845,290 cycles in zif_preg_match.  Optimizing regexps is important
>>> for tokenizers! Of these, we spend
>>>   575,478,637 doing the actual match (preg_pcre_match_impl) and
>>>   435,162,131 getting the regexp from the cache (!)
>>> (pcre_get_compiled_regex_cache)
>>>
>>> This is for 128,794 total regexp matches performed by the tokenizer on
>>> this input.
>>>
>>> Of those cycles getting the regex from cache, only 24,116,319 are spent
>>> on cache misses where we are actually compiling regexps (sum of
>>> php_pcre2_jit_compile and php_pcre2_compile).
>>> Instead, 406,007,383 cycles are spent in zend_hash_find().  That's 40%
>>> of the total time spent executing preg_match.
>>>
>>> The LENGTH_CAPTURE optimization does reduce the total time spent in
>>> populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the
>>> remex-html tokenizer on this benchmark, but that difference is overwhelmed
>>> by (for example) the time spent in zend_hash_find().
>>>
>>> The slowdown in zend_hash_find() appears to be due to
>>> https://bugs.php.net/bug.php?id=63180 which disabled interned keys in
>>> the pcre_cache table.  Because of this, even if the regexs are interned, we
>>> must pay for a complete zend_string_equal_content() on each match, which
>>> takes time proportional to the length of the regex.  This can be quite
>>> large -- for example, for the HTML5 character reference regex in
>>> remex-html, which contains every valid entity name and is 26,137 bytes
>>> long, and we need to do a zend_string_equal_content() on the 26,137 byte
>>> regexp for every ~5 byte entity in the parsed HTML.
>>>   --scott
>>>
>>
>> Thanks for testing! That's an interesting result. We should be able to do
>> something about this. There are basically three cases:
>>
>> 1. CLI (presumably what you're testing). Strings are interned
>> per-request, but there is only one request.
>> 2. Server w/o opcache. Strings are interned per-request and there may be
>> multiple requests.
>> 3. Server with opcache. Strings are interned permanently in opcache.
>>
>> Case 3 should already be fast, because permanent interned strings are
>> allowed into the regex cache. We can optimize case 1 by simply allowing
>> arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not
>> be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...
>>
>> Nikita
>>
>>
>>> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com>
>>> wrote:
>>>
>>>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <
>>>> canan...@wikimedia.org> wrote:
>>>>
>>>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> After thinking about this some more, while this may be a minor
>>>>>> performance improvement, it still does more work than necessary. In
>>>>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>>>>> here) needs one new two-element array for each subpattern. If the 
>>>>>> captured
>>>>>> strings are short, this is where the main cost is going to be.
>>>>>>
>>>>>
>>>>> The primary use of this feature is when the captured strings are
>>>>> *long*, as that's when we most want to avoid copying a substring.
>>>>>
>>>>>
>>>>>> I'm wondering if we shouldn't consider a new object oriented API for
>>>>>> PCRE which can return a match object where subpattern positions and
>>>>>> contents can be queried via method calls, so you only pay for the parts
>>>>>> that you do access.
>>>>>>
>>>>>
>>>>> Seems like this is letting the perfect be the enemy of the good.  The
>>>>> LENGTH_CAPTURE significantly reduces allocation for long match strings, 
>>>>> and
>>>>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>>>>> just stores an integer where there would otherwise be an expensive
>>>>> substring.  Furthermore, since the array structure is left mostly alone, 
>>>>> it
>>>>> would be not-too-hard to support earlier-PHP versions, with something 
>>>>> like:
>>>>>
>>>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ?
>>>>> PREG_LENGTH_CAPTURE : 0;
>>>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE |
>>>>> $hasLengthCapture);
>>>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>>>> $matchOneOffset = $m[1][1];
>>>>>
>>>>> If you introduce a whole new OO accessor object, it starts becoming
>>>>> very hard to write backward-compatible code.
>>>>>  --scott
>>>>>
>>>>
>>>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
>>>> implement this feature. It would be good to have some confirmation that
>>>> this is really a significant performance improvement before we land it
>>>> though.
>>>>
>>>> Nikita
>>>>
>>>
>>>
>>> --
>>> (http://cscott.net)
>>>
>>
>
> --
> (http://cscott.net)
>

Re: [PHP-DEV] Offset-only results from preg_match

Reply via email to