On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian <canan...@wikimedia.org> wrote:
> So... > > In microbenchmarks you can clearly see the improvement: > ``` > >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, > PREG_OFFSET_CAPTURE); > => 39 > Command took 0.001709 seconds on average (0.001654 median; 0.854503 total) > to complete. > >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, > PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE); > => 39 > Command took 0.000991 seconds on average (0.000965 median; 0.495287 total) > to complete. > ``` > $html100 is my usual 2,592,386 byte HTML of [[en:Barack Obama]]. > > But unless you're matching 64k strings like this, it's hard to see a > practical improvement. In my remex-html html parsing benchmark, although > LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant > performance improvement. I built php statically and ran it through > cachegrind to try to figure out why, and found: > > 2,567,559,670 cycles: total spent executing the tokenizer benchmark > (including reading the file from disk) > 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important > for tokenizers! Of these, we spend > 575,478,637 doing the actual match (preg_pcre_match_impl) and > 435,162,131 getting the regexp from the cache (!) > (pcre_get_compiled_regex_cache) > > This is for 128,794 total regexp matches performed by the tokenizer on > this input. > > Of those cycles getting the regex from cache, only 24,116,319 are spent on > cache misses where we are actually compiling regexps (sum of > php_pcre2_jit_compile and php_pcre2_compile). > Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of > the total time spent executing preg_match. > > The LENGTH_CAPTURE optimization does reduce the total time spent in > populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the > remex-html tokenizer on this benchmark, but that difference is overwhelmed > by (for example) the time spent in zend_hash_find(). > > The slowdown in zend_hash_find() appears to be due to > https://bugs.php.net/bug.php?id=63180 which disabled interned keys in the > pcre_cache table. Because of this, even if the regexs are interned, we > must pay for a complete zend_string_equal_content() on each match, which > takes time proportional to the length of the regex. This can be quite > large -- for example, for the HTML5 character reference regex in > remex-html, which contains every valid entity name and is 26,137 bytes > long, and we need to do a zend_string_equal_content() on the 26,137 byte > regexp for every ~5 byte entity in the parsed HTML. > --scott > Thanks for testing! That's an interesting result. We should be able to do something about this. There are basically three cases: 1. CLI (presumably what you're testing). Strings are interned per-request, but there is only one request. 2. Server w/o opcache. Strings are interned per-request and there may be multiple requests. 3. Server with opcache. Strings are interned permanently in opcache. Case 3 should already be fast, because permanent interned strings are allowed into the regex cache. We can optimize case 1 by simply allowing arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not be needed anymore anyway. Case 2 would remain slow, but it's slow anyway... Nikita > On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com> wrote: > >> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org> >> wrote: >> >>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com> >>> wrote: >>> >>>> After thinking about this some more, while this may be a minor >>>> performance improvement, it still does more work than necessary. In >>>> particular the use of OFFSET_CAPTURE (which would be pretty much required >>>> here) needs one new two-element array for each subpattern. If the captured >>>> strings are short, this is where the main cost is going to be. >>>> >>> >>> The primary use of this feature is when the captured strings are *long*, >>> as that's when we most want to avoid copying a substring. >>> >>> >>>> I'm wondering if we shouldn't consider a new object oriented API for >>>> PCRE which can return a match object where subpattern positions and >>>> contents can be queried via method calls, so you only pay for the parts >>>> that you do access. >>>> >>> >>> Seems like this is letting the perfect be the enemy of the good. The >>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and >>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it >>> just stores an integer where there would otherwise be an expensive >>> substring. Furthermore, since the array structure is left mostly alone, it >>> would be not-too-hard to support earlier-PHP versions, with something like: >>> >>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE >>> : 0; >>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture); >>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]); >>> $matchOneOffset = $m[1][1]; >>> >>> If you introduce a whole new OO accessor object, it starts becoming very >>> hard to write backward-compatible code. >>> --scott >>> >> >> Fair enough. I've created https://github.com/php/php-src/pull/3971 to >> implement this feature. It would be good to have some confirmation that >> this is really a significant performance improvement before we land it >> though. >> >> Nikita >> > > > -- > (http://cscott.net) >