Continuing this saga: I'm still having performance problems with character entity expansion. Here's the baseline code: https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881 Of note: the regular expression is quite large -- around 26kB -- because it needs to include the complete table of all HTML5 named entities, which it gets from a separate file of tables, HTMLData.php.
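To see why the pattern length matters at all: the tokenizer calls preg_match() with this same ~26kB pattern once per character reference in the input. A rough sketch of that call pattern (hypothetical code, not the actual remex-html loop; Tokenizer::CHAR_REF_REGEXP stands for the character reference regexp discussed in the recap below):

```
// Hypothetical sketch, not the actual remex-html tokenizer loop. The
// point is that the same ~26kB pattern string is passed to preg_match()
// on every call, so any per-call cost proportional to the *pattern*
// length (like comparing the regex-cache key) swamps the actual
// matching work on ~5-byte entities.
$pattern = Tokenizer::CHAR_REF_REGEXP; // includes the full entity table
$text = $inputHtml;                    // the HTML being tokenized (assumed)
$pos = 0;
while (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $pos) === 1) {
    [$entity, $offset] = $m[0];
    // ... decode and emit the character reference ...
    $pos = $offset + strlen($entity);
}
```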
Recapping briefly: we established before that it is very important that large regex strings be interned; otherwise pcre_get_compiled_regex_cache needs to do a full zend_string_equal_content() on every call to preg_match*, and since the strings will match, that costs a complete traversal of the 26kB regexp string.

If I inline the char ref table directly into the regexp as a single huge literal string, that string is interned and (with Nikita's recent fixes for the CLI) things are ok. But that's bad for code maintainability: it violates Don't Repeat Yourself, and it makes it much harder to see what the character reference regexp is doing, because it has this huge 26kB table embedded in the middle of it.

PHP will let me initialize the string as:

    const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "...";

that is, it recognizes this as a compile-time constant -- but it doesn't actually intern the resulting string. The code in zend_declare_class_constant_ex interns *most* constant strings, but in this case, because there is a reference to another constant, the Z_TYPE_P(value) == IS_STRING check in zend_declare_class_constant_ex fails: the actual type is IS_CONSTANT_AST, presumably because we don't want to autoload HTMLData too soon. (But this also seems to happen even if I use self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.)

I *think* the proper fix is to intern the string lazily when it is finally evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER, around the point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by tweaking zval_update_constant_ex to intern any string result? Thought I'd run that thought process by folks before I go further.
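Concretely, the two alternatives look like this (a sketch; the entity table itself is elided, and CHAR_REF_REGEXP_INLINED is a hypothetical name for the workaround variant):

```
class Tokenizer {
    // Workaround: paste the whole entity table into one literal. The
    // string is interned at compile time, so the pcre cache-key check
    // is cheap, but the ~26kB table is now duplicated inside the
    // pattern and the regexp is unreadable.
    const CHAR_REF_REGEXP_INLINED = '/ ...26kB of entity names... /';

    // DRY version: accepted as a compile-time constant expression, but
    // because the initializer references another constant it is stored
    // as IS_CONSTANT_AST, and the string produced when it is finally
    // evaluated is never interned -- so every preg_match() call pays a
    // full zend_string_equal_content() over the ~26kB cache key.
    const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . '...';
}
```

--scott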
On Tue, Mar 26, 2019 at 5:31 AM Nikita Popov <nikita....@gmail.com> wrote:

> On Sat, Mar 23, 2019 at 4:07 PM C. Scott Ananian <canan...@wikimedia.org> wrote:
>
>> Yup, testing via CLI, but Wikimedia will (eventually) be running PHP 7.x with opcache ( https://phabricator.wikimedia.org/T176370 / https://phabricator.wikimedia.org/T211964 ). It would be nice to fix the CLI to behave more like the server wrt interned strings. It certainly would make benchmarking easier/more representative, but we also have a bunch of testing/CI stuff running in CLI mode, so speedups there are definitely useful, even if they don't "speed up Wikipedia" per se.
>>
>> Ignoring the zend_hash_find costs for the moment, php_pcre_match_impl costs 492,077,935 cycles in this benchmark, of which only 211,626,095 are spent doing the actual php_pcre2_jit_match. 140,000,069 cycles are spent in populate_subpat_array, and 108,742,309 are spent in the zval_ptr_dtor call on the penultimate line -- basically freeing the $matches array from the previous call to preg_match (and this is with PREG_LENGTH_CAPTURE). So there's still (in theory) a >2x speedup in preg matching available by tuning how the subpat_array works and making it less costly to allocate/free $matches.
>>
>> --scott
>
> The aforementioned optimization to the cache lookup on CLI is now implemented via https://github.com/php/php-src/pull/3985.
>
> There is more that can be improved here, but it falls more into the realm of micro-optimization... For example, we can specialize the creation of the offset pairs to reduce the overhead ( https://github.com/php/php-src/pull/3990 ), but this still remains the most expensive part of the subpat population...
>
> Nikita
>
>> On Sat, Mar 23, 2019 at 6:13 AM Nikita Popov <nikita....@gmail.com> wrote:
>>
>>> On Sat, Mar 23, 2019 at 6:32 AM C. Scott Ananian <canan...@wikimedia.org> wrote:
>>>
>>>> So...
>>>>
>>>> In microbenchmarks you can clearly see the improvement:
>>>>
>>>> ```
>>>> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, PREG_OFFSET_CAPTURE);
>>>> => 39
>>>> Command took 0.001709 seconds on average (0.001654 median; 0.854503 total) to complete.
>>>> >>> timeit -n500 preg_match_all('/(.{65535})/s', $html100, $m, PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE);
>>>> => 39
>>>> Command took 0.000991 seconds on average (0.000965 median; 0.495287 total) to complete.
>>>> ```
>>>>
>>>> $html100 is my usual 2,592,386-byte HTML of [[en:Barack Obama]].
>>>>
>>>> But unless you're matching 64k strings like this, it's hard to see a practical improvement. In my remex-html HTML parsing benchmark, although LENGTH_CAPTURE doesn't make things slower, it doesn't make a significant performance improvement either. I built php statically and ran it through cachegrind to try to figure out why, and found:
>>>>
>>>> 2,567,559,670 cycles: total spent executing the tokenizer benchmark (including reading the file from disk)
>>>> 1,018,845,290 cycles in zif_preg_match. Optimizing regexps is important for tokenizers! Of these, we spend
>>>> 575,478,637 doing the actual match (preg_pcre_match_impl) and
>>>> 435,162,131 getting the regexp from the cache (!) (pcre_get_compiled_regex_cache)
>>>>
>>>> This is for 128,794 total regexp matches performed by the tokenizer on this input.
>>>>
>>>> Of those cycles getting the regex from cache, only 24,116,319 are spent on cache misses where we are actually compiling regexps (the sum of php_pcre2_jit_compile and php_pcre2_compile). Instead, 406,007,383 cycles are spent in zend_hash_find(). That's 40% of the total time spent executing preg_match.
>>>>
>>>> The LENGTH_CAPTURE optimization does reduce the total time spent in populate_subpat_array from 160,951,690 cycles to 140,042,331 cycles in the remex-html tokenizer on this benchmark, but that difference is overwhelmed by (for example) the time spent in zend_hash_find().
>>>>
>>>> The slowdown in zend_hash_find() appears to be due to https://bugs.php.net/bug.php?id=63180, which disabled interned keys in the pcre_cache table. Because of this, even if the regexes are interned, we must pay for a complete zend_string_equal_content() on each match, which takes time proportional to the length of the regex. This can be quite large: the HTML5 character reference regex in remex-html contains every valid entity name and is 26,137 bytes long, so we need to do a zend_string_equal_content() on that 26,137-byte regexp for every ~5-byte entity in the parsed HTML.
>>>>
>>>> --scott
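The cache-key effect described in the quoted benchmark above is straightforward to reproduce in isolation. A minimal sketch, with an invented pattern standing in for the real 26kB char-ref regexp, run on a build without the PR #3985 fix (not code from the thread):

```
// Both patterns are compiled once and then served from the pcre cache,
// but the runtime-built key below is not interned, so on an affected
// build every preg_match() re-compares the full ~26kB key contents in
// zend_string_equal_content() just to find the cache entry.
$short = '/&(a|b|c);/';
$long  = '/&(' . str_repeat('x', 26000) . '|a|b|c);/';

foreach (['short' => $short, 'long' => $long] as $label => $pat) {
    $t0 = hrtime(true);
    for ($i = 0; $i < 100000; $i++) {
        preg_match($pat, '&b;', $m);
    }
    printf("%s: %.0f ns/call\n", $label, (hrtime(true) - $t0) / 100000);
}
// Expected on an affected build: the long pattern is markedly slower
// per call, even though both matches do essentially the same work.
```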
>>> Thanks for testing! That's an interesting result. We should be able to do something about this. There are basically three cases:
>>>
>>> 1. CLI (presumably what you're testing). Strings are interned per-request, but there is only one request.
>>> 2. Server w/o opcache. Strings are interned per-request, and there may be multiple requests.
>>> 3. Server with opcache. Strings are interned permanently in opcache.
>>>
>>> Case 3 should already be fast, because permanent interned strings are allowed into the regex cache. We can optimize case 1 by simply allowing arbitrary cache keys and discarding the cache in RSHUTDOWN -- it will not be needed anymore anyway. Case 2 would remain slow, but it's slow anyway...
>>>
>>> Nikita
>>>
>>>> On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com> wrote:
>>>>
>>>>> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org> wrote:
>>>>>
>>>>>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com> wrote:
>>>>>>
>>>>>>> After thinking about this some more: while this may be a minor performance improvement, it still does more work than necessary. In particular, the use of OFFSET_CAPTURE (which would be pretty much required here) needs one new two-element array for each subpattern. If the captured strings are short, this is where the main cost is going to be.
>>>>>>
>>>>>> The primary use of this feature is when the captured strings are *long*, as that's when we most want to avoid copying a substring.
>>>>>>
>>>>>>> I'm wondering if we shouldn't consider a new object-oriented API for PCRE which can return a match object where subpattern positions and contents can be queried via method calls, so you only pay for the parts that you do access.
>>>>>>
>>>>>> Seems like this is letting the perfect be the enemy of the good. LENGTH_CAPTURE significantly reduces allocation for long match strings, and it allocates the same two-element arrays that OFFSET_CAPTURE would -- it just stores an integer where there would otherwise be an expensive substring. Furthermore, since the array structure is left mostly alone, it would be not-too-hard to support earlier PHP versions, with something like:
>>>>>>
>>>>>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE : 0;
>>>>>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>>>>>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>>>>>> $matchOneOffset = $m[1][1];
>>>>>>
>>>>>> If you introduce a whole new OO accessor object, it starts becoming very hard to write backward-compatible code.
>>>>>>
>>>>>> --scott
>>>>>
>>>>> Fair enough. I've created https://github.com/php/php-src/pull/3971 to implement this feature. It would be good to have some confirmation that this is really a significant performance improvement before we land it, though.
>>>>>
>>>>> Nikita
>>>>
>>>> --
>>>> (http://cscott.net)
>>
>> --
>> (http://cscott.net)

--
(http://cscott.net)
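As a closing illustration of the $matches shape that the quoted LENGTH_CAPTURE discussion trades on, a minimal sketch assuming a PHP build that includes PREG_LENGTH_CAPTURE (php-src PR #3971); the pattern and subject are invented:

```
// With OFFSET_CAPTURE alone, each capture is [substring, offset]: the
// engine must allocate and copy every matched substring.
preg_match('/<(\w+)/', '<title>', $m, PREG_OFFSET_CAPTURE);
// $m[1] === ['title', 1]

// Adding LENGTH_CAPTURE keeps the same two-element array shape but
// stores the match *length* where the substring copy used to be, so no
// substring is allocated; it can be recovered lazily with substr().
preg_match('/<(\w+)/', '<title>', $m,
           PREG_OFFSET_CAPTURE | PREG_LENGTH_CAPTURE);
// $m[1] === [5, 1]; substr('<title>', $m[1][1], $m[1][0]) === 'title'
```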