Hi Tom, On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote: >0001-invent-rainbow-arcs.patch >0002-recognize-matchall-NFAs.patch
Many thanks for working on the regex engine, this looks like an awesome optimization. To test the correctness of the patches, I thought it would be nice with some real-life regexes, and just as important, some real-life text strings, to which the real-life regexes are applied to. I therefore patched Chromium's v8 regexes engine, to log the actual regexes that get compiled when visiting websites, and also the text strings that are the regexes are applied to during run-time when the regexes are executed. I logged the regex and text strings as base64 encoded strings to STDOUT, to make it easy to grep out the data, so it could be imported into PostgreSQL for analytics. In total, I scraped the first-page of some ~50k websites, which produced 45M test rows to import, which when GROUP BY pattern and flags was reduced down to 235k different regex patterns, and 1.5M different text string subjects. Here are some statistics on the different flags used: SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC; flags | count | sum -------+--------+-------- | 150097 | 235204 i | 43537 | 235204 g | 22029 | 235204 gi | 15416 | 235204 gm | 2411 | 235204 gim | 602 | 235204 m | 548 | 235204 im | 230 | 235204 y | 193 | 235204 gy | 60 | 235204 giy | 29 | 235204 giu | 26 | 235204 u | 11 | 235204 iy | 6 | 235204 gu | 5 | 235204 gimu | 2 | 235204 iu | 1 | 235204 my | 1 | 235204 (18 rows) As we can see, no flag at all is the most common, followed by the "i" flag. Most of the Javascript-regexes (97%) could be understood by PostgreSQL, only 3% produced some kind of error, which is not unexpected, since some Javascript-regex features like \w and \W have different syntax in PostgreSQL: SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC; is_match | error | count | sum ----------+---------------------------------------------------------------+--------+--------- f | | 973987 | 1489489 t | | 474225 | 1489489 | invalid regular expression: invalid escape \ sequence | 39141 | 1489489 | invalid regular expression: invalid character range | 898 | 1489489 | invalid regular expression: invalid backreference number | 816 | 1489489 | invalid regular expression: brackets [] not balanced | 327 | 1489489 | invalid regular expression: invalid repetition count(s) | 76 | 1489489 | invalid regular expression: quantifier operand invalid | 17 | 1489489 | invalid regular expression: parentheses () not balanced | 1 | 1489489 | invalid regular expression: regular expression is too complex | 1 | 1489489 (10 rows) Having had some fun looking at statistics, let's move on to look at if there are any observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f) and when these two patches have been applied. To detect any differences, for each (regex pattern, text string subject) pair, the columns, is_match boolean captured text[] error text were set by a PL/pgSQL function running HEAD: BEGIN _is_match := _subject ~ _pattern; _captured := regexp_match(_subject, _pattern); EXCEPTION WHEN OTHERS THEN UPDATE subjects SET error = SQLERRM WHERE subject_id = _subject_id; CONTINUE; END; UPDATE subjects SET is_match = _is_match, captured = _captured WHERE subject_id = _subject_id; The patches 0001-invent-rainbow-arcs.patch 0002-recognize-matchall-NFAs.patch were then applied and this query was executed to spot any differences: SELECT is_match <> (subject ~ pattern) AS is_match_diff, captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff, COUNT(*) FROM subjects WHERE error IS NULL AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern)) GROUP BY 1,2 ORDER BY 3 DESC ; The query was first run on the unpatched HEAD to verify it detects no results. 0 rows indeed, and it took this long to finish the query: Time: 186077.866 ms (03:06.078) Running the same query with the two patches, was significantly faster: Time: 111785.735 ms (01:51.786) No is_match differences were detected, good! However, there were 23 cases where what got captured differed: -[ RECORD 1 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$ subject | v-cloak is_match_head | t captured_head | {cloak,NULL,NULL} is_match_patch | t captured_patch | {NULL,NULL,v-cloak} -[ RECORD 2 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$ subject | v-if is_match_head | t captured_head | {if,NULL,NULL} is_match_patch | t captured_patch | {NULL,NULL,v-if} -[ RECORD 3 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?a5oc.com).* subject | https://a5oc.com/attachments/6b184e79-6a7f-43e0-ac59-7ed9d0a8eb7e-jpeg.179582/ is_match_head | t captured_head | {https://,a5oc.com,NULL <https://%2Ca5oc.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 4 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?allfordmustangs.com).* subject | https://allfordmustangs.com/attachments/e463e329-0397-4e13-ad41-f30c6bc0659e-jpeg.779299/ is_match_head | t captured_head | {https://,allfordmustangs.com,NULL <https://%2Callfordmustangs.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 5 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?audi-forums.com).* subject | https://audi-forums.com/attachments/screenshot_20210207-151100_ebay-jpg.11506/ is_match_head | t captured_head | {https://,audi-forums.com,NULL <https://%2Caudi-forums.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 6 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?can-amforum.com).* subject | https://can-amforum.com/attachments/resized_20201214_163325-jpeg.101395/ is_match_head | t captured_head | {https://,can-amforum.com,NULL <https://%2Ccan-amforum.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 7 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?contractortalk.com).* subject | https://contractortalk.com/attachments/maryann-porch-roof-quote-12feb2021-jpg.508976/ is_match_head | t captured_head | {https://,contractortalk.com,NULL <https://%2Ccontractortalk.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 8 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?halloweenforum.com).* subject | https://halloweenforum.com/attachments/dead-fred-head-before-and-after-jpg.744080/ is_match_head | t captured_head | {https://,halloweenforum.com,NULL <https://%2Challoweenforum.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 9 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?horseforum.com).* subject | https://horseforum.com/attachments/dd90f089-9ae9-4521-98cd-27bda9ad38e9-jpeg.1109329/ is_match_head | t captured_head | {https://,horseforum.com,NULL <https://%2Chorseforum.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 10 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?passatworld.com).* subject | https://passatworld.com/attachments/clean-passat-jpg.102337/ is_match_head | t captured_head | {https://,passatworld.com,NULL <https://%2Cpassatworld.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 11 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?plantedtank.net).* subject | https://plantedtank.net/attachments/brendon-60p-jpg.1026075/ is_match_head | t captured_head | {https://,plantedtank.net,NULL <https://%2Cplantedtank.net%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 12 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).* subject | https://vauxhallownersnetwork.co.uk/attachments/opelnavi-jpg.96639/ is_match_head | t captured_head | {https://,vauxhallownersnetwork.co.uk,NULL <https://%2Cvauxhallownersnetwork.co.uk%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 13 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?volvov40club.com).* subject | https://volvov40club.com/attachments/img_20210204_164157-jpg.17356/ is_match_head | t captured_head | {https://,volvov40club.com,NULL <https://%2Cvolvov40club.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 14 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?vwidtalk.com).* subject | https://vwidtalk.com/attachments/1613139846689-png.1469/ is_match_head | t captured_head | {https://,vwidtalk.com,NULL <https://%2Cvwidtalk.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 15 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^.*://)?((www.)?yellowbullet.com).* subject | https://yellowbullet.com/attachments/20210211_133934-jpg.204604/ is_match_head | t captured_head | {https://,yellowbullet.com,NULL <https://%2Cyellowbullet.com%2Cnull/>} is_match_patch | t captured_patch | -[ RECORD 16 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^[^\?]*)?(\?[^#]*)?(#.*$)? subject | https://www.disneyonice.com/oneIdResponder.html is_match_head | t captured_head | {https://www.disneyonice.com/oneIdResponder.html,NULL,NULL} is_match_patch | t captured_patch | -[ RECORD 17 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)? subject | / is_match_head | t captured_head | {/,NULL} is_match_patch | t captured_patch | -[ RECORD 18 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)? subject | /en.html is_match_head | t captured_head | {/en,.html} is_match_patch | t captured_patch | -[ RECORD 19 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)? subject | https://e.echatsoft.com/mychat/visitor is_match_head | t captured_head | {https://,e.echatsoft.com,e.echatsoft.com,NULL,e.echatsoft.com,NULL,NULL,/mychat/visitor <https://%2Ce.echatsoft.com%2Ce.echatsoft.com%2Cnull%2Ce.echatsoft.com%2Cnull%2Cnull%2C/mychat/visitor>,""} is_match_patch | t captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""} -[ RECORD 20 ]-+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ---------------------------------------- pattern | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$ subject | wdef.com is_match_head | t captured_head | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL} is_match_patch | t captured_patch | -[ RECORD 21 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | ^((^\w+:|^)\/\/)?(?:www\.)? subject | https://www.deputy.com/ is_match_head | t captured_head | {https://,https <https://%2Chttps/>:} is_match_patch | t captured_patch | -[ RECORD 22 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | ^((^\w+:|^)\/\/)?(?:www\.)? subject | https://www.westernsydney.edu.au/ is_match_head | t captured_head | {https://,https <https://%2Chttps/>:} is_match_patch | t captured_patch | -[ RECORD 23 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- pattern | ^(https?:){0,1}\/\/| subject | https://ui.powerreviews.com/api/ is_match_head | t captured_head | {https:} is_match_patch | t captured_patch | {NULL} The code to reproduce the results have been pushed here: https://github.com/truthly/regexes-in-the-wild Let me know if you want access to the dataset, I could open up a port to my PostgreSQL so you could take a dump. SELECT pg_size_pretty(pg_relation_size('patterns')) AS patterns, pg_size_pretty(pg_relation_size('subjects')) AS subjects; patterns | subjects ----------+---------- 20 MB | 568 MB (1 row) /Joel