On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild 
dataset.

Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:

This 100577 bytes big regex (pattern_id = 207811)...

\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... 
|wmflabs\.org|yolasite\.com|za\.net|za\.org)$

...previously raised...

    error invalid regular expression: regular expression is too complex

...but now goes through:

is_match <NULL> => t
captured <NULL> => {de}
error invalid regular expression: regular expression is too complex => <NULL>

Nice. The patched regex engine is apparently capable of handling even more 
complex regexes than before.

The test that found the deviation tests each (pattern, text string) 
individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.

Below is the result of the performance test query:

\timing

SELECT
  tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
  tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, 
patterns.pattern),
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = 
tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel

Reply via email to