Re: Some regular-expression performance hacking

Tom Lane Thu, 18 Feb 2021 10:11:08 -0800

"Joel Jacobson" <j...@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.


Again, thanks for collecting this data!  I'm a little confused about
how you produced the results in the "tests" table, though.  It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.  Even discounting
that, I'm not getting quite the same results, and I don't understand
why not.  So how was that made from the raw "patterns" and "subjects"
tables?

> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 
> 11.0.3 (clang-1103.0.32.62), 64-bit
> Time: 758514.703 ms (12:38.515)
> Time: 755883.600 ms (12:35.884)
> Time: 746522.107 ms (12:26.522)
> 
> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang 
> version 12.0.0 (clang-1200.0.32.29), 64-bit
> HEAD (4e703d671)
> Time: 519620.646 ms (08:39.621)
> Time: 518998.366 ms (08:38.998)
> Time: 519696.129 ms (08:39.696)

Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD.  I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan).  Still, the next drop to circa 2:21 runtime is impressive enough
by itself.

> Heh, what a funny coincidence:
> The regex I used to shrink the very-long-pattern,
> actually happens to run a lot faster with the patches.

Yeah, that just happens to be a poster child for the MATCHALL idea:

> EXPLAIN ANALYZE SELECT 
> regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one.  That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.

                        regards, tom lane

Re: Some regular-expression performance hacking

Reply via email to