Re: Some regular-expression performance hacking

2021-03-06 Thread Noah Misch
On Sat, Feb 13, 2021 at 06:19:34PM +0100, Joel Jacobson wrote: > To test the correctness of the patches, > I thought it would be nice with some real-life regexes, > and just as important, some real-life text strings, > to which the real-life regexes are applied to. > > I therefore patched Chromium

Re: Some regular-expression performance hacking

2021-03-05 Thread Joel Jacobson
On Fri, Feb 26, 2021, at 19:55, Tom Lane wrote: > "Joel Jacobson" writes: > > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: > >> 0007-smarter-regex-allocation-2.patch > > > I've successfully tested this patch. > > Cool, thanks for testing! I thought it would be interesting to see if any diffe

Re: Some regular-expression performance hacking

2021-02-26 Thread Tom Lane
"Joel Jacobson" writes: > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: >> 0007-smarter-regex-allocation-2.patch > I've successfully tested this patch. Cool, thanks for testing! > That's a 29% speed-up compared to HEAD! Truly amazing. Hmm, I'm still only seeing about 10% or a little better.

Re: Some regular-expression performance hacking

2021-02-26 Thread Joel Jacobson
Hi, On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: > 0007-smarter-regex-allocation-2.patch I've successfully tested this patch. I had to re-create the performance_test table since some cases the previously didn't give an error, now gives error "invalid regular expression: invalid character rang

Re: Some regular-expression performance hacking

2021-02-25 Thread Tom Lane
I wrote: > So I rearranged things to allocate arcs out of a common pool, and for > good measure made the state allocation code do the same thing. I was > pretty much blown away by the results: not only is the average-case > space usage about half what it is on HEAD, but the worst-case drops > by w

Re: Some regular-expression performance hacking

2021-02-25 Thread Tom Lane
I wrote: > However, in a different line of thought, I realized that the > memory allocation logic could use some polishing. It gives out > ten arcs per NFA state initially, and then adds ten more at a time. > However, that's not very bright when you look at the actual usage > patterns, because mos

Re: Some regular-expression performance hacking

2021-02-24 Thread Tom Lane
I wrote: > On my machine, the combination of these two ideas reduces the > runtime of the example above from ~150 seconds to ~53 seconds, > or nearly 3x better. I see something like a 2% improvement on > Joel's test corpus, which might just be noise. So this isn't > any sort of universal panacea,

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Here's another little piece of regex performance hacking. This is based on looking at a slow regexp I found in Tcl's bug tracker: -- Adapted from http://core.tcl.tk/tcl/tktview?name=446565 select regexp_matches( repeat(' 123 345 123 ', 10), '))*?(doubleclick|flycast|burstnet|spylog)+?.*?'

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > On 2021-02-23 13:09:18 -0500, Tom Lane wrote: >> Oddly, I see no such warning with Fedora's current compiler, >> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) >> Are you using any special compiler switches? > A few. At first I didn't see any relevant ones - but I th

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > On 2021-02-23 12:52:28 -0500, Tom Lane wrote: >> ... It is annoying to have to expend >> an always-on check for a can't-happen case, though. > Wouldn't quite work like that because of the restrictions of what pg > infrastructure we want to expose the regex engine to, but a

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
On 2021-02-23 13:09:18 -0500, Tom Lane wrote: > Andres Freund writes: > > One of the recent commits have introduce a new warning with gcc 10, when > > building with optimizations: > > Oddly, I see no such warning with Fedora's current compiler, > gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > One of the recent commits have introduce a new warning with gcc 10, when > building with optimizations: Oddly, I see no such warning with Fedora's current compiler, gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) Are you using any special compiler switches?

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
Hi, On 2021-02-23 12:52:28 -0500, Tom Lane wrote: > I wrote: > > Hmph. There's an "assert(depth >= 0)" immediately in front of that, > > so I'm not looking too kindly on the compiler thinking it's smarter > > than I am. Do you have a suggestion on how to shut it up? gcc can't see the assert tho

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
I wrote: > Hmph. There's an "assert(depth >= 0)" immediately in front of that, > so I'm not looking too kindly on the compiler thinking it's smarter > than I am. Do you have a suggestion on how to shut it up? On reflection, maybe the thing to do is convert the assert into an always-on check, "if

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > One of the recent commits have introduce a new warning with gcc 10, when > building with optimizations: > In file included from > /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304: > /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function > ‘chec

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
Hi, One of the recent commits have introduce a new warning with gcc 10, when building with optimizations: In file included from /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304: /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function ‘checkmatchall’: /home/andres/src/p

Re: Some regular-expression performance hacking

2021-02-20 Thread Tom Lane
Chapman Flack writes: > On 02/19/21 10:26, Tom Lane wrote: >> Oooh, that's very interesting. I guess the advantage of that over using >> the 's' flag is that you can have different behaviors at different places >> in the same regex. > Perl, Python, and Java (at least) all have a common syntax f

Re: Some regular-expression performance hacking

2021-02-20 Thread Chapman Flack
On 02/19/21 10:26, Tom Lane wrote: >> "foo\nbar".match(/([\w\W]+)/)[1]; >> "foo >> bar" > > Oooh, that's very interesting. I guess the advantage of that over using > the 's' flag is that you can have different behaviors at different places > in the same regex. Perl, Python, and Java (at least)

Re: Some regular-expression performance hacking

2021-02-20 Thread Joel Jacobson
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote: >"Joel Jacobson" writes: >> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >>> (Having said that, I can't help noticing that a very large fraction >>> of those usages look like, eg, "[\w\W]". It seems to me that that's >>> a very expensive and unwi

Re: Some regular-expression performance hacking

2021-02-19 Thread Tom Lane
"Joel Jacobson" writes: > On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >> (Having said that, I can't help noticing that a very large fraction >> of those usages look like, eg, "[\w\W]". It seems to me that that's >> a very expensive and unwieldy way to spell ".". Am I missing >> something abo

Re: Some regular-expression performance hacking

2021-02-19 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >(Having said that, I can't help noticing that a very large fraction >of those usages look like, eg, "[\w\W]". It seems to me that that's >a very expensive and unwieldy way to spell ".". Am I missing >something about what that does in Javascript?)

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote: >Hmm ... I might be misunderstanding, but I think our engine already >does a version of this. See the discussion of "colors" in >src/backend/regex/README. Thanks, I will read it with great interest. >Maybe. In practice the actual scanning tends to

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
"Joel Jacobson" writes: > Let's see if I can explain the idea: > One of the problems with representing regexes with large bracket range > expressions, like [a-z], > is you get an explosion of edges, if the model can only represent state > transitions for single characters. > If we could instead

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote: >Like you said earlier, perhaps the regex engine has been optimized enough for >this time. >If not, you want to investigate an additional idea, In the above sentence, I meant "you _may_ want to". I'm not at all sure these idea are applicable in

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote: > "Joel Jacobson" writes: > >> I've produced a new dataset which now also includes the regex flags (if > >> any) used for each subject applied to a pattern. > > Again, thanks for collecting this data! I'm a little confused about > how you produced

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
I thought it was worth looking a little more closely at the error cases in this set of tests, as a form of competition analysis versus Javascript's regex engine. I ran through the cases that gave errors, and pinned down exactly what was causing the error for as many cases as I could. (These resul

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
"Joel Jacobson" writes: >> I've produced a new dataset which now also includes the regex flags (if >> any) used for each subject applied to a pattern. Again, thanks for collecting this data! I'm a little confused about how you produced the results in the "tests" table, though. It sort of looks

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote: >SELECT * FROM vdeviations; >-[ RECORD 1 >]+--- >pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars >... abs\.org|

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote: > Attached is an updated patch series; it's rebased over 4e703d671 > which took care of some not-really-related fixes, and I made a > pass of cleanup and comment improvements. I think this is pretty > much ready to commit, unless you want to do more

Re: Some regular-expression performance hacking

2021-02-17 Thread Tom Lane
"Joel Jacobson" writes: > I've tested all 4 patches successfully. Thanks! I found one other area that could be improved with the same idea of getting rid of unnecessary subre's: right now, every pair of capturing parentheses gives rise to a "capture" subre with an NFA identical to its single chi

Re: Some regular-expression performance hacking

2021-02-15 Thread Joel Jacobson
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote: >I got these runtimes (non-cassert builds): > >HEAD 313661.149 ms (05:13.661) >+0001 297397.293 ms (04:57.397) 5% better than HEAD >+0002 151995.803 ms (02:31.996) 51% better than HEAD >+0003 139843.934 ms (02:19.844) 55% better than HEAD >+0004 95034

Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson" writes: > Below is the result of the performance test query: > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a: > Time: 592196.145 ms (09:52.196) > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches: > Time: 461739.364 ms (07:41.739) > That's an impressive 22% speed-up! I've been doi

Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson" writes: > I've successfully tested both patches against the 1.5M regexes-in-the-wild > dataset. > Out of the 1489489 (pattern, text string) pairs tested, > there was only one single deviation: > This 100577 bytes big regex (pattern_id = 207811)... > ... > ...previously raised... >

Re: Some regular-expression performance hacking

2021-02-14 Thread Joel Jacobson
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote: >0001-invent-rainbow-arcs-2.patch >0002-recognize-matchall-NFAs-2.patch I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset. Out of the 1489489 (pattern, text string) pairs tested, there was only one single deviation:

Re: Some regular-expression performance hacking

2021-02-13 Thread Tom Lane
"Joel Jacobson" writes: > No is_match differences were detected, good! > However, there were 23 cases where what got captured differed: These all stem from the same oversight: checkmatchall() was being too cavalier by ignoring "pseudocolor" arcs, which are arcs that match start-of-string or end-o

Re: Some regular-expression performance hacking

2021-02-13 Thread Tom Lane
"Joel Jacobson" writes: > In total, I scraped the first-page of some ~50k websites, > which produced 45M test rows to import, > which when GROUP BY pattern and flags was reduced > down to 235k different regex patterns, > and 1.5M different text string subjects. This seems like an incredibly usefu

Re: Some regular-expression performance hacking

2021-02-13 Thread Joel Jacobson
Hi Tom, On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote: >0001-invent-rainbow-arcs.patch >0002-recognize-matchall-NFAs.patch Many thanks for working on the regex engine, this looks like an awesome optimization. To test the correctness of the patches, I thought it would be nice with some real-life

Some regular-expression performance hacking

2021-02-10 Thread Tom Lane
As I mentioned in connection with adding the src/test/modules/test_regex test code, I've been fooling with some performance improvements to our regular expression engine. Here's the first fruits of that labor. This is mostly concerned with cutting the overhead for handling trivial unconstrained pa