Re: Some regular-expression performance hacking

2021-03-06 Thread Noah Misch
On Sat, Feb 13, 2021 at 06:19:34PM +0100, Joel Jacobson wrote: > To test the correctness of the patches, > I thought it would be nice with some real-life regexes, > and just as important, some real-life text strings, > to which the real-life regexes are applied to. > > I therefore patched Chromium

Re: Some regular-expression performance hacking

2021-03-05 Thread Joel Jacobson
On Fri, Feb 26, 2021, at 19:55, Tom Lane wrote: > "Joel Jacobson" writes: > > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: > >> 0007-smarter-regex-allocation-2.patch > > > I've successfully tested this patch. > > Cool, thanks for testing! I thought it would be interesting to see if any diffe

Re: Some regular-expression performance hacking

2021-02-26 Thread Tom Lane
"Joel Jacobson" writes: > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: >> 0007-smarter-regex-allocation-2.patch > I've successfully tested this patch. Cool, thanks for testing! > That's a 29% speed-up compared to HEAD! Truly amazing. Hmm, I'm still only seeing about 10% or a little better.

Re: Some regular-expression performance hacking

2021-02-26 Thread Joel Jacobson
Hi, On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote: > 0007-smarter-regex-allocation-2.patch I've successfully tested this patch. I had to re-create the performance_test table since some cases the previously didn't give an error, now gives error "invalid regular expression: invalid character rang

Re: Some regular-expression performance hacking

2021-02-25 Thread Tom Lane
I wrote: > So I rearranged things to allocate arcs out of a common pool, and for > good measure made the state allocation code do the same thing. I was > pretty much blown away by the results: not only is the average-case > space usage about half what it is on HEAD, but the worst-case drops > by w

Re: Some regular-expression performance hacking

2021-02-25 Thread Tom Lane
I wrote: > However, in a different line of thought, I realized that the > memory allocation logic could use some polishing. It gives out > ten arcs per NFA state initially, and then adds ten more at a time. > However, that's not very bright when you look at the actual usage > patterns, because mos

Re: Some regular-expression performance hacking

2021-02-24 Thread Tom Lane
I wrote: > On my machine, the combination of these two ideas reduces the > runtime of the example above from ~150 seconds to ~53 seconds, > or nearly 3x better. I see something like a 2% improvement on > Joel's test corpus, which might just be noise. So this isn't > any sort of universal panacea,

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Here's another little piece of regex performance hacking. This is based on looking at a slow regexp I found in Tcl's bug tracker: -- Adapted from http://core.tcl.tk/tcl/tktview?name=446565 select regexp_matches( repeat(' 123 345 123 ', 10), '))*?(doubleclick|flycast|burstnet|spylog)+?.*?'

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > On 2021-02-23 13:09:18 -0500, Tom Lane wrote: >> Oddly, I see no such warning with Fedora's current compiler, >> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) >> Are you using any special compiler switches? > A few. At first I didn't see any relevant ones - but I th

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > On 2021-02-23 12:52:28 -0500, Tom Lane wrote: >> ... It is annoying to have to expend >> an always-on check for a can't-happen case, though. > Wouldn't quite work like that because of the restrictions of what pg > infrastructure we want to expose the regex engine to, but a

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
On 2021-02-23 13:09:18 -0500, Tom Lane wrote: > Andres Freund writes: > > One of the recent commits have introduce a new warning with gcc 10, when > > building with optimizations: > > Oddly, I see no such warning with Fedora's current compiler, > gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > One of the recent commits have introduce a new warning with gcc 10, when > building with optimizations: Oddly, I see no such warning with Fedora's current compiler, gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) Are you using any special compiler switches?

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
Hi, On 2021-02-23 12:52:28 -0500, Tom Lane wrote: > I wrote: > > Hmph. There's an "assert(depth >= 0)" immediately in front of that, > > so I'm not looking too kindly on the compiler thinking it's smarter > > than I am. Do you have a suggestion on how to shut it up? gcc can't see the assert tho

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
I wrote: > Hmph. There's an "assert(depth >= 0)" immediately in front of that, > so I'm not looking too kindly on the compiler thinking it's smarter > than I am. Do you have a suggestion on how to shut it up? On reflection, maybe the thing to do is convert the assert into an always-on check, "if

Re: Some regular-expression performance hacking

2021-02-23 Thread Tom Lane
Andres Freund writes: > One of the recent commits have introduce a new warning with gcc 10, when > building with optimizations: > In file included from > /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304: > /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function > ‘chec

Re: Some regular-expression performance hacking

2021-02-23 Thread Andres Freund
Hi, One of the recent commits have introduce a new warning with gcc 10, when building with optimizations: In file included from /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304: /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function ‘checkmatchall’: /home/andres/src/p

Re: Some regular-expression performance hacking

2021-02-20 Thread Tom Lane
Chapman Flack writes: > On 02/19/21 10:26, Tom Lane wrote: >> Oooh, that's very interesting. I guess the advantage of that over using >> the 's' flag is that you can have different behaviors at different places >> in the same regex. > Perl, Python, and Java (at least) all have a common syntax f

Re: Some regular-expression performance hacking

2021-02-20 Thread Chapman Flack
On 02/19/21 10:26, Tom Lane wrote: >> "foo\nbar".match(/([\w\W]+)/)[1]; >> "foo >> bar" > > Oooh, that's very interesting. I guess the advantage of that over using > the 's' flag is that you can have different behaviors at different places > in the same regex. Perl, Python, and Java (at least)

Re: Some regular-expression performance hacking

2021-02-20 Thread Joel Jacobson
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote: >"Joel Jacobson" writes: >> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >>> (Having said that, I can't help noticing that a very large fraction >>> of those usages look like, eg, "[\w\W]". It seems to me that that's >>> a very expensive and unwi

Re: Some regular-expression performance hacking

2021-02-19 Thread Tom Lane
"Joel Jacobson" writes: > On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >> (Having said that, I can't help noticing that a very large fraction >> of those usages look like, eg, "[\w\W]". It seems to me that that's >> a very expensive and unwieldy way to spell ".". Am I missing >> something abo

Re: Some regular-expression performance hacking

2021-02-19 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote: >(Having said that, I can't help noticing that a very large fraction >of those usages look like, eg, "[\w\W]". It seems to me that that's >a very expensive and unwieldy way to spell ".". Am I missing >something about what that does in Javascript?)

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote: >Hmm ... I might be misunderstanding, but I think our engine already >does a version of this. See the discussion of "colors" in >src/backend/regex/README. Thanks, I will read it with great interest. >Maybe. In practice the actual scanning tends to

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
"Joel Jacobson" writes: > Let's see if I can explain the idea: > One of the problems with representing regexes with large bracket range > expressions, like [a-z], > is you get an explosion of edges, if the model can only represent state > transitions for single characters. > If we could instead

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote: >Like you said earlier, perhaps the regex engine has been optimized enough for >this time. >If not, you want to investigate an additional idea, In the above sentence, I meant "you _may_ want to". I'm not at all sure these idea are applicable in

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote: > "Joel Jacobson" writes: > >> I've produced a new dataset which now also includes the regex flags (if > >> any) used for each subject applied to a pattern. > > Again, thanks for collecting this data! I'm a little confused about > how you produced

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
I thought it was worth looking a little more closely at the error cases in this set of tests, as a form of competition analysis versus Javascript's regex engine. I ran through the cases that gave errors, and pinned down exactly what was causing the error for as many cases as I could. (These resul

Re: Some regular-expression performance hacking

2021-02-18 Thread Tom Lane
"Joel Jacobson" writes: >> I've produced a new dataset which now also includes the regex flags (if >> any) used for each subject applied to a pattern. Again, thanks for collecting this data! I'm a little confused about how you produced the results in the "tests" table, though. It sort of looks

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote: >SELECT * FROM vdeviations; >-[ RECORD 1 >]+--- >pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars >... abs\.org|

Re: Some regular-expression performance hacking

2021-02-18 Thread Joel Jacobson
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote: > Attached is an updated patch series; it's rebased over 4e703d671 > which took care of some not-really-related fixes, and I made a > pass of cleanup and comment improvements. I think this is pretty > much ready to commit, unless you want to do more

Re: Some regular-expression performance hacking

2021-02-17 Thread Tom Lane
"Joel Jacobson" writes: > I've tested all 4 patches successfully. Thanks! I found one other area that could be improved with the same idea of getting rid of unnecessary subre's: right now, every pair of capturing parentheses gives rise to a "capture" subre with an NFA identical to its single chi

Re: Some regular-expression performance hacking

2021-02-15 Thread Joel Jacobson
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote: >I got these runtimes (non-cassert builds): > >HEAD 313661.149 ms (05:13.661) >+0001 297397.293 ms (04:57.397) 5% better than HEAD >+0002 151995.803 ms (02:31.996) 51% better than HEAD >+0003 139843.934 ms (02:19.844) 55% better than HEAD >+0004 95034

Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson" writes: > Below is the result of the performance test query: > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a: > Time: 592196.145 ms (09:52.196) > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches: > Time: 461739.364 ms (07:41.739) > That's an impressive 22% speed-up! I've been doi

Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson" writes: > I've successfully tested both patches against the 1.5M regexes-in-the-wild > dataset. > Out of the 1489489 (pattern, text string) pairs tested, > there was only one single deviation: > This 100577 bytes big regex (pattern_id = 207811)... > ... > ...previously raised... >

Re: Some regular-expression performance hacking

2021-02-14 Thread Joel Jacobson
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote: >0001-invent-rainbow-arcs-2.patch >0002-recognize-matchall-NFAs-2.patch I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset. Out of the 1489489 (pattern, text string) pairs tested, there was only one single deviation:

Re: Some regular-expression performance hacking

2021-02-13 Thread Tom Lane
"Joel Jacobson" writes: > No is_match differences were detected, good! > However, there were 23 cases where what got captured differed: These all stem from the same oversight: checkmatchall() was being too cavalier by ignoring "pseudocolor" arcs, which are arcs that match start-of-string or end-o

Re: Some regular-expression performance hacking

2021-02-13 Thread Tom Lane
"Joel Jacobson" writes: > In total, I scraped the first-page of some ~50k websites, > which produced 45M test rows to import, > which when GROUP BY pattern and flags was reduced > down to 235k different regex patterns, > and 1.5M different text string subjects. This seems like an incredibly usefu

Re: Some regular-expression performance hacking

2021-02-13 Thread Joel Jacobson
Hi Tom, On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote: >0001-invent-rainbow-arcs.patch >0002-recognize-matchall-NFAs.patch Many thanks for working on the regex engine, this looks like an awesome optimization. To test the correctness of the patches, I thought it would be nice with some real-life