On Sat, Feb 13, 2021 at 06:19:34PM +0100, Joel Jacobson wrote:
> To test the correctness of the patches,
> I thought it would be nice with some real-life regexes,
> and just as important, some real-life text strings,
> to which the real-life regexes are applied to.
>
> I therefore patched Chromium
On Fri, Feb 26, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" writes:
> > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
> >> 0007-smarter-regex-allocation-2.patch
>
> > I've successfully tested this patch.
>
> Cool, thanks for testing!
I thought it would be interesting to see if any diffe
"Joel Jacobson" writes:
> On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
>> 0007-smarter-regex-allocation-2.patch
> I've successfully tested this patch.
Cool, thanks for testing!
> That's a 29% speed-up compared to HEAD! Truly amazing.
Hmm, I'm still only seeing about 10% or a little better.
Hi,
On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
> 0007-smarter-regex-allocation-2.patch
I've successfully tested this patch.
I had to re-create the performance_test table
since some cases the previously didn't give an error,
now gives error "invalid regular expression: invalid character rang
I wrote:
> So I rearranged things to allocate arcs out of a common pool, and for
> good measure made the state allocation code do the same thing. I was
> pretty much blown away by the results: not only is the average-case
> space usage about half what it is on HEAD, but the worst-case drops
> by w
I wrote:
> However, in a different line of thought, I realized that the
> memory allocation logic could use some polishing. It gives out
> ten arcs per NFA state initially, and then adds ten more at a time.
> However, that's not very bright when you look at the actual usage
> patterns, because mos
I wrote:
> On my machine, the combination of these two ideas reduces the
> runtime of the example above from ~150 seconds to ~53 seconds,
> or nearly 3x better. I see something like a 2% improvement on
> Joel's test corpus, which might just be noise. So this isn't
> any sort of universal panacea,
Here's another little piece of regex performance hacking. This is based
on looking at a slow regexp I found in Tcl's bug tracker:
-- Adapted from http://core.tcl.tk/tcl/tktview?name=446565
select regexp_matches(
repeat(' 123 345 123 ',
10),
'))*?(doubleclick|flycast|burstnet|spylog)+?.*?'
Andres Freund writes:
> On 2021-02-23 13:09:18 -0500, Tom Lane wrote:
>> Oddly, I see no such warning with Fedora's current compiler,
>> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
>> Are you using any special compiler switches?
> A few. At first I didn't see any relevant ones - but I th
Andres Freund writes:
> On 2021-02-23 12:52:28 -0500, Tom Lane wrote:
>> ... It is annoying to have to expend
>> an always-on check for a can't-happen case, though.
> Wouldn't quite work like that because of the restrictions of what pg
> infrastructure we want to expose the regex engine to, but a
On 2021-02-23 13:09:18 -0500, Tom Lane wrote:
> Andres Freund writes:
> > One of the recent commits have introduce a new warning with gcc 10, when
> > building with optimizations:
>
> Oddly, I see no such warning with Fedora's current compiler,
> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC
Andres Freund writes:
> One of the recent commits have introduce a new warning with gcc 10, when
> building with optimizations:
Oddly, I see no such warning with Fedora's current compiler,
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
Are you using any special compiler switches?
Hi,
On 2021-02-23 12:52:28 -0500, Tom Lane wrote:
> I wrote:
> > Hmph. There's an "assert(depth >= 0)" immediately in front of that,
> > so I'm not looking too kindly on the compiler thinking it's smarter
> > than I am. Do you have a suggestion on how to shut it up?
gcc can't see the assert tho
I wrote:
> Hmph. There's an "assert(depth >= 0)" immediately in front of that,
> so I'm not looking too kindly on the compiler thinking it's smarter
> than I am. Do you have a suggestion on how to shut it up?
On reflection, maybe the thing to do is convert the assert into
an always-on check, "if
Andres Freund writes:
> One of the recent commits have introduce a new warning with gcc 10, when
> building with optimizations:
> In file included from
> /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304:
> /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function
> ‘chec
Hi,
One of the recent commits have introduce a new warning with gcc 10, when
building with optimizations:
In file included from
/home/andres/src/postgresql/src/backend/regex/regcomp.c:2304:
/home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function
‘checkmatchall’:
/home/andres/src/p
Chapman Flack writes:
> On 02/19/21 10:26, Tom Lane wrote:
>> Oooh, that's very interesting. I guess the advantage of that over using
>> the 's' flag is that you can have different behaviors at different places
>> in the same regex.
> Perl, Python, and Java (at least) all have a common syntax f
On 02/19/21 10:26, Tom Lane wrote:
>> "foo\nbar".match(/([\w\W]+)/)[1];
>> "foo
>> bar"
>
> Oooh, that's very interesting. I guess the advantage of that over using
> the 's' flag is that you can have different behaviors at different places
> in the same regex.
Perl, Python, and Java (at least)
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote:
>"Joel Jacobson" writes:
>> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>>> (Having said that, I can't help noticing that a very large fraction
>>> of those usages look like, eg, "[\w\W]". It seems to me that that's
>>> a very expensive and unwi
"Joel Jacobson" writes:
> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>> (Having said that, I can't help noticing that a very large fraction
>> of those usages look like, eg, "[\w\W]". It seems to me that that's
>> a very expensive and unwieldy way to spell ".". Am I missing
>> something abo
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>(Having said that, I can't help noticing that a very large fraction
>of those usages look like, eg, "[\w\W]". It seems to me that that's
>a very expensive and unwieldy way to spell ".". Am I missing
>something about what that does in Javascript?)
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote:
>Hmm ... I might be misunderstanding, but I think our engine already
>does a version of this. See the discussion of "colors" in
>src/backend/regex/README.
Thanks, I will read it with great interest.
>Maybe. In practice the actual scanning tends to
"Joel Jacobson" writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range
> expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state
> transitions for single characters.
> If we could instead
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote:
>Like you said earlier, perhaps the regex engine has been optimized enough for
>this time.
>If not, you want to investigate an additional idea,
In the above sentence, I meant "you _may_ want to".
I'm not at all sure these idea are applicable in
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
> "Joel Jacobson" writes:
> >> I've produced a new dataset which now also includes the regex flags (if
> >> any) used for each subject applied to a pattern.
>
> Again, thanks for collecting this data! I'm a little confused about
> how you produced
I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competition analysis versus
Javascript's regex engine. I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could. (These resul
"Joel Jacobson" writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
>SELECT * FROM vdeviations;
>-[ RECORD 1
>]+---
>pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars
>... abs\.org|
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
> Attached is an updated patch series; it's rebased over 4e703d671
> which took care of some not-really-related fixes, and I made a
> pass of cleanup and comment improvements. I think this is pretty
> much ready to commit, unless you want to do more
"Joel Jacobson" writes:
> I've tested all 4 patches successfully.
Thanks!
I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single chi
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034
"Joel Jacobson" writes:
> Below is the result of the performance test query:
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
> Time: 592196.145 ms (09:52.196)
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
> Time: 461739.364 ms (07:41.739)
> That's an impressive 22% speed-up!
I've been doi
"Joel Jacobson" writes:
> I've successfully tested both patches against the 1.5M regexes-in-the-wild
> dataset.
> Out of the 1489489 (pattern, text string) pairs tested,
> there was only one single deviation:
> This 100577 bytes big regex (pattern_id = 207811)...
> ...
> ...previously raised...
>
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch
I've successfully tested both patches against the 1.5M regexes-in-the-wild
dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
"Joel Jacobson" writes:
> No is_match differences were detected, good!
> However, there were 23 cases where what got captured differed:
These all stem from the same oversight: checkmatchall() was being
too cavalier by ignoring "pseudocolor" arcs, which are arcs that
match start-of-string or end-o
"Joel Jacobson" writes:
> In total, I scraped the first-page of some ~50k websites,
> which produced 45M test rows to import,
> which when GROUP BY pattern and flags was reduced
> down to 235k different regex patterns,
> and 1.5M different text string subjects.
This seems like an incredibly usefu
Hi Tom,
On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
>0001-invent-rainbow-arcs.patch
>0002-recognize-matchall-NFAs.patch
Many thanks for working on the regex engine,
this looks like an awesome optimization.
To test the correctness of the patches,
I thought it would be nice with some real-life
As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine. Here's the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained pa
38 matches
Mail list logo