On Sat, Feb 13, 2021 at 06:19:34PM +0100, Joel Jacobson wrote:
> To test the correctness of the patches,
> I thought it would be nice with some real-life regexes,
> and just as important, some real-life text strings,
> to which the real-life regexes are applied to.
>
> I therefore patched Chromium
On Fri, Feb 26, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" writes:
> > On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
> >> 0007-smarter-regex-allocation-2.patch
>
> > I've successfully tested this patch.
>
> Cool, thanks for testing!
I thought it would be interesting to see if any diffe
"Joel Jacobson" writes:
> On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
>> 0007-smarter-regex-allocation-2.patch
> I've successfully tested this patch.
Cool, thanks for testing!
> That's a 29% speed-up compared to HEAD! Truly amazing.
Hmm, I'm still only seeing about 10% or a little better.
Hi,
On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
> 0007-smarter-regex-allocation-2.patch
I've successfully tested this patch.
I had to re-create the performance_test table
since some cases the previously didn't give an error,
now gives error "invalid regular expression: invalid character rang
I wrote:
> So I rearranged things to allocate arcs out of a common pool, and for
> good measure made the state allocation code do the same thing. I was
> pretty much blown away by the results: not only is the average-case
> space usage about half what it is on HEAD, but the worst-case drops
> by w
I wrote:
> However, in a different line of thought, I realized that the
> memory allocation logic could use some polishing. It gives out
> ten arcs per NFA state initially, and then adds ten more at a time.
> However, that's not very bright when you look at the actual usage
> patterns, because mos
I wrote:
> On my machine, the combination of these two ideas reduces the
> runtime of the example above from ~150 seconds to ~53 seconds,
> or nearly 3x better. I see something like a 2% improvement on
> Joel's test corpus, which might just be noise. So this isn't
> any sort of universal panacea,
Here's another little piece of regex performance hacking. This is based
on looking at a slow regexp I found in Tcl's bug tracker:
-- Adapted from http://core.tcl.tk/tcl/tktview?name=446565
select regexp_matches(
repeat(' 123 345 123 ',
10),
'))*?(doubleclick|flycast|burstnet|spylog)+?.*?'
Andres Freund writes:
> On 2021-02-23 13:09:18 -0500, Tom Lane wrote:
>> Oddly, I see no such warning with Fedora's current compiler,
>> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
>> Are you using any special compiler switches?
> A few. At first I didn't see any relevant ones - but I th
Andres Freund writes:
> On 2021-02-23 12:52:28 -0500, Tom Lane wrote:
>> ... It is annoying to have to expend
>> an always-on check for a can't-happen case, though.
> Wouldn't quite work like that because of the restrictions of what pg
> infrastructure we want to expose the regex engine to, but a
On 2021-02-23 13:09:18 -0500, Tom Lane wrote:
> Andres Freund writes:
> > One of the recent commits have introduce a new warning with gcc 10, when
> > building with optimizations:
>
> Oddly, I see no such warning with Fedora's current compiler,
> gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC
Andres Freund writes:
> One of the recent commits have introduce a new warning with gcc 10, when
> building with optimizations:
Oddly, I see no such warning with Fedora's current compiler,
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
Are you using any special compiler switches?
Hi,
On 2021-02-23 12:52:28 -0500, Tom Lane wrote:
> I wrote:
> > Hmph. There's an "assert(depth >= 0)" immediately in front of that,
> > so I'm not looking too kindly on the compiler thinking it's smarter
> > than I am. Do you have a suggestion on how to shut it up?
gcc can't see the assert tho
I wrote:
> Hmph. There's an "assert(depth >= 0)" immediately in front of that,
> so I'm not looking too kindly on the compiler thinking it's smarter
> than I am. Do you have a suggestion on how to shut it up?
On reflection, maybe the thing to do is convert the assert into
an always-on check, "if
Andres Freund writes:
> One of the recent commits have introduce a new warning with gcc 10, when
> building with optimizations:
> In file included from
> /home/andres/src/postgresql/src/backend/regex/regcomp.c:2304:
> /home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function
> ‘chec
Hi,
One of the recent commits have introduce a new warning with gcc 10, when
building with optimizations:
In file included from
/home/andres/src/postgresql/src/backend/regex/regcomp.c:2304:
/home/andres/src/postgresql/src/backend/regex/regc_nfa.c: In function
‘checkmatchall’:
/home/andres/src/p
Chapman Flack writes:
> On 02/19/21 10:26, Tom Lane wrote:
>> Oooh, that's very interesting. I guess the advantage of that over using
>> the 's' flag is that you can have different behaviors at different places
>> in the same regex.
> Perl, Python, and Java (at least) all have a common syntax f
On 02/19/21 10:26, Tom Lane wrote:
>> "foo\nbar".match(/([\w\W]+)/)[1];
>> "foo
>> bar"
>
> Oooh, that's very interesting. I guess the advantage of that over using
> the 's' flag is that you can have different behaviors at different places
> in the same regex.
Perl, Python, and Java (at least)
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote:
>"Joel Jacobson" writes:
>> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>>> (Having said that, I can't help noticing that a very large fraction
>>> of those usages look like, eg, "[\w\W]". It seems to me that that's
>>> a very expensive and unwi
"Joel Jacobson" writes:
> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>> (Having said that, I can't help noticing that a very large fraction
>> of those usages look like, eg, "[\w\W]". It seems to me that that's
>> a very expensive and unwieldy way to spell ".". Am I missing
>> something abo
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>(Having said that, I can't help noticing that a very large fraction
>of those usages look like, eg, "[\w\W]". It seems to me that that's
>a very expensive and unwieldy way to spell ".". Am I missing
>something about what that does in Javascript?)
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote:
>Hmm ... I might be misunderstanding, but I think our engine already
>does a version of this. See the discussion of "colors" in
>src/backend/regex/README.
Thanks, I will read it with great interest.
>Maybe. In practice the actual scanning tends to
"Joel Jacobson" writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range
> expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state
> transitions for single characters.
> If we could instead
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote:
>Like you said earlier, perhaps the regex engine has been optimized enough for
>this time.
>If not, you want to investigate an additional idea,
In the above sentence, I meant "you _may_ want to".
I'm not at all sure these idea are applicable in
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
> "Joel Jacobson" writes:
> >> I've produced a new dataset which now also includes the regex flags (if
> >> any) used for each subject applied to a pattern.
>
> Again, thanks for collecting this data! I'm a little confused about
> how you produced
I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competition analysis versus
Javascript's regex engine. I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could. (These resul
"Joel Jacobson" writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
>SELECT * FROM vdeviations;
>-[ RECORD 1
>]+---
>pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars
>... abs\.org|
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
> Attached is an updated patch series; it's rebased over 4e703d671
> which took care of some not-really-related fixes, and I made a
> pass of cleanup and comment improvements. I think this is pretty
> much ready to commit, unless you want to do more
"Joel Jacobson" writes:
> I've tested all 4 patches successfully.
Thanks!
I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single chi
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034
"Joel Jacobson" writes:
> Below is the result of the performance test query:
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
> Time: 592196.145 ms (09:52.196)
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
> Time: 461739.364 ms (07:41.739)
> That's an impressive 22% speed-up!
I've been doi
"Joel Jacobson" writes:
> I've successfully tested both patches against the 1.5M regexes-in-the-wild
> dataset.
> Out of the 1489489 (pattern, text string) pairs tested,
> there was only one single deviation:
> This 100577 bytes big regex (pattern_id = 207811)...
> ...
> ...previously raised...
>
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch
I've successfully tested both patches against the 1.5M regexes-in-the-wild
dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
"Joel Jacobson" writes:
> No is_match differences were detected, good!
> However, there were 23 cases where what got captured differed:
These all stem from the same oversight: checkmatchall() was being
too cavalier by ignoring "pseudocolor" arcs, which are arcs that
match start-of-string or end-o
"Joel Jacobson" writes:
> In total, I scraped the first-page of some ~50k websites,
> which produced 45M test rows to import,
> which when GROUP BY pattern and flags was reduced
> down to 235k different regex patterns,
> and 1.5M different text string subjects.
This seems like an incredibly usefu
Hi Tom,
On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
>0001-invent-rainbow-arcs.patch
>0002-recognize-matchall-NFAs.patch
Many thanks for working on the regex engine,
this looks like an awesome optimization.
To test the correctness of the patches,
I thought it would be nice with some real-life
37 matches
Mail list logo