Btw: if you still run out of cache later with more regexes, maybe it makes
sense to do prefiltering first by building a single gigantic regex as
string_agg('('||name_matches||')', '|') and then only filtering the rules that
match later. If PostgreSQL exposes the capturing groups you may even be able
to explode the result without a postfilter.

On Wed, Aug 25, 2021 at 14:22 Jack Christensen <j...@jncsoftware.com> wrote:

> The optimizer was a bit too clever. It used the same plan for the LEFT
> JOIN. But that put me on the right track. I tried a LATERAL join. But the
> optimizer saw through that too and used the same plan. So I tried a
> materialized CTE and that finally forced it to use a different plan. That
> made it run in ~70ms -- about 18x faster. Thanks!
>
> explain analyze
> with r as materialized (
>   select * from matching_rules
>   where id >= 0 and id < 60
> )
> select r.id, i.id
> from r
>   join items i on i.name ~ r.name_matches
> ;
>
>                                                      QUERY PLAN
>
> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>  Nested Loop  (cost=2.78..714.20 rows=230 width=8) (actual
> time=0.071..69.545 rows=702 loops=1)
>    Join Filter: (i.name ~ r.name_matches)
>    Rows Removed by Join Filter: 45298
>    CTE r
>      ->  Seq Scan on matching_rules  (cost=0.00..2.78 rows=46 width=26)
> (actual time=0.007..0.047 rows=46 loops=1)
>            Filter: ((id >= 0) AND (id < 60))
>            Rows Removed by Filter: 6
>    ->  CTE Scan on r  (cost=0.00..0.92 rows=46 width=36) (actual
> time=0.008..0.090 rows=46 loops=1)
>    ->  Materialize  (cost=0.00..23.00 rows=1000 width=27) (actual
> time=0.000..0.081 rows=1000 loops=46)
>          ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27)
> (actual time=0.003..0.092 rows=1000 loops=1)
>  Planning Time: 0.206 ms
>  Execution Time: 69.633 ms
>
>
> On Wed, Aug 25, 2021 at 4:05 PM Justin Pryzby <pry...@telsasoft.com>
> wrote:
>
>> On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
>> > I have items that need to be categorized by user defined matching rules.
>> > Trusted users can create rules that include regular expressions. I've
>> > reduced the problem to this example.
>>
>> > I use the following query to find matches:
>> >
>> > select r.id, i.id
>> > from items i
>> >   join matching_rules r on i.name ~ r.name_matches;
>> >
>> > When there are few rules the query runs quickly. But as the number of
>> rules
>> > increases the runtime often increases at a greater than linear rate.
>>
>> Maybe it's because the REs are cached by RE_compile_and_cache(), but if
>> you
>> loop over the REs in the inner loop, then the caching is ineffective.
>>
>> Maybe you can force it to join with REs on the outer loop by writing it
>> as:
>> | rules LEFT JOIN items WHERE rules.id IS NOT NULL,
>> ...to improve performance, or at least test that theory.
>>
>> --
>> Justin
>>
>
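For reference, the LEFT JOIN rewrite Justin suggests above might look something like this (a sketch only; as Jack found, the planner may flatten it back into the same plan):

```sql
-- Force rules onto the outer side of the nested loop so each compiled
-- regex stays hot in the RE cache across the inner scan of items.
select r.id, i.id
from matching_rules r
  left join items i on i.name ~ r.name_matches
where r.id is not null;  -- dummy predicate to keep the LEFT JOIN shape
```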