Re: Bloom filters and the planner / parallel execution

Rafia Sabih Tue, 05 Nov 2024 05:30:12 -0800

This looks interesting. Looking at the costs, clearly bloom index is too
costly as per planner's estimate.
Just skimming through the code, this line in blcostestimate,


 /* We have to visit all index tuples anyway */
costs.numIndexTuples = index->tuples;

looks like one of the reasons for this behaviour.


On Tue, 8 Oct 2024 at 13:56, Daniel Westermann (DWE) <
daniel.westerm...@dbi-services.com> wrote:

> Hi,
>
> previous discussion at [1].
>
> Initially I wanted to send this to the docs mailing list, but while
> testing this further I think this is more an issue with the planner.
>
> The example(s) in the docs for bloom filters (
> https://www.postgresql.org/docs/current/bloom.html) cannot be reproduced
> at all (and the size of the indexes are far off the truth as well).
>
> postgres=# select version();
>                                            version
>
>
> ----------------------------------------------------------------------------------------------
>  PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian
> 12.2.0-14) 12.2.0, 64-bit
> (1 row)
>
> postgres=# CREATE TABLE tbloom AS
>    SELECT
>      (random() * 1000000)::int as i1,
>      (random() * 1000000)::int as i2,
>      (random() * 1000000)::int as i3,
>      (random() * 1000000)::int as i4,
>      (random() * 1000000)::int as i5,
>      (random() * 1000000)::int as i6
>    FROM
>   generate_series(1,10000000);
> SELECT 10000000
> postgres=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4,
> i5, i6);
> CREATE INDEX
> postgres=# analyze tbloom ;
> ANALYZE
> postgres=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
>  pg_size_pretty
> ----------------
>  153 MB
> (1 row)
>
> The following statement (from the docs as well) will never use the bloom
> index but will go for parallel execution:
>
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 =
> 123451;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..127244.10 rows=1 width=24) (actual
> time=155.618..155.652 rows=0 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tbloom  (cost=0.00..126244.00 rows=1 width=24)
> (actual time=145.353..145.354 rows=0 loops=3)
>          Filter: ((i2 = 898732) AND (i5 = 123451))
>          Rows Removed by Filter: 3333333
>  Planning Time: 0.241 ms
>  Execution Time: 155.673 ms
> (8 rows)
>
> The index will be used if we make parallel execution a lot more costly:
>
> postgres=# set parallel_setup_cost = 100000;
> SET
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 =
> 123451;
>                                                         QUERY PLAN
>
>
> ---------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on tbloom  (cost=178436.00..178440.02 rows=1 width=24)
> (actual time=31.364..31.365 rows=0 loops=1)
>    Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
>    Rows Removed by Index Recheck: 2424
>    Heap Blocks: exact=2372
>    ->  Bitmap Index Scan on bloomidx  (cost=0.00..178436.00 rows=1
> width=0) (actual time=26.387..26.387 rows=2424 loops=1)
>          Index Cond: ((i2 = 898732) AND (i5 = 123451))
>  Planning Time: 0.210 ms
>  Execution Time: 31.511 ms
> (8 rows)
>
> postgres=# reset parallel_setup_cost ;
> RESET
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 =
> 123451;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..127244.10 rows=1 width=24) (actual
> time=150.802..150.835 rows=0 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tbloom  (cost=0.00..126244.00 rows=1 width=24)
> (actual time=143.902..143.903 rows=0 loops=3)
>          Filter: ((i2 = 898732) AND (i5 = 123451))
>          Rows Removed by Filter: 3333333
>  Planning Time: 0.172 ms
>  Execution Time: 150.865 ms
> (8 rows)
>
> The reason I am sending this to this list is the execution time. The plan
> using the index is faster than the plan using parallel execution, but it is
> not the one executed in the default configuration. Am I missing something?
>
> Here is another example where the index should clearly win:
>
> postgres=# create table tdummy as select i as a, i as b, i as c, i as d, i
> as e, i as f, i as g, i as h from generate_series (1, 20000000) i;
> SELECT 20000000
> postgres=# CREATE INDEX bloomidx2 ON tdummy USING bloom (a,b,c,d,e,f,g,h);
> CREATE INDEX
> postgres=# analyze tdummy ;
> ANALYZE
> postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and
> c = 1 and d = 1 and e = 1;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..335576.41 rows=1 width=32) (actual
> time=0.573..437.776 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tdummy  (cost=0.00..334576.31 rows=1 width=32)
> (actual time=285.408..430.800 rows=0 loops=3)
>          Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
>          Rows Removed by Filter: 6666666
>  Planning Time: 0.552 ms
>  Execution Time: 437.817 ms
> (8 rows)
>
> postgres=# set parallel_setup_cost = 10000;
> SET
> postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and
> c = 1 and d = 1 and e = 1;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=10000.00..344576.41 rows=1 width=32) (actual
> time=0.439..316.780 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tdummy  (cost=0.00..334576.31 rows=1 width=32)
> (actual time=206.356..311.511 rows=0 loops=3)
>          Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
>          Rows Removed by Filter: 6666666
>  Planning Time: 0.197 ms
>  Execution Time: 316.807 ms
> (8 rows)
>
> postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and
> c = 1 and d = 1 and e = 1;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=10000.00..344572.10 rows=1 width=32) (actual
> time=1.284..291.015 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tdummy  (cost=0.00..334572.00 rows=1 width=32)
> (actual time=188.019..284.259 rows=0 loops=3)
>          Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
>          Rows Removed by Filter: 6666666
>  Planning Time: 0.311 ms
>  Execution Time: 291.055 ms
> (8 rows)
>
> postgres=# set parallel_setup_cost = 100000;
> SET
> postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and
> c = 1 and d = 1 and e = 1;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=100000.00..434572.10 rows=1 width=32) (actual
> time=0.547..292.466 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on tdummy  (cost=0.00..334572.00 rows=1 width=32)
> (actual time=188.160..285.308 rows=0 loops=3)
>          Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
>          Rows Removed by Filter: 6666666
>  Planning Time: 0.275 ms
>  Execution Time: 292.500 ms
> (8 rows)
>
> postgres=# set parallel_setup_cost = 1000000;
> SET
> postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and
> c = 1 and d = 1 and e = 1;
>                                                        QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on tdummy  (cost=506868.00..506872.02 rows=1 width=32)
> (actual time=54.085..54.092 rows=1 loops=1)
>    Recheck Cond: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
>    Rows Removed by Index Recheck: 1
>    Heap Blocks: exact=2
>    ->  Bitmap Index Scan on bloomidx2  (cost=0.00..506868.00 rows=1
> width=0) (actual time=54.047..54.047 rows=2 loops=1)
>          Index Cond: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e =
> 1))
>  Planning Time: 0.165 ms
>  Execution Time: 54.847 ms
> (8 rows)
>
> Regards
> Daniel
>
> [1]
> https://www.postgresql.org/message-id/flat/ZR0P278MB0122119FAE78721A694C30C8D2340%40ZR0P278MB0122.CHEP278.PROD.OUTLOOK.COM
>
>
>

-- 
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Re: Bloom filters and the planner / parallel execution

Reply via email to