Re: Wrong example in the bloom documentation

Bruce Momjian Mon, 26 Oct 2020 16:19:44 -0700

On Mon, Oct 26, 2020 at 05:04:09PM -0400, Bruce Momjian wrote:
> On Sat, Oct 17, 2020 at 01:50:26PM +0000, Daniel Westermann (DWE) wrote:
> > On Fri, Oct  9, 2020 at 11:08:32AM -0400, Bruce Momjian wrote:
> > >This is not applying to PG 12 or earlier because the patch mentions JIT,
> > >which was only mentioned in the PG bloom docs in PG 13+.
> > 
> > Does that mean we need separate patches for each release starting with 10? 
> > As I am not frequently writing patches, I would need some help here.
> 
> I can regenerate the output for older versions using your patch.
> However, I am confused about the parallelism you are seeing.  Your patch
> shows:
> 
>          Without the two indexes being created, a parallel sequential scan 
> will happen for the query below:
>                                                   -------------------
>       <programlisting>
>       =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 
> 123451;
>                                                   QUERY PLAN
>       
> ---------------------------------------------------------------------------------------------------
>        Seq Scan on tbloom  (cost=0.00..214.00 rows=1 width=24) (actual 
> time=2.729..2.731 rows=0 loops=1)
>          Filter: ((i2 = 898732) AND (i5 = 123451))
>          Rows Removed by Filter: 10000
>        Planning Time: 0.257 ms
>        Execution Time: 2.764 ms
>       (5 rows)
> 
> However, I don't see any parallelism in this output.  Also, I don't see
> any parallelism once the indexes are created.  What PG version is this?
> and what config settings did you use?  Thanks.


I figured it out --- you have to use the larger generate_series value to
get the parallel output.  I have adjusted all the docs back to 9.6 to
show accurate output for that version, and simplified the query
ordering --- patch to master attached.  The other releases are similar. 
Daniel, please let me know if I have left out any details.

-- 
  Bruce Momjian  <[email protected]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
new file mode 100644
index 285b67b..d1cf9ac
*** a/doc/src/sgml/bloom.sgml
--- b/doc/src/sgml/bloom.sgml
*************** CREATE INDEX bloomidx ON tbloom USING bl
*** 110,184 ****
     FROM
    generate_series(1,10000000);
  SELECT 10000000
- =# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
- CREATE INDEX
- =# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
-  pg_size_pretty
- ----------------
-  153 MB
- (1 row)
- =# CREATE index btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
- CREATE INDEX
- =# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
-  pg_size_pretty
- ----------------
-  387 MB
- (1 row)
  </programlisting>
  
    <para>
     A sequential scan over this large table takes a long time:
  <programlisting>
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                  QUERY PLAN
! -------------------------------------------------------------------&zwsp;-----------------------------------------
!  Seq Scan on tbloom  (cost=0.00..213694.08 rows=1 width=24) (actual time=1445.438..1445.438 rows=0 loops=1)
     Filter: ((i2 = 898732) AND (i5 = 123451))
!    Rows Removed by Filter: 10000000
!  Planning time: 0.177 ms
!  Execution time: 1445.473 ms
  (5 rows)
  </programlisting>
    </para>
  
    <para>
!    So the planner will usually select an index scan if possible.
!    With a btree index, we get results like this:
  <programlisting>
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                            QUERY PLAN
! -------------------------------------------------------------------&zwsp;-------------------------------------------------------------
!  Index Only Scan using btreeidx on tbloom  (cost=0.56..298311.96 rows=1 width=24) (actual time=445.709..445.709 rows=0 loops=1)
!    Index Cond: ((i2 = 898732) AND (i5 = 123451))
!    Heap Fetches: 0
!  Planning time: 0.193 ms
!  Execution time: 445.770 ms
  (5 rows)
  </programlisting>
    </para>
  
    <para>
!    Bloom is better than btree in handling this type of search:
  <programlisting>
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                         QUERY PLAN
! -------------------------------------------------------------------&zwsp;--------------------------------------------------------
!  Bitmap Heap Scan on tbloom  (cost=178435.39..178439.41 rows=1 width=24) (actual time=76.698..76.698 rows=0 loops=1)
     Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
!    Rows Removed by Index Recheck: 2439
!    Heap Blocks: exact=2408
!    -&gt;  Bitmap Index Scan on bloomidx  (cost=0.00..178435.39 rows=1 width=0) (actual time=72.455..72.455 rows=2439 loops=1)
           Index Cond: ((i2 = 898732) AND (i5 = 123451))
!  Planning time: 0.475 ms
!  Execution time: 76.778 ms
  (8 rows)
  </programlisting>
-    Note the relatively large number of false positives: 2439 rows were
-    selected to be visited in the heap, but none actually matched the
-    query.  We could reduce that by specifying a larger signature length.
-    In this example, creating the index with <literal>length=200</literal>
-    reduced the number of false positives to 55; but it doubled the index size
-    (to 306 MB) and ended up being slower for this query (125 ms overall).
    </para>
  
    <para>
--- 110,179 ----
     FROM
    generate_series(1,10000000);
  SELECT 10000000
  </programlisting>
  
    <para>
     A sequential scan over this large table takes a long time:
  <programlisting>
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                               QUERY PLAN                                              
! -------------------------------------------------------------------&zwsp;-----------------------------------
!  Seq Scan on tbloom  (cost=0.00..2137.14 rows=3 width=24) (actual time=16.971..16.971 rows=0 loops=1)
     Filter: ((i2 = 898732) AND (i5 = 123451))
!    Rows Removed by Filter: 100000
!  Planning Time: 0.346 ms
!  Execution Time: 16.988 ms
  (5 rows)
  </programlisting>
    </para>
  
    <para>
!    Even with the btree index defined the result will still be a
!    sequential scan:
  <programlisting>
+ =# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
+ CREATE INDEX
+ =# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
+  pg_size_pretty
+ ----------------
+  3976 kB
+ (1 row)
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                               QUERY PLAN                                              
! -------------------------------------------------------------------&zwsp;-----------------------------------
!  Seq Scan on tbloom  (cost=0.00..2137.00 rows=2 width=24) (actual time=12.805..12.805 rows=0 loops=1)
!    Filter: ((i2 = 898732) AND (i5 = 123451))
!    Rows Removed by Filter: 100000
!  Planning Time: 0.138 ms
!  Execution Time: 12.817 ms
  (5 rows)
  </programlisting>
    </para>
  
    <para>
!    Having the bloom index defined on the table is better than btree in
!    handling this type of search:
  <programlisting>
+ =# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
+ CREATE INDEX
+ =# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
+  pg_size_pretty
+ ----------------
+  1584 kB
+ (1 row)
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                      QUERY PLAN                                                      
! -------------------------------------------------------------------&zwsp;--------------------------------------------------
!  Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)
     Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
!    Rows Removed by Index Recheck: 29
!    Heap Blocks: exact=28
!    -&gt;  Bitmap Index Scan on bloomidx  (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)
           Index Cond: ((i2 = 898732) AND (i5 = 123451))
!  Planning Time: 0.099 ms
!  Execution Time: 0.408 ms
  (8 rows)
  </programlisting>
    </para>
  
    <para>
*************** CREATE INDEX
*** 187,210 ****
     A better strategy for btree is to create a separate index on each column.
     Then the planner will choose something like this:
  <programlisting>
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                           QUERY PLAN
! -------------------------------------------------------------------&zwsp;-----------------------------------------------------------
!  Bitmap Heap Scan on tbloom  (cost=9.29..13.30 rows=1 width=24) (actual time=0.148..0.148 rows=0 loops=1)
     Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
!    -&gt;  BitmapAnd  (cost=9.29..9.29 rows=1 width=0) (actual time=0.145..0.145 rows=0 loops=1)
!          -&gt;  Bitmap Index Scan on tbloom_i5_idx  (cost=0.00..4.52 rows=11 width=0) (actual time=0.089..0.089 rows=10 loops=1)
                 Index Cond: (i5 = 123451)
!          -&gt;  Bitmap Index Scan on tbloom_i2_idx  (cost=0.00..4.52 rows=11 width=0) (actual time=0.048..0.048 rows=8 loops=1)
                 Index Cond: (i2 = 898732)
!  Planning time: 2.049 ms
!  Execution time: 0.280 ms
  (9 rows)
  </programlisting>
     Although this query runs much faster than with either of the single
!    indexes, we pay a large penalty in index size.  Each of the single-column
!    btree indexes occupies 214 MB, so the total space needed is over 1.2GB,
!    more than 8 times the space used by the bloom index.
    </para>
   </sect2>
  
--- 182,217 ----
     A better strategy for btree is to create a separate index on each column.
     Then the planner will choose something like this:
  <programlisting>
+ =# CREATE INDEX btreeidx1 ON tbloom (i1);
+ CREATE INDEX
+ =# CREATE INDEX btreeidx2 ON tbloom (i2);
+ CREATE INDEX
+ =# CREATE INDEX btreeidx3 ON tbloom (i3);
+ CREATE INDEX
+ =# CREATE INDEX btreeidx4 ON tbloom (i4);
+ CREATE INDEX
+ =# CREATE INDEX btreeidx5 ON tbloom (i5);
+ CREATE INDEX
+ =# CREATE INDEX btreeidx6 ON tbloom (i6);
+ CREATE INDEX
  =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
!                                                         QUERY PLAN                                                         
! -------------------------------------------------------------------&zwsp;--------------------------------------------------------
!  Bitmap Heap Scan on tbloom  (cost=24.34..32.03 rows=2 width=24) (actual time=0.028..0.029 rows=0 loops=1)
     Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
!    -&gt;  BitmapAnd  (cost=24.34..24.34 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)
!          -&gt;  Bitmap Index Scan on btreeidx5  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.026 rows=0 loops=1)
                 Index Cond: (i5 = 123451)
!          -&gt;  Bitmap Index Scan on btreeidx2  (cost=0.00..12.04 rows=500 width=0) (never executed)
                 Index Cond: (i2 = 898732)
!  Planning Time: 0.491 ms
!  Execution Time: 0.055 ms
  (9 rows)
  </programlisting>
     Although this query runs much faster than with either of the single
!    indexes, we pay a penalty in index size.  Each of the single-column
!    btree indexes occupies 2 MB, so the total space needed is 12 MB,
!    eight times the space used by the bloom index.
    </para>
   </sect2>

Re: Wrong example in the bloom documentation

Reply via email to