Re: [GENERAL] I'd like to learn a bit more about how indexes work

Chris Curvey Tue, 05 Jun 2012 19:21:37 -0700

On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <m...@kitchenpc.com> wrote:


> Hi -
>
> I'm trying to increase my general knowledge about how indexes work in
> databases.  Though my questions are probably general and implemented
> in a similar way across major relational DBs, I'm also curious as to
> how they're implemented in Postgres specifically (mainly because I
> like PG, and am always interested in knowing if PG does things in some
> cool and interesting way).
>

Quick!   Create some test data!

drop table if exists foobar;
create table foobar
(  a int not null primary key
,  b int null
,  c int null
,  d int null
);
create index b_idx on foobar (b);
create index c_idx on foobar (c);
create index d_idx on foobar (d);

create or replace function generate_test_data() returns void as $$
declare
  i integer;
begin
  for i in 1..100000 loop
     insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
  end loop;
end;
$$ language plpgsql;

select generate_test_data();

vacuum analyze foobar;


>
> I know the basics of how binary trees work, so I understand a query such
> as :
>
> select * from Table where Id = 5;
>
> Provided Id has a btree index on it.


sometimes.  Sometimes not.

explain analyze
select * from foobar
where a = 1234;
                                                     QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
 Index Scan using foobar_pkey on foobar  (cost=0.00..8.38 rows=1 width=16)
(actual time=0.008..0.008 rows=1 loops=1)
   Index Cond: (a = 1234)
 Total runtime: 0.030 ms
(3 rows)

explain analyze
select *
from foobar
where b = 1;
                                                 QUERY PLAN

-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..1791.00 rows=50270 width=16) (actual
time=0.011..13.603 rows=50000 loops=1)
   Filter: (b = 1)
   Rows Removed by Filter: 50000
 Total runtime: 16.005 ms
(4 rows)


>  I'm curious as to how indexes
> are used with OR and AND clauses.
>
> Something like:
>
> select * from Table where X = 5 or y = 3;
>
> It seems to me both the index of X would be scanned and those rows
> would be loaded into memory, and then the index of Y would be scanned
> and loaded.  Then, Postgres would have to merge both sets into a set
> of unique rows.  Is this pretty much what's going on?  Let's ignore
> table stats for now.
>

The right answer is "sometimes".  Here's a query that is solved the way you
expect:

explain analyze
select *
from foobar
where c = 1
or d = 1;
                                                          QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=226.47..920.65 rows=10202 width=16)
(actual time=1.284..3.784 rows=10000 loops=1)
   Recheck Cond: ((c = 1) OR (d = 1))
   ->  BitmapOr  (cost=226.47..226.47 rows=10212 width=0) (actual
time=1.174..1.174 rows=0 loops=1)
         ->  Bitmap Index Scan on c_idx  (cost=0.00..216.23 rows=10113
width=0) (actual time=1.157..1.157 rows=10000 loops=1)
               Index Cond: (c = 1)
         ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0)
(actual time=0.016..0.016 rows=100 loops=1)
               Index Cond: (d = 1)
 Total runtime: 4.452 ms
(8 rows)

And here is a query that is not solved the way you expect, because the
index on B is not useful.

explain analyze
select *
from foobar
where c = 1
or b = 1;
                                                 QUERY PLAN

-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2041.00 rows=55299 width=16) (actual
time=0.007..14.922 rows=50000 loops=1)
   Filter: ((c = 1) OR (b = 1))
   Rows Removed by Filter: 50000
 Total runtime: 17.002 ms
(4 rows)



>
> Then, something like:
>
> select * from Table where X = 5 AND y = 3;
>
> I would imagine the same thing is going on, only Postgres would find
> rows that appear in both sets.  I also imagine Postgres might create a
> hash table from the larger set, and then iterate through the smaller
> set looking for rows that were in that hash table


and you would be largely right about that.  Interestingly, on an earlier
run of this, I got a BitmapAnd strategy, rather than applying the c=1
filter to the rows after identifying all the d=1 rows.

explain analyze
select *
from foobar
where c = 1
and d = 1;
                                                   QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=5.13..256.48 rows=10 width=16) (actual
time=0.046..0.150 rows=100 loops=1)
   Recheck Cond: (d = 1)
   Filter: (c = 1)
   ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0)
(actual time=0.026..0.026 rows=100 loops=1)
         Index Cond: (d = 1)
 Total runtime: 0.179 ms
(6 rows)



>
> Lastly, If you had a query such as:
>
> select * from Table where X IN (1,2,3,4,5,6,7);
>
> I would imagine Postgres would parse that query as a bunch of OR
> clauses.  Does this mean the index for X would be scanned 7 times and
> merged into a set of unique results?  Though, obviously if Postgres
> estimated this would return the majority of the rows in the table, it
> would probably just ignore the index completely.
>

 and you would be right on both counts

explain analyze
select *
from foobar
where c in (1, 2, 3);
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=609.14..1562.18 rows=29967 width=16)
(actual time=3.456..7.589 rows=30000 loops=1)
   Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
   ->  Bitmap Index Scan on c_idx  (cost=0.00..601.64 rows=29967 width=0)
(actual time=3.342..3.342 rows=30000 loops=1)
         Index Cond: (c = ANY ('{1,2,3}'::integer[]))
 Total runtime: 9.083 ms
(5 rows)

explain analyze
select *
from foobar
where c in (1, 2, 3, 4, 5, 6);
                                                 QUERY PLAN

-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2291.00 rows=59723 width=16) (actual
time=0.005..18.450 rows=60000 loops=1)
   Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
   Rows Removed by Filter: 40000
 Total runtime: 21.181 ms
(4 rows)




> Thanks!
> Mike
>
>

Re: [GENERAL] I'd like to learn a bit more about how indexes work

Reply via email to