On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <m...@kitchenpc.com> wrote:
> Hi - > > I'm trying to increase my general knowledge about how indexes work in > databases. Though my questions are probably general and implemented > in a similar way across major relational DBs, I'm also curious as to > how they're implemented in Postgres specifically (mainly because I > like PG, and am always interested in knowing if PG does things in some > cool and interesting way). > Quick! Create some test data! drop table if exists foobar; create table foobar ( a int not null primary key , b int null , c int null , d int null ); create index b_idx on foobar (b); create index c_idx on foobar (c); create index d_idx on foobar (d); create or replace function generate_test_data() returns void as $$ declare i integer; begin for i in 1..100000 loop insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000); end loop; end; $$ language plpgsql; select generate_test_data(); vacuum analyze foobar; > > I know the basics of how binary trees work, so I understand a query such > as : > > select * from Table where Id = 5; > > Provided Id has a btree index on it. sometimes. Sometimes not. explain analyze select * from foobar where a = 1234; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Index Scan using foobar_pkey on foobar (cost=0.00..8.38 rows=1 width=16) (actual time=0.008..0.008 rows=1 loops=1) Index Cond: (a = 1234) Total runtime: 0.030 ms (3 rows) explain analyze select * from foobar where b = 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on foobar (cost=0.00..1791.00 rows=50270 width=16) (actual time=0.011..13.603 rows=50000 loops=1) Filter: (b = 1) Rows Removed by Filter: 50000 Total runtime: 16.005 ms (4 rows) > I'm curious as to how indexes > are used with OR and AND clauses. > > Something like: > > select * from Table where X = 5 or y = 3; > > It seems to me both the index of X would be scanned and those rows > would be loaded into memory, and then the index of Y would be scanned > and loaded. Then, Postgres would have to merge both sets into a set > of unique rows. Is this pretty much what's going on? Let's ignore > table stats for now. > The right answer is "sometimes". Here's a query that is solved the way you expect: explain analyze select * from foobar where c = 1 or d = 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on foobar (cost=226.47..920.65 rows=10202 width=16) (actual time=1.284..3.784 rows=10000 loops=1) Recheck Cond: ((c = 1) OR (d = 1)) -> BitmapOr (cost=226.47..226.47 rows=10212 width=0) (actual time=1.174..1.174 rows=0 loops=1) -> Bitmap Index Scan on c_idx (cost=0.00..216.23 rows=10113 width=0) (actual time=1.157..1.157 rows=10000 loops=1) Index Cond: (c = 1) -> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0) (actual time=0.016..0.016 rows=100 loops=1) Index Cond: (d = 1) Total runtime: 4.452 ms (8 rows) And here is a query that is not solved the way you expect, because the index on B is not useful. explain analyze select * from foobar where c = 1 or b = 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on foobar (cost=0.00..2041.00 rows=55299 width=16) (actual time=0.007..14.922 rows=50000 loops=1) Filter: ((c = 1) OR (b = 1)) Rows Removed by Filter: 50000 Total runtime: 17.002 ms (4 rows) > > Then, something like: > > select * from Table where X = 5 AND y = 3; > > I would imagine the same thing is going on, only Postgres would find > rows that appear in both sets. I also imagine Postgres might create a > hash table from the larger set, and then iterate through the smaller > set looking for rows that were in that hash table and you would be largely right about that. Interestingly, on an earlier run of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to the rows after identifying all the d=1 rows. explain analyze select * from foobar where c = 1 and d = 1; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on foobar (cost=5.13..256.48 rows=10 width=16) (actual time=0.046..0.150 rows=100 loops=1) Recheck Cond: (d = 1) Filter: (c = 1) -> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0) (actual time=0.026..0.026 rows=100 loops=1) Index Cond: (d = 1) Total runtime: 0.179 ms (6 rows) > > Lastly, If you had a query such as: > > select * from Table where X IN (1,2,3,4,5,6,7); > > I would imagine Postgres would parse that query as a bunch of OR > clauses. Does this mean the index for X would be scanned 7 times and > merged into a set of unique results? Though, obviously if Postgres > estimated this would return the majority of the rows in the table, it > would probably just ignore the index completely. > and you would be right on both counts explain analyze select * from foobar where c in (1, 2, 3); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on foobar (cost=609.14..1562.18 rows=29967 width=16) (actual time=3.456..7.589 rows=30000 loops=1) Recheck Cond: (c = ANY ('{1,2,3}'::integer[])) -> Bitmap Index Scan on c_idx (cost=0.00..601.64 rows=29967 width=0) (actual time=3.342..3.342 rows=30000 loops=1) Index Cond: (c = ANY ('{1,2,3}'::integer[])) Total runtime: 9.083 ms (5 rows) explain analyze select * from foobar where c in (1, 2, 3, 4, 5, 6); QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on foobar (cost=0.00..2291.00 rows=59723 width=16) (actual time=0.005..18.450 rows=60000 loops=1) Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[])) Rows Removed by Filter: 40000 Total runtime: 21.181 ms (4 rows) > Thanks! > Mike > >