--On torsdag, januari 13, 2005 18.55.11 -0500 Tom Lane <[EMAIL PROTECTED]> wrote:

Palle Girgensohn <[EMAIL PROTECTED]> writes:
--On torsdag, januari 13, 2005 18.18.37 -0500 Tom Lane
<[EMAIL PROTECTED]>  wrote:
So there's something nuts about the statistics in this case.

On looking into it, it's the same old issue of not having column correlation statistics. pg_stats shows that ANALYZE estimated the fraction of rows with null group_id as 0.137667 (versus exact value of 0.147, not too bad considering I used the default statistics target) and it estimated the fraction with this_group_id = 46 as 0.358 (vs actual 0.369, ditto). The problem is that it then estimates the total selectivity as 0.137667 * 0.358 or a bit under 5%, much too high to make an indexscan sensible. In reality there are only 4 rows with this combination of values, but the planner has no way to know that.

Anything I can do about it?

I thought of a fairly miserable hack, which relies on the fact that 8.0 does know how to accumulate statistics on functional indexes:

group=# create index fooi on group_data (abs(this_group_id)) WHERE
group_id IS NULL; CREATE INDEX
group=# analyze group_data;
ANALYZE
group=# explain select * from group_data where group_id is null and
abs(this_group_id) = 46;                                   QUERY PLAN
-------------------------------------------------------------------------
-----  Index Scan using fooi on group_data  (cost=0.00..5302.60 rows=1802
width=42)    Index Cond: (abs(this_group_id) = 46)
   Filter: (group_id IS NULL)
(3 rows)

(The choice of abs() is arbitrary, it just has to be something other
than the unadorned column.)  In this situation the planner will look at
the stats for the functional index and discover that in that index there
aren't many 46's, so it comes out with a more reasonable rowcount
estimate.

OK, I think I understand. And this is changed between 7.4.x and 8.0?

We should probably make it accumulate stats on partial indexes even when
the index columns aren't expressions.  This example shows that useful
stats can be derived that way.  Too late for 8.0 though...

True, but for next version, perhaps? :)

Trying all this out, I realize that on 7.4.5, I can sometimes get different results after `vacuum analyze' vs. a plain `analyze' (again, not exactly the same data, and I cannot reproduce this on the other machine with the data I sent you). It does not really relate to the question above, but perhaps you can explain how come I get different results?

I join with a table person, group_data.item_text has person.userid as foreign key constraint:

7.4.5:

pp=# vacuum analyze group_data;
VACUUM
Time: 256353,802 ms
pp=# select
pp-#   distinct p.last_name,
pp-#   p.userid
pp-#  from
pp-#   group_data gd join person p on (p.userid = gd.item_text)
pp-#  where
pp-#   gd.this_group_id = 46
pp-#   and gd.group_id is null;
last_name |  userid
-----------+----------
Lastname  | u1wmd5nn
(1 row)

Time: 6223,123 ms
pp=# explain analyze
pp-# select
pp-# distinct p.last_name,
pp-# p.userid
pp-# from
pp-# group_data gd join person p on (p.userid = gd.item_text)
pp-# where
pp-# gd.this_group_id = 46
pp-# and gd.group_id is null;
QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=76016.57..77215.19 rows=92632 width=23) (actual time=7649.496..7649.512 rows=1 loops=1)
-> Sort (cost=76016.57..76416.11 rows=159816 width=23) (actual time=7649.481..7649.487 rows=1 loops=1)
Sort Key: p.last_name, p.userid
-> Hash Join (cost=3003.90..62203.64 rows=159816 width=23) (actual time=7649.254..7649.435 rows=1 loops=1)
Hash Cond: ("outer".item_text = "inner".userid)
-> Seq Scan on group_data gd (cost=0.00..53238.10 rows=160565 width=12) (actual time=431.078..5927.410 rows=5 loops=1)
Filter: ((this_group_id = 46) AND (group_id IS NULL))
-> Hash (cost=2229.32..2229.32 rows=92632 width=23) (actual time=1555.797..1555.797 rows=0 loops=1)
-> Seq Scan on person p (cost=0.00..2229.32 rows=92632 width=23) (actual time=0.093..856.728 rows=92632 loops=1)
Total runtime: 7652.771 ms
(10 rows)


Time: 7656,909 ms
pp=# select * from group_data where this_group_id=46 and group_id is null;
this_group_id | group_id | item_text | item_int | link_path
---------------+----------+-----------+----------+-----------
           46 |          |           |     1223 | :46:
           46 |          |           |     1228 | :46:
           46 |          |           |     1328 | :46:
           46 |          |           |     1391 | :46:
           46 |          | u1wmd5nn  |          | :46:
(5 rows)

Time: 5891,716 ms
pp=# analyze group_data;
ANALYZE
Time: 3210,096 ms
pp=# explain select
pp-# distinct p.last_name,
pp-# p.userid
pp-# from
pp-# group_data gd join person p on (p.userid = gd.item_text)
pp-# where
pp-# gd.this_group_id = 46
pp-# and gd.group_id is null;
QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
Unique (cost=111417.47..113761.30 rows=92632 width=23)
-> Sort (cost=111417.47..112198.75 rows=312510 width=23)
Sort Key: p.last_name, p.userid
-> Hash Join (cost=3003.90..79231.40 rows=312510 width=23)
Hash Cond: ("outer".item_text = "inner".userid)
-> Index Scan using group_data_tgid_gidnull_idx on group_data gd (cost=0.00..65091.35 rows=275225 width=11)
Index Cond: (this_group_id = 46)
Filter: (group_id IS NULL)
-> Hash (cost=2229.32..2229.32 rows=92632 width=23)
-> Seq Scan on person p (cost=0.00..2229.32 rows=92632 width=23)
(10 rows)


Time: 6,647 ms
pp=# select
pp-#   distinct p.last_name,
pp-#   p.userid
pp-#  from
pp-#   group_data gd join person p on (p.userid = gd.item_text)
pp-#  where
pp-#   gd.this_group_id = 46
pp-#   and gd.group_id is null;
last_name |  userid
-----------+----------
Lastname  | u1wmd5nn
(1 row)

Time: 772,969 ms
pp=# select
pp-#   distinct p.last_name,
pp-#   p.userid
pp-#  from
pp-#   person p, group_data gd
pp-#  where
pp-#   p.userid = gd.item_text
pp-#   and gd.this_group_id = 46
pp-#   and gd.group_id is null
pp-# ;
last_name |  userid
-----------+----------
Lastname  | u1wmd5nn
(1 row)

Time: 720,345 ms


/Palle



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
     joining column's datatypes do not match

Reply via email to