Hi Tomas and others,

2015-03-02 21:29 GMT+01:00 Tomas Vondra <tomas.von...@2ndquadrant.com>:
> Hi David ;-)
>
> On 2.3.2015 20:19, David Kubečka wrote:
> >
> > The question is why the optimizer, or rather the cost estimator,
> > produced such different estimates upon a very small change in the
> > input. Moreover, it seems that the main culprit of the bad estimates
> > isn't actually directly related to the outer table, but rather to the
> > inner table. Just compare the estimates for the two index scans:
> >
> > With 'random_fk_dupl':
> >   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> >       rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> > With 'random_fk_uniq':
> >   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> >       rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
> >
> > I have read the optimizer README file and also looked briefly at the
> > code, but this seems to be something not related to a particular
> > algorithm implementation (e.g. nested loop). Perhaps it's the way
> > cost estimates are propagated down (or sideways? that would be
> > weird...) the query tree. But I am really not sure, since this is my
> > first time looking at the optimizer code base. I should also add that
> > I have reproduced this behaviour on all versions of Pg from 9.2 up
> > to the current devel.
>
> Interesting. I've only spent a few minutes looking at this, but it
> certainly is a bit strange that using a smaller table with unique values
> results in a slower plan than using a table with duplicates (but the
> same number of unique values).

Yep. And I think I have found the cause of the strangeness. I have traced
the planner code, and this is how cost_index from costsize.c is called
when computing the cost of the index scan on the outer (facts) table,
when the inner table is the one with duplicate entries (random_fk_dupl):

  cost_index (path=0x27f8fb8, root=0x27670e8, loop_count=9920)

The important thing here is the value of loop_count, which is the
(perfect estimate of the) number of rows of random_fk_dupl (I have
recreated the table, so it differs slightly from the previously posted
version (9893)). Now the predominant part of the run cost is computed by
this expression:

  run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

Since csquared (indexCorrelation * indexCorrelation) is very small in
this case, the biggest term here is max_IO_cost, which is computed as
follows:

  max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;

The division by loop_count is there to model the predicted effect of
caching, and it is exactly this division which makes the run_cost for a
single index lookup so low compared with the query version using
random_fk_uniq. So the main problem is that the planner calls cost_index
with loop_count equal to the number of rows of the inner table, *but it
ignores the semi-join semantics*, i.e. it doesn't account for the number
of *unique* rows in the inner table, which is what the number of loops
will actually be.
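To put rough numbers on this (back-of-the-envelope only, assuming
pages_fetched is about the same in both cases): for random_fk_dupl the
division is by loop_count = 9920, while for random_fk_uniq it is by only
~99 (the number of rows there, all of them unique). So the max_IO_cost
term comes out roughly a hundred times smaller in the duplicate case,
which is the bulk of the gap between the 5.75 and 214.26 index scan
estimates quoted above (the csquared * min_IO_cost part and the CPU costs
make the observed ratio somewhat smaller). The two candidate values for
loop_count can be seen directly (assuming the tables from my original
post):

  -- what cost_index currently receives as loop_count, vs. the number of
  -- loops the semi-join will actually perform
  SELECT count(*) AS total_rows, count(DISTINCT fk) AS distinct_fks
  FROM random_fk_dupl;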
Since this is my first time looking into the optimizer (and in fact any
Postgres) code, I am not yet able to locate the exact place where this
should be repaired, but I hope that in a few days I will have a patch :-)

> Another observation is that the planner does see other plans, because
> tweaking the cost variables like this:
>
>   SET random_page_cost = 2;
>
> produces this plan (which on my machine runs just a tad slower than the
> first query in your post):
>
>                              QUERY PLAN
>  ---------------------------------------------------------------------
>  HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
>          ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
>                Group Key: random_fk_uniq.fk
>                ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
>          ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
>                Recheck Cond: (fk = random_fk_uniq.fk)
>                ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
>                      Index Cond: (fk = random_fk_uniq.fk)

Yeah, this now makes perfect sense to me. From the run_cost and
max_IO_cost formulas you can see that (in this case!) random_page_cost is
almost exactly the linear factor here. So eventually the result is the
complete opposite of what I wished for at first, i.e. instead of tweaking
Postgres to produce lower costs for random_fk_uniq it will start
producing higher costs for random_fk_dupl. But I now believe that this is
correct.

Regards,

-David

> I can get similar results by setting cpu_operator_cost=0.005. And
> further lowering random_page_cost to 1.1 actually gets me this:
>
>                              QUERY PLAN
>  ---------------------------------------------------------------------
>  HashAggregate  (cost=6185.41..6309.08 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=2.66..6135.95 rows=9893 width=15)
>          ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
>                Group Key: random_fk_uniq.fk
>                ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..60.95 ...)
>                Index Cond: (fk = random_fk_uniq.fk)
>
> which is exactly the first plan from your post.
>
> I still don't understand why using the smaller table produces a less
> efficient plan, but the estimated costs are very close (20013 vs. 18147
> is a very small difference in this context), and the estimator shoots
> for minimal average error. So it may seem 'weirdly pessimistic' but it
> might be equally optimistic for other queries.
>
> Also, keep in mind that the cost model is just a simplification of
> reality, so it can't be correct 100% of the time.
>
> The fact that this small difference in costs results in a significant
> duration difference suggests that the default optimizer cost values do
> not reflect your environment. For example, you assume that this is very
> fast:
>
>    NestLoop
>       -> Seq Scan on A
>       -> Index Scan using B_Y_IDX on B
>             Index Condition: B.Y = A.X
>
> but index scans often produce a lot of random I/O, and that may be quite
> expensive if it really hits the storage. And that's what the default
> value (random_page_cost=4) is set for. But if you're on a system with
> lots of RAM, or an SSD, ... that simply is not the case, and the default
> may penalize index scans (and prefer more sequential workloads).
>
> The other thing you might do is create an index on (fk,f), which on
> newer releases produces this:
>
>                               QUERY PLAN
>  -------------------------------------------------------------------------
>  HashAggregate  (cost=783.77..907.43 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=2.66..734.30 rows=9893 width=15)
>          ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
>                Group Key: random_fk_uniq.fk
>                ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
>          ->  Index Only Scan using facts_2_idx on facts  (cost=...)
>                Index Cond: (fk = random_fk_uniq.fk)
>
> which limits the amount of random I/O by only scanning the index, and is
> even faster than the first query.
>
> regards
>
> --
> Tomas Vondra                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
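P.S. For the record, this is roughly the recipe I will use to try out
both suggestions (the index name facts_2_idx is simply taken from the
plan above; the VACUUM is my own addition, since index-only scans only
pay off when the visibility map is reasonably fresh):

  -- make random I/O cheaper in the cost model, to better match
  -- RAM/SSD-backed storage (1.1 is one of the values Tomas tried above)
  SET random_page_cost = 1.1;

  -- covering index so the whole aggregate can be answered by an
  -- index-only scan
  CREATE INDEX facts_2_idx ON facts (fk, f);

  -- index-only scans rely on the visibility map, so keep it (and the
  -- statistics) fresh
  VACUUM ANALYZE facts;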