Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it, you might want to
declare it with other functions in the starting of the file.

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.

Sorry, I will fix it.
Actually this patch contains of two independent parts:
first allows to use auto_explain extension to generate mutlicolumn statistic for variables used in clauses for which selectivity estimation gives wrong result. It affects only auto_explain extension.

Second part allows to use multicolumn statistic for join selectivity estimation.
As far as I know extended statistic is now actively improved:
https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistic for join selectivity is very important and should also be addressed.
If my approach is on so good, I will be pleased for other suggestions.


A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for this
mores statistics , and also compare it with the benefit it gives in
terms of accuracy.
Adding statistic not always leads to performance improvement but I never observed any performance degradation caused by presence of extended statistic. Definitely we can manually create too many extended statistic entries for different subsets of columns. And it certainly increase planning time because optimizer has to consider more alternatives.
But in practice I never noticed such slowdown.

- I would also be interested in understanding if there are cases when
adding this extra step doesn’t help and have you excluded them already
or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistic can not help.
Either because optimizer is not able to use it (for example my patch consider only cases with strict equality comparison, but if you use predicate like "a.x=b.x and  a.y in (1,2,3)"  then extended statistic for <x,y> can not be used. Either because collected statistic itself is not precise enough , especially in case of data skews.


- is there any limit  on the number of columns for which this will
work, or should there be any such limit...?

Right now there is limit for maximal number of columns used in extended statistic: 8 columns.
But in practice I rarely see join predicates involving more than 3 columns.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Reply via email to