The only scenario I can come up with a possible solution for so far is the one
where the query's predicate columns and grouping columns all belong to one table.
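To make that scenario concrete, here is a minimal sketch of the kind of query
and extended statistics I have in mind (the table t, the columns a, b, c, and
the statistics name t_b_c_ndistinct are only illustrative assumptions of mine,
not objects from the earlier thread):

    -- hypothetical table: b is the predicate column, c is the grouping column
    CREATE TABLE t (a int, b int, c int);

    -- multi-column ndistinct statistics that the formula below relies on
    CREATE STATISTICS t_b_c_ndistinct (ndistinct) ON b, c FROM t;
    ANALYZE t;

    -- the predicate column (b) and the grouping column (c) belong to the
    -- same table; the formula below would estimate num_groups for this
    -- query as roughly ndistinct(b, c) * selectivity(b = 3)
    SELECT c, count(*)
    FROM t
    WHERE b = 3
    GROUP BY c;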
For a query that contains a WHERE clause, perhaps num_groups could be
estimated according to the following formula:

    num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
                 * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
                 / ndistinct(pred_col_1, pred_col_2, ... pred_col_n)

    ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
                 = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel

Here pred_col_1 ... pred_col_n are the columns involved in the WHERE clause
and sort_col_1 ... sort_col_m are the columns involved in the GROUP BY clause.
The reason for multiplying by the selectivity of rel directly is that the
selectivity of rel depends only on the pred_cols, not on the sort_cols. So the
formula above can be simplified as follows:

    num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
                 * selectivity of rel

The correctness of this formula depends on the following conditions:

- The ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
  ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
  statistics already exist, and they need to be accurate.
- Both the pred_cols and the sort_cols are uniformly distributed; if not,
  statistics such as MCV lists are needed for correction.
- The tuple count of rel is the total number of tuples of the table, not the
  number of filtered tuples.

After experimenting with the scenario mentioned in the previous thread, the
estimated num_groups is 3. The accuracy of the result relies strongly on the
uniform distribution of b, which allows ndistinct(pred_col_1, pred_col_2,
... pred_col_n) with where clause to be estimated accurately.

I'd like to hear your opinions.

Regards.

ywgrit.

On Mon, Dec 18, 2023 at 20:53, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
>
> On 12/18/23 11:40, Richard Guo wrote:
> >
> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> > <tomas.von...@enterprisedb.com <mailto:tomas.von...@enterprisedb.com>>
> > wrote:
> >
> >     Oh! Now I see what you meant by using the new formula in 84f9a35e3
> >     depending on how we sum tuples. I agree that seems like the right thing.
> >
> >     I'm not sure it'll actually help with the issue, though - if I apply the
> >     patch, the plan does not actually change (and the cost changes just a
> >     little bit).
> >
> >     I looked at this only very briefly, but I believe it's due to the
> >     assumption of independence I mentioned earlier - we end up using the new
> >     formula introduced in 84f9a35e3, but it assumes the selectivity and
> >     number of groups are independent. But that's not the case here, because
> >     the groups are very clearly correlated (with the condition on "b").
> >
> >
> > You're right. The patch allows us to adjust the estimate of distinct
> > values for appendrels using the new formula introduced in 84f9a35e3.
> > But if the restrictions are correlated with the grouping expressions,
> > the new formula does not behave well. That's why the patch does not
> > help in case [1], where 'b' and 'c' are correlated.
> >
> > OTOH, if the restrictions are not correlated with the grouping
> > expressions, the new formula would perform quite well. And in this case
> > the patch would help a lot, as shown in [2] where estimate_num_groups()
> > gives a much more accurate estimate with the help of this patch.
> >
> > So this patch could be useful in certain situations. I'm wondering if
> > we should at least have this patch (if it is right).
> >
>
> I do agree the patch seems to do the right thing, and it's worth pushing
> on its own.
>
>
> >     If that's the case, I'm not sure how to fix this :-(
> >
> >
> > The commit message of 84f9a35e3 says
> >
> >     This could possibly be improved upon in the future by identifying
> >     correlated restrictions and using a hybrid of the old and new
> >     formulae.
> >
> > Maybe this is something we can consider trying. But anyhow this is not
> > an easy task I suppose.
>
> Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)
>
> The challenge is where to get usable information about correlation
> between columns. I only have a couple very rough ideas of what we might
> try. For example, if we have multi-column ndistinct statistics, we might
> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>
>     ndistinct(b,c,d) / ndistinct(b,c)
>
> If we know how many distinct values we have for the predicate column, we
> could then estimate the number of groups. I mean, we know that for the
> restriction "WHERE b = 3" we only have 1 distinct value, so we could
> estimate the number of groups as
>
>     1 * ndistinct(b,c)
>
> I'm well aware this is only a very trivial example, and for more complex
> examples it's likely way more complicated. But hopefully it illustrates
> the general idea.
>
> The other idea would be to maybe look at multi-column MCV, and try using
> it to deduce cross-column correlation too (it could be more accurate for
> arbitrary predicates).
>
> And finally, we might try being a bit more pessimistic and look at what
> the "worst case" behavior could be. So for example instead of trying to
> estimate the real number of groups, we'd ask "What's the minimum number
> of groups we're likely to get?". And use that to cost the incremental
> sort. But I don't think we do this elsewhere, and I'm not sure we want
> to start with this risk-based approach here. It might be OK, because we
> usually expect the incremental sort to be much cheaper, ...
>
> Is this something you would be interested in exploring? I don't have
> capacity to work on this myself, but I'd be available for reviews,
> feedback and so on.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company