The only scenario I can come up with a possible solution for so far is the one
where the query's predicate columns and grouping columns all belong to one table.
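To make that scenario concrete, here is a minimal sketch of the kind of query
and extended statistics I have in mind (the table t, the columns a, b, c, and
the statistics name t_b_c_ndistinct are only illustrative assumptions of mine,
not objects from the earlier thread):

    -- hypothetical table: b is the predicate column, c is the grouping column
    CREATE TABLE t (a int, b int, c int);

    -- multi-column ndistinct statistics that the formula below relies on
    CREATE STATISTICS t_b_c_ndistinct (ndistinct) ON b, c FROM t;
    ANALYZE t;

    -- the predicate column (b) and the grouping column (c) belong to the
    -- same table; the formula below would estimate num_groups for this
    -- query as roughly ndistinct(b, c) * selectivity(b = 3)
    SELECT c, count(*)
    FROM t
    WHERE b = 3
    GROUP BY c;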
For a query that contains a WHERE clause, perhaps num_groups could be
estimated according to the following formula:

    num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
                 * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
                 / ndistinct(pred_col_1, pred_col_2, ... pred_col_n)

    ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
                 = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel

Here pred_col_1 ... pred_col_n are the columns involved in the WHERE clause
and sort_col_1 ... sort_col_m are the columns involved in the GROUP BY clause.
The reason for multiplying by the selectivity of rel directly is that the
selectivity of rel depends only on the pred_cols, not on the sort_cols. So the
formula above can be simplified as follows:

    num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
                 * selectivity of rel

The correctness of this formula depends on the following conditions:

- The ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
  ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
  statistics already exist, and they need to be accurate.
- Both the pred_cols and the sort_cols are uniformly distributed; if not,
  statistics such as MCV lists are needed for correction.
- The tuple count of rel is the total number of tuples of the table, not the
  number of filtered tuples.

After experimenting with the scenario mentioned in the previous thread, the
estimated num_groups is 3. The accuracy of the result relies strongly on the
uniform distribution of b, which allows ndistinct(pred_col_1, pred_col_2,
... pred_col_n) with where clause to be estimated accurately.

I'd like to hear your opinions.

Regards.

ywgrit.

On Mon, Dec 18, 2023 at 20:53, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
>
> On 12/18/23 11:40, Richard Guo wrote:
> >
> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> > <tomas.von...@enterprisedb.com <mailto:tomas.von...@enterprisedb.com>>
> > wrote:
> >
> >     Oh! Now I see what you meant by using the new formula in 84f9a35e3
> >     depending on how we sum tuples. I agree that seems like the right thing.
> >
> >     I'm not sure it'll actually help with the issue, though - if I apply the
> >     patch, the plan does not actually change (and the cost changes just a
> >     little bit).
> >
> >     I looked at this only very briefly, but I believe it's due to the
> >     assumption of independence I mentioned earlier - we end up using the new
> >     formula introduced in 84f9a35e3, but it assumes the selectivity and
> >     number of groups are independent. But that's not the case here, because
> >     the groups are very clearly correlated (with the condition on "b").
> >
> >
> > You're right. The patch allows us to adjust the estimate of distinct
> > values for appendrels using the new formula introduced in 84f9a35e3.
> > But if the restrictions are correlated with the grouping expressions,
> > the new formula does not behave well. That's why the patch does not
> > help in case [1], where 'b' and 'c' are correlated.
> >
> > OTOH, if the restrictions are not correlated with the grouping
> > expressions, the new formula would perform quite well. And in this case
> > the patch would help a lot, as shown in [2] where estimate_num_groups()
> > gives a much more accurate estimate with the help of this patch.
> >
> > So this patch could be useful in certain situations. I'm wondering if
> > we should at least have this patch (if it is right).
> >
>
> I do agree the patch seems to do the right thing, and it's worth pushing
> on its own.
>
>
> >     If that's the case, I'm not sure how to fix this :-(
> >
> >
> > The commit message of 84f9a35e3 says
> >
> >     This could possibly be improved upon in the future by identifying
> >     correlated restrictions and using a hybrid of the old and new
> >     formulae.
> >
> > Maybe this is something we can consider trying. But anyhow this is not
> > an easy task I suppose.
>
> Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)
>
> The challenge is where to get usable information about correlation
> between columns. I only have a couple very rough ideas of what we might
> try. For example, if we have multi-column ndistinct statistics, we might
> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>
>     ndistinct(b,c,d) / ndistinct(b,c)
>
> If we know how many distinct values we have for the predicate column, we
> could then estimate the number of groups. I mean, we know that for the
> restriction "WHERE b = 3" we only have 1 distinct value, so we could
> estimate the number of groups as
>
>     1 * ndistinct(b,c)
>
> I'm well aware this is only a very trivial example, and for more complex
> examples it's likely way more complicated. But hopefully it illustrates
> the general idea.
>
> The other idea would be to maybe look at multi-column MCV, and try using
> it to deduce cross-column correlation too (it could be more accurate for
> arbitrary predicates).
>
> And finally, we might try being a bit more pessimistic and look at what
> the "worst case" behavior could be. So for example instead of trying to
> estimate the real number of groups, we'd ask "What's the minimum number
> of groups we're likely to get?". And use that to cost the incremental
> sort. But I don't think we do this elsewhere, and I'm not sure we want
> to start with this risk-based approach here. It might be OK, because we
> usually expect the incremental sort to be much cheaper, ...
>
> Is this something you would be interested in exploring? I don't have
> capacity to work on this myself, but I'd be available for reviews,
> feedback and so on.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company