On Wed, Jan 15, 2025 at 1:58 AM Richard Guo <guofengli...@gmail.com> wrote:
> I understand that we're currently quite bad at estimating the number
> of groups after aggregation. In fact, it's not just aggregation
> estimates -- we're also bad at join estimates in some cases. This is a
> reality we have to face. Here's what I think: we should be trying our
> best to cost each node type as accurately as possible, and then build
> the upper nodes based on those costs. We should not conclude that,
> because we are unable to accurately cost one node type, we should
> avoid any cost-based optimizations above that node.

Well, I agree with that last sentence, for sure. But I don't think
it's true that the situations with joins and aggregates are
comparable. We are much better able to estimate the number of rows
that will come out of a join than we are to estimate the number of
rows that come out of an aggregate. It's certainly true that in some
cases we get join estimates badly wrong, and I'd like to see us do
better there, but our estimates of the number of distinct values that
exist in a column are the least reliable part of our statistics
system by far.

Also, we look at the underlying statistics for a column variable even
after joins and aggregates and assume (not having any other
information) that the distribution after that operation is likely to
be similar to the distribution before that operation. Consider a
table A with columns x and y. Let's say x is a unique ID and y is a
dependent value with some distribution over a finite range of
possibilities (e.g. a person's age). If we join table A to some other
table B on A.x = B.x and filter out some of the rows via that join,
the distribution of values in column y is likely to be altered. If
the rows are removed at random, the original distribution will
prevail, but often the removal won't be random and so the
distribution will change in a way we can't predict. However, guessing
that the pre-join distribution of A.y still prevails isn't crazy, and
it's better than assuming we can say nothing about the distribution.

But now let's say that after joining to B, we perform an aggregation
operation, computing the minimum value of A.y for each value of B.z.
At this point, we have no usable statistics for either output column.
The result must be unique on B.z, and the distribution of MIN(A.y) is
going to be entirely different from the distribution of A.y. Any
future joins that we perform here will have to be estimated without
any MCVs, which is going to reduce the accuracy of the estimation by
a lot.

In summary, the join makes relying on our MCV information less likely
to be accurate, but the aggregate makes it impossible to rely on our
MCV information at all. In terms of the accuracy of our results, that
is a lot worse.
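
To make that concrete, the kind of query I have in mind looks roughly
like this (table and column names are only illustrative, following
the A/B example above):

    CREATE TABLE a (x int PRIMARY KEY, y int);  -- x is a unique ID, y is e.g. an age
    CREATE TABLE b (x int, z int);

    SELECT b.z, min(a.y)
    FROM a JOIN b ON a.x = b.x
    GROUP BY b.z;

Once that aggregate has been computed, we know the output is unique
on b.z, but we have no MCVs or other usable statistics for min(a.y),
so any join stacked on top of this output has to be estimated more or
less blind.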

> I believe the HashAggregate node in this plan faces the same problem
> with inaccurate estimates. However, I don't think it's reasonable to
> say that, because we cannot accurately cost the Aggregate node, we
> should disregard considering JOIN_UNIQUE_OUTER/INNER.

Fair point.

> Back in August, I responded to this issue by "Maybe we can run some
> benchmarks first and investigate the regressions discovered on a
> case-by-case basis". In October, I ran the TPC-DS benchmark at scale
> 10 and observed that eager aggregation was applied in 7 queries, with
> no notable regressions. In contrast, Q4 and Q11 showed performance
> improvements of 3–4 times. Please see [1].

I had forgotten about that, and again, fair point, but I'm concerned
that it might not be a broad enough base of queries to test against.
(7 isn't a very large number.)

> Actually, I introduced the EAGER_AGGREGATE_RATIO mechanism in October
> to limit the planning effort for eager aggregation. For more details,
> please see [2].

OK, I missed this, but...

> And I don't think it's correct to say that we create a partially
> grouped rel for every baserel and every joinrel. This patch includes
> a bunch of logic to determine whether it's appropriate to create a
> grouped rel for a base or join rel. Furthermore, with the
> EAGER_AGGREGATE_RATIO mechanism, even if creating a grouped rel is
> possible, we will skip it if the grouped paths are considered not
> useful. All of these measures help reduce the number of grouped
> paths as well as the grouped relations in many cases where eager
> aggregation would not help a lot.

...it looks to me like EAGER_AGGREGATE_RATIO is used to set the
RelAggInfo's agg_useful field, which seems like it happens after the
RelOptInfo has already been created. I had been looking for something
that would avoid creating the RelOptInfo in the first place and I
didn't see it.

> Based on the TPC-DS benchmark results, I don't see "a lot of overhead"
> in the planning cost, at least for the 7 queries where eager
> aggregation is applied. As I said in [1], "For the planning time, I
> do not see notable regressions for any of the seven queries". In
> fact, I initially thought that we might consider enabling this by
> default, given the positive benchmark results, but I just couldn't
> summon the courage to do it. Perhaps we should reconsider enabling it
> by default, so users can benefit from the new feature and help
> identify any potential bugs.

If you're going to commit this, I think it would be a good idea to
enable it by default at least for now. If there are problems, it's
better to find out about them sooner rather than later. If they are
minor they can be fixed; if they are major, we can consider whether
it is better to fix them, disable the feature by default, or revert.
We can add an open item to reconsider the default setting during
beta.

> > As I said back in October, this seems like mixing together in one
> > RelOptInfo paths that really belong to two different RelOptInfos.
>
> I understand that you said about the design in October where
> "PartialAgg(t1 JOIN t2) and t1 JOIN PartialAgg(t2) get separate
> RelOptInfos", because "it's less clear whether it's fair to compare
> across the two categories". I've shared my thoughts on this in [5].
>
> Furthermore, even if we separate these grouped paths into two
> different RelOptInfos, we still face the issue that "different paths
> of a grouped relation can have very different row counts", and we need
> a way to handle this. One could argue that we can separate the
> grouped paths where partial aggregation is placed at different
> locations into different RelOptInfos, but this would lead to an
> explosion in the number of RelOptInfos for grouped relations as we
> climb up the join tree. I think this is neither realistic nor
> necessary.

It's possible you're right, but it does make me nervous. I do agree
that making the number of RelOptInfos explode would be really bad.

--
Robert Haas
EDB: http://www.enterprisedb.com