I created the PR [1]. It adds a property that prunes new join trees if there is an "always true" condition. Strictly speaking, this is not a suppression of all possible cross-joins. But such check is cheap and should work well in practical optimizers assuming that a filter-pushdown is applied, and joins contain only those conditions, that couldn't be pushed down.
[1] https://github.com/apache/calcite/pull/2359 сб, 20 февр. 2021 г. в 20:34, Vladimir Ozerov <[email protected]>: > Thank you for the feedback. I'll try to prototype it for all the affected > rules. > > Regards, > Vladimir > > чт, 18 февр. 2021 г. в 13:02, Stamatis Zampetakis <[email protected]>: > >> Hi Vladimir, >> >> Thanks for bringing up this topic. >> >> There are various works who highlight the importance of cartesian products >> for optimality and try to derive techniques for exploring the complete >> search space efficiently. >> Having said that indeed not considering cartesian products is a common & >> popular heuristic for queries with many relations so I find it a good idea >> to give to the users the option to use it or not. >> >> Best, >> Stamatis >> >> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <[email protected]> >> wrote: >> >> > Hi Julian, >> > >> > First of all, I want to clarify that if we decide to do the cross-join >> > suppression, we would have to do that for JoinCommuteRule either. I >> missed >> > that in the original email. >> > >> > Now back to your comments. The proposal assumes that we start with a >> > connected join graph without the cross-products and never generate other >> > cross-products. This covers many common queries, but not all. First, as >> you >> > mentioned, some queries may potentially benefit from cross-products. I >> do >> > not have specific queries in mind, but I cannot prove that such queries >> > don't exist either. Second, we may have cross-products in an otherwise >> > connected join graph if the join condition is located in the WHERE >> clause. >> > For such queries, we may need to do some pre-processing, like filter >> > pushdown. Therefore, I assume that real-world products would have to do >> > some preliminary analysis and/or processing to decide whether it is >> safe to >> > use the cross-join suppression. This might not be very convenient for >> new >> > users, but IMO perfectly fine for production-grade systems because they >> > typically do complex multi-phase optimization anyway. For example, >> recall >> > how VoltDB decides whether to perform the join planning in the physical >> > phase depending on the number of joins in the query [1]. >> > >> > Regarding cardinality estimations, the main problem is that the built-in >> > Calcite mechanism is pretty imprecise because they use mainly row count >> and >> > magic numbers. For example, in TPC-H queries, the cardinality >> estimations >> > easily reach billions and billions of rows. And even if the system has >> more >> > advanced cardinality estimators, such as histograms, the estimation >> errors >> > are likely to build up quickly still [2]. That said, it could be >> difficult >> > to design a robust heuristic to suppress particular joins. The >> advantage of >> > a straightforward cross-join suppression heuristic is that it is a good >> > choice for a good fraction of real-world queries. >> > >> > To clarify, the proposal is not to make the cross-join suppression the >> > default behavior. Rather, we may add it as a configuration property to >> join >> > planning rules, that could be used by advanced users. >> > >> > Regards, >> > Vladimir. >> > >> > [1] >> > >> > >> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274 >> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594 >> > >> > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <[email protected]>: >> > >> > > Two separate points. >> > > >> > > 1. I recall that there is an important class of plans that first forms >> > the >> > > Cartesian product of some very small relations, and then joins >> everything >> > > else to it. Does anyone else encounter these plans? Would these plans >> be >> > > unavailable if we made your proposed change? >> > > >> > > 2. If I join on a low cardinality attribute (eg customers to >> suppliers on >> > > state) the result is almost as bad a cross join. I presume you would >> want >> > > to treat this join similarly. If so, it would make sense to have the >> > rules >> > > on selectivity (or similar) rather than structure alone. >> > > >> > > Julian >> > > >> > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <[email protected]> >> wrote: >> > > > >> > > > Hi, >> > > > >> > > > Join order planning is an important optimization problem. The >> > widely-used >> > > > heuristic is to consider all bushy trees without cross-joins. There >> is >> > > > proof [1] that a pair of commute and associate rules is a complete >> > > ruleset >> > > > to explore all bushy trees without the cross-joins if the initial >> join >> > > tree >> > > > has no cross-joins. >> > > > >> > > > In Apache Calcite, we do not suppress cross-products. As a result, >> even >> > > > simple join topologies, like a chain, cannot be planned in a >> reasonable >> > > > time for more than 6-7 tables. How do you feel if we add an optional >> > > > cross-join suppression to the JoinAssociateRule, and possibly >> > > > JoinPushThroughJoinRule? >> > > > >> > > > The cross-join suppression is not the only problem with the >> exhaustive >> > > join >> > > > planning in Apache Calcite. But at the very least, it is simple to >> > > > implement, and it extends the number of tables in a join that could >> be >> > > > planned exhaustively for some topologies by additional 2-5 tables. >> > > > >> > > > Regards, >> > > > Vladimir. >> > > > >> > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997 >> > > >> > >> >
