Hi Vladimir, Thanks for bringing up this topic.
There are various works who highlight the importance of cartesian products for optimality and try to derive techniques for exploring the complete search space efficiently. Having said that indeed not considering cartesian products is a common & popular heuristic for queries with many relations so I find it a good idea to give to the users the option to use it or not. Best, Stamatis On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <[email protected]> wrote: > Hi Julian, > > First of all, I want to clarify that if we decide to do the cross-join > suppression, we would have to do that for JoinCommuteRule either. I missed > that in the original email. > > Now back to your comments. The proposal assumes that we start with a > connected join graph without the cross-products and never generate other > cross-products. This covers many common queries, but not all. First, as you > mentioned, some queries may potentially benefit from cross-products. I do > not have specific queries in mind, but I cannot prove that such queries > don't exist either. Second, we may have cross-products in an otherwise > connected join graph if the join condition is located in the WHERE clause. > For such queries, we may need to do some pre-processing, like filter > pushdown. Therefore, I assume that real-world products would have to do > some preliminary analysis and/or processing to decide whether it is safe to > use the cross-join suppression. This might not be very convenient for new > users, but IMO perfectly fine for production-grade systems because they > typically do complex multi-phase optimization anyway. For example, recall > how VoltDB decides whether to perform the join planning in the physical > phase depending on the number of joins in the query [1]. > > Regarding cardinality estimations, the main problem is that the built-in > Calcite mechanism is pretty imprecise because they use mainly row count and > magic numbers. For example, in TPC-H queries, the cardinality estimations > easily reach billions and billions of rows. And even if the system has more > advanced cardinality estimators, such as histograms, the estimation errors > are likely to build up quickly still [2]. That said, it could be difficult > to design a robust heuristic to suppress particular joins. The advantage of > a straightforward cross-join suppression heuristic is that it is a good > choice for a good fraction of real-world queries. > > To clarify, the proposal is not to make the cross-join suppression the > default behavior. Rather, we may add it as a configuration property to join > planning rules, that could be used by advanced users. > > Regards, > Vladimir. > > [1] > > https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274 > [2] https://dl.acm.org/doi/10.14778/2850583.2850594 > > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <[email protected]>: > > > Two separate points. > > > > 1. I recall that there is an important class of plans that first forms > the > > Cartesian product of some very small relations, and then joins everything > > else to it. Does anyone else encounter these plans? Would these plans be > > unavailable if we made your proposed change? > > > > 2. If I join on a low cardinality attribute (eg customers to suppliers on > > state) the result is almost as bad a cross join. I presume you would want > > to treat this join similarly. If so, it would make sense to have the > rules > > on selectivity (or similar) rather than structure alone. > > > > Julian > > > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <[email protected]> wrote: > > > > > > Hi, > > > > > > Join order planning is an important optimization problem. The > widely-used > > > heuristic is to consider all bushy trees without cross-joins. There is > > > proof [1] that a pair of commute and associate rules is a complete > > ruleset > > > to explore all bushy trees without the cross-joins if the initial join > > tree > > > has no cross-joins. > > > > > > In Apache Calcite, we do not suppress cross-products. As a result, even > > > simple join topologies, like a chain, cannot be planned in a reasonable > > > time for more than 6-7 tables. How do you feel if we add an optional > > > cross-join suppression to the JoinAssociateRule, and possibly > > > JoinPushThroughJoinRule? > > > > > > The cross-join suppression is not the only problem with the exhaustive > > join > > > planning in Apache Calcite. But at the very least, it is simple to > > > implement, and it extends the number of tables in a join that could be > > > planned exhaustively for some topologies by additional 2-5 tables. > > > > > > Regards, > > > Vladimir. > > > > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997 > > >
