Thank you for the feedback. I'll try to prototype it for all the affected rules.
Regards, Vladimir чт, 18 февр. 2021 г. в 13:02, Stamatis Zampetakis <[email protected]>: > Hi Vladimir, > > Thanks for bringing up this topic. > > There are various works who highlight the importance of cartesian products > for optimality and try to derive techniques for exploring the complete > search space efficiently. > Having said that indeed not considering cartesian products is a common & > popular heuristic for queries with many relations so I find it a good idea > to give to the users the option to use it or not. > > Best, > Stamatis > > On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <[email protected]> > wrote: > > > Hi Julian, > > > > First of all, I want to clarify that if we decide to do the cross-join > > suppression, we would have to do that for JoinCommuteRule either. I > missed > > that in the original email. > > > > Now back to your comments. The proposal assumes that we start with a > > connected join graph without the cross-products and never generate other > > cross-products. This covers many common queries, but not all. First, as > you > > mentioned, some queries may potentially benefit from cross-products. I do > > not have specific queries in mind, but I cannot prove that such queries > > don't exist either. Second, we may have cross-products in an otherwise > > connected join graph if the join condition is located in the WHERE > clause. > > For such queries, we may need to do some pre-processing, like filter > > pushdown. Therefore, I assume that real-world products would have to do > > some preliminary analysis and/or processing to decide whether it is safe > to > > use the cross-join suppression. This might not be very convenient for new > > users, but IMO perfectly fine for production-grade systems because they > > typically do complex multi-phase optimization anyway. For example, recall > > how VoltDB decides whether to perform the join planning in the physical > > phase depending on the number of joins in the query [1]. > > > > Regarding cardinality estimations, the main problem is that the built-in > > Calcite mechanism is pretty imprecise because they use mainly row count > and > > magic numbers. For example, in TPC-H queries, the cardinality estimations > > easily reach billions and billions of rows. And even if the system has > more > > advanced cardinality estimators, such as histograms, the estimation > errors > > are likely to build up quickly still [2]. That said, it could be > difficult > > to design a robust heuristic to suppress particular joins. The advantage > of > > a straightforward cross-join suppression heuristic is that it is a good > > choice for a good fraction of real-world queries. > > > > To clarify, the proposal is not to make the cross-join suppression the > > default behavior. Rather, we may add it as a configuration property to > join > > planning rules, that could be used by advanced users. > > > > Regards, > > Vladimir. > > > > [1] > > > > > https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274 > > [2] https://dl.acm.org/doi/10.14778/2850583.2850594 > > > > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <[email protected]>: > > > > > Two separate points. > > > > > > 1. I recall that there is an important class of plans that first forms > > the > > > Cartesian product of some very small relations, and then joins > everything > > > else to it. Does anyone else encounter these plans? Would these plans > be > > > unavailable if we made your proposed change? > > > > > > 2. If I join on a low cardinality attribute (eg customers to suppliers > on > > > state) the result is almost as bad a cross join. I presume you would > want > > > to treat this join similarly. If so, it would make sense to have the > > rules > > > on selectivity (or similar) rather than structure alone. > > > > > > Julian > > > > > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <[email protected]> > wrote: > > > > > > > > Hi, > > > > > > > > Join order planning is an important optimization problem. The > > widely-used > > > > heuristic is to consider all bushy trees without cross-joins. There > is > > > > proof [1] that a pair of commute and associate rules is a complete > > > ruleset > > > > to explore all bushy trees without the cross-joins if the initial > join > > > tree > > > > has no cross-joins. > > > > > > > > In Apache Calcite, we do not suppress cross-products. As a result, > even > > > > simple join topologies, like a chain, cannot be planned in a > reasonable > > > > time for more than 6-7 tables. How do you feel if we add an optional > > > > cross-join suppression to the JoinAssociateRule, and possibly > > > > JoinPushThroughJoinRule? > > > > > > > > The cross-join suppression is not the only problem with the > exhaustive > > > join > > > > planning in Apache Calcite. But at the very least, it is simple to > > > > implement, and it extends the number of tables in a join that could > be > > > > planned exhaustively for some topologies by additional 2-5 tables. > > > > > > > > Regards, > > > > Vladimir. > > > > > > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997 > > > > > >
