I created the PR [1]. It adds a property that prunes new join trees if
there is an "always true" condition. Strictly speaking, this is not a
suppression of all possible cross-joins. But such check is cheap and should
work well in practical optimizers assuming that a filter-pushdown is
applied, and joins contain only those conditions, that couldn't be
pushed down.

[1] https://github.com/apache/calcite/pull/2359


сб, 20 февр. 2021 г. в 20:34, Vladimir Ozerov <[email protected]>:

> Thank you for the feedback. I'll try to prototype it for all the affected
> rules.
>
> Regards,
> Vladimir
>
> чт, 18 февр. 2021 г. в 13:02, Stamatis Zampetakis <[email protected]>:
>
>> Hi Vladimir,
>>
>> Thanks for bringing up this topic.
>>
>> There are various works who highlight the importance of cartesian products
>> for optimality and try to derive techniques for exploring the complete
>> search space efficiently.
>> Having said that indeed not considering cartesian products is a common &
>> popular heuristic for queries with many relations so I find it a good idea
>> to give to the users the option to use it or not.
>>
>> Best,
>> Stamatis
>>
>> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov <[email protected]>
>> wrote:
>>
>> > Hi Julian,
>> >
>> > First of all, I want to clarify that if we decide to do the cross-join
>> > suppression, we would have to do that for JoinCommuteRule either. I
>> missed
>> > that in the original email.
>> >
>> > Now back to your comments. The proposal assumes that we start with a
>> > connected join graph without the cross-products and never generate other
>> > cross-products. This covers many common queries, but not all. First, as
>> you
>> > mentioned, some queries may potentially benefit from cross-products. I
>> do
>> > not have specific queries in mind, but I cannot prove that such queries
>> > don't exist either. Second, we may have cross-products in an otherwise
>> > connected join graph if the join condition is located in the WHERE
>> clause.
>> > For such queries, we may need to do some pre-processing, like filter
>> > pushdown. Therefore, I assume that real-world products would have to do
>> > some preliminary analysis and/or processing to decide whether it is
>> safe to
>> > use the cross-join suppression. This might not be very convenient for
>> new
>> > users, but IMO perfectly fine for production-grade systems because they
>> > typically do complex multi-phase optimization anyway. For example,
>> recall
>> > how VoltDB decides whether to perform the join planning in the physical
>> > phase depending on the number of joins in the query [1].
>> >
>> > Regarding cardinality estimations, the main problem is that the built-in
>> > Calcite mechanism is pretty imprecise because they use mainly row count
>> and
>> > magic numbers. For example, in TPC-H queries, the cardinality
>> estimations
>> > easily reach billions and billions of rows. And even if the system has
>> more
>> > advanced cardinality estimators, such as histograms, the estimation
>> errors
>> > are likely to build up quickly still [2]. That said, it could be
>> difficult
>> > to design a robust heuristic to suppress particular joins. The
>> advantage of
>> > a straightforward cross-join suppression heuristic is that it is a good
>> > choice for a good fraction of real-world queries.
>> >
>> > To clarify, the proposal is not to make the cross-join suppression the
>> > default behavior. Rather, we may add it as a configuration property to
>> join
>> > planning rules, that could be used by advanced users.
>> >
>> > Regards,
>> > Vladimir.
>> >
>> > [1]
>> >
>> >
>> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
>> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594
>> >
>> > сб, 13 февр. 2021 г. в 23:05, Julian Hyde <[email protected]>:
>> >
>> > > Two separate points.
>> > >
>> > > 1. I recall that there is an important class of plans that first forms
>> > the
>> > > Cartesian product of some very small relations, and then joins
>> everything
>> > > else to it. Does anyone else encounter these plans? Would these plans
>> be
>> > > unavailable if we made your proposed change?
>> > >
>> > > 2. If I join on a low cardinality attribute (eg customers to
>> suppliers on
>> > > state) the result is almost as bad a cross join. I presume you would
>> want
>> > > to treat this join similarly. If so, it would make sense to have the
>> > rules
>> > > on selectivity (or similar) rather than structure alone.
>> > >
>> > > Julian
>> > >
>> > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov <[email protected]>
>> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Join order planning is an important optimization problem. The
>> > widely-used
>> > > > heuristic is to consider all bushy trees without cross-joins. There
>> is
>> > > > proof [1] that a pair of commute and associate rules is a complete
>> > > ruleset
>> > > > to explore all bushy trees without the cross-joins if the initial
>> join
>> > > tree
>> > > > has no cross-joins.
>> > > >
>> > > > In Apache Calcite, we do not suppress cross-products. As a result,
>> even
>> > > > simple join topologies, like a chain, cannot be planned in a
>> reasonable
>> > > > time for more than 6-7 tables. How do you feel if we add an optional
>> > > > cross-join suppression to the JoinAssociateRule, and possibly
>> > > > JoinPushThroughJoinRule?
>> > > >
>> > > > The cross-join suppression is not the only problem with the
>> exhaustive
>> > > join
>> > > > planning in Apache Calcite. But at the very least, it is simple to
>> > > > implement, and it extends the number of tables in a join that could
>> be
>> > > > planned exhaustively for some topologies by additional 2-5 tables.
>> > > >
>> > > > Regards,
>> > > > Vladimir.
>> > > >
>> > > > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
>> > >
>> >
>>
>

Reply via email to