Hi Julian,

First of all, I want to clarify that if we decide to do the cross-join
suppression, we would have to do that for JoinCommuteRule either. I missed
that in the original email.

Now back to your comments. The proposal assumes that we start with a
connected join graph without the cross-products and never generate other
cross-products. This covers many common queries, but not all. First, as you
mentioned, some queries may potentially benefit from cross-products. I do
not have specific queries in mind, but I cannot prove that such queries
don't exist either. Second, we may have cross-products in an otherwise
connected join graph if the join condition is located in the WHERE clause.
For such queries, we may need to do some pre-processing, like filter
pushdown. Therefore, I assume that real-world products would have to do
some preliminary analysis and/or processing to decide whether it is safe to
use the cross-join suppression. This might not be very convenient for new
users, but IMO perfectly fine for production-grade systems because they
typically do complex multi-phase optimization anyway. For example, recall
how VoltDB decides whether to perform the join planning in the physical
phase depending on the number of joins in the query [1].

Regarding cardinality estimations, the main problem is that the built-in
Calcite mechanism is pretty imprecise because they use mainly row count and
magic numbers. For example, in TPC-H queries, the cardinality estimations
easily reach billions and billions of rows. And even if the system has more
advanced cardinality estimators, such as histograms, the estimation errors
are likely to build up quickly still [2]. That said, it could be difficult
to design a robust heuristic to suppress particular joins. The advantage of
a straightforward cross-join suppression heuristic is that it is a good
choice for a good fraction of real-world queries.

To clarify, the proposal is not to make the cross-join suppression the
default behavior. Rather, we may add it as a configuration property to join
planning rules, that could be used by advanced users.

Regards,
Vladimir.

[1]
https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
[2] https://dl.acm.org/doi/10.14778/2850583.2850594

сб, 13 февр. 2021 г. в 23:05, Julian Hyde <[email protected]>:

> Two separate points.
>
> 1. I recall that there is an important class of plans that first forms the
> Cartesian product of some very small relations, and then joins everything
> else to it. Does anyone else encounter these plans? Would these plans be
> unavailable if we made your proposed change?
>
> 2. If I join on a low cardinality attribute (eg customers to suppliers on
> state) the result is almost as bad a cross join. I presume you would want
> to treat this join similarly. If so, it would make sense to have the rules
> on selectivity (or similar) rather than structure alone.
>
> Julian
>
> > On Feb 12, 2021, at 23:04, Vladimir Ozerov <[email protected]> wrote:
> >
> > Hi,
> >
> > Join order planning is an important optimization problem. The widely-used
> > heuristic is to consider all bushy trees without cross-joins. There is
> > proof [1] that a pair of commute and associate rules is a complete
> ruleset
> > to explore all bushy trees without the cross-joins if the initial join
> tree
> > has no cross-joins.
> >
> > In Apache Calcite, we do not suppress cross-products. As a result, even
> > simple join topologies, like a chain, cannot be planned in a reasonable
> > time for more than 6-7 tables. How do you feel if we add an optional
> > cross-join suppression to the JoinAssociateRule, and possibly
> > JoinPushThroughJoinRule?
> >
> > The cross-join suppression is not the only problem with the exhaustive
> join
> > planning in Apache Calcite. But at the very least, it is simple to
> > implement, and it extends the number of tables in a join that could be
> > planned exhaustively for some topologies by additional 2-5 tables.
> >
> > Regards,
> > Vladimir.
> >
> > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
>

Reply via email to