Hi,
Apache Calcite supports heterogeneous optimization, where nodes may have
different conventions. The Enumerable rules propagate all traits from their
inputs. We are not sure that this is correct.
Consider the following initial plan, which was created by Apache Calcite
after sql-to-rel conversion and invocation of TranslatableTable.toRel. The
table is in the CUSTOM convention. In this convention, there is an
additional Distribution trait that tracks which attribute is used for
sharding. It could be either SHARDED or ANY. The latter is the default
distribution value which is used when the distribution is unknown. Suppose
that the table is distributed by the attribute $0.
LogicalProject [convention=NONE, distribution=ANY]
  CustomTable [convention=CUSTOM, distribution=SHARDED($0)]
Now suppose that we run VolcanoPlanner with two rules: EnumerableProjectRule
and a converter rule that translates a CUSTOM node into an ENUMERABLE node.
First, the EnumerableProjectRule is executed. This rule propagates traits
from the input, replacing only the convention. Notice how it propagated the
distribution trait.
EnumerableProject [convention=ENUMERABLE, distribution=SHARDED($0)]
  CustomTable [convention=CUSTOM, distribution=SHARDED($0)]
Next, the converter will be invoked, yielding the following final plan:
EnumerableProject [convention=ENUMERABLE, distribution=SHARDED($0)]
  CustomToEnumerable [convention=ENUMERABLE, distribution=???]
    CustomTable [convention=CUSTOM, distribution=SHARDED($0)]
There are two problems here. First, the project operator potentially
destroys any trait which depends on column order, such as distribution or
collation. Therefore, EnumerableProject has an incorrect value of the
distribution trait.
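To illustrate the first problem, here is a minimal sketch in plain Java (no Calcite classes) of what happens to sharding keys when they pass through a project. The class ProjectDistributionSketch, the method mapShardingKeys, and the encoding of a project as a list of forwarded input columns are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Calcite. */
final class ProjectDistributionSketch {

    /**
     * Maps sharding keys from input column positions to output column positions.
     *
     * @param inputKeys       sharding key positions in the input row
     * @param projectedInputs for each output column, the input column it forwards
     *                        unchanged, or -1 if it is a computed expression
     * @return the keys in output positions, or empty if any key is not forwarded
     *         (the distribution of the project output is then unknown)
     */
    static Optional<List<Integer>> mapShardingKeys(List<Integer> inputKeys,
                                                   List<Integer> projectedInputs) {
        List<Integer> outputKeys = new ArrayList<>();
        for (Integer key : inputKeys) {
            int position = projectedInputs.indexOf(key);
            if (position < 0) {
                return Optional.empty(); // key column dropped by the project
            }
            outputKeys.add(position);
        }
        return Optional.of(outputKeys);
    }
}
```

With an input sharded by $0, a project that emits ($1, $0) moves the key to position 1, and a project that emits ($1, $2) loses it entirely. In both cases, copying SHARDED($0) from the input unchanged would be wrong.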
Second, which distribution should I assign to the CustomToEnumerable node?
Since I know that the parent convention cannot handle the distribution
properly, my natural thought is to set it to ANY. However, at least in the
top-down optimizer, this will lead to a CannotPlanException unless I declare
that [ANY satisfies SHARDED($0)], which is not the case: ANY is an unknown
distribution, so every distribution satisfies ANY, but not vice versa.
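To make the asymmetry concrete, here is a minimal sketch in plain Java of the "satisfies" relation I have in mind. DistributionSketch is a hypothetical stand-in for our custom trait, not a Calcite class, and matching SHARDED distributions by exact key equality is a simplifying assumption.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for a distribution trait; not a Calcite class. */
final class DistributionSketch {
    /** Unknown distribution, used as the default value. */
    static final DistributionSketch ANY = new DistributionSketch(null);

    /** Sharding key positions, or null when the distribution is unknown. */
    private final List<Integer> keys;

    private DistributionSketch(List<Integer> keys) {
        this.keys = keys;
    }

    static DistributionSketch sharded(Integer... keys) {
        return new DistributionSketch(List.of(keys));
    }

    /** True if a node with this distribution can be used where {@code required} is requested. */
    boolean satisfies(DistributionSketch required) {
        if (required.keys == null) {
            return true; // every distribution satisfies ANY
        }
        // ANY has keys == null, so it never satisfies a SHARDED requirement.
        return Objects.equals(keys, required.keys);
    }
}
```

Under this definition, a CustomToEnumerable node with distribution ANY cannot serve a parent that requests SHARDED($0), which is the situation that ends in the CannotPlanException described above.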
My question is: shouldn't we ensure that only the collation trait is
propagated from child nodes in Enumerable rules? For example, in the
EnumerableProjectRule, instead of doing:
input.getTraitSet()
.replace(EnumerableConvention.INSTANCE)
.replace(<newCollation>)
we may do:
RelOptCluster.traitSet()
.replace(EnumerableConvention.INSTANCE)
.replace(<newCollation>)
This would ensure that all other traits are set to the default value. The
generalization of this idea is that every convention has a set of supported
traits. Every unsupported trait should be set to the default value.
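A rough sketch of that generalization in plain Java follows. Trait sets are modeled as simple name-to-value maps, and TraitResetSketch, deriveTraits, and the "supported" set are hypothetical names for this example; none of this is Calcite API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of per-convention trait support; not part of Calcite. */
final class TraitResetSketch {

    /**
     * Builds the trait set for a new node in {@code newConvention}.
     * Starts from the default values, copies from the input only those traits
     * the convention supports, and then sets the convention itself.
     */
    static Map<String, String> deriveTraits(Map<String, String> inputTraits,
                                            Map<String, String> defaults,
                                            Set<String> supported,
                                            String newConvention) {
        Map<String, String> result = new LinkedHashMap<>(defaults);
        for (String name : supported) {
            if (inputTraits.containsKey(name)) {
                result.put(name, inputTraits.get(name));
            }
        }
        result.put("convention", newConvention);
        return result;
    }
}
```

If ENUMERABLE declares only collation as supported, an input with distribution SHARDED($0) yields a node whose distribution is reset to ANY while its collation is kept. A project would still need to remap the collation to the new column order, as in the <newCollation> step above.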
I would appreciate your feedback on the matter.
Regards,
Vladimir.