"all we know is their *collations*" -> "all we know is their *traits*"

On Fri, Dec 6, 2019 at 12:57, Vladimir Ozerov <[email protected]> wrote:

> Hi Haisheng,
>
> Thank you for your response. Let me elaborate on my note on join planning
> first - what I was trying to say is not that the rules on their own have
> some deficiencies. What I meant is that with the current planner
> implementation, users tend to separate join planning from the core
> optimization process, as in the pseudo-code below. As a result, only one
> join permutation is considered during physical planning, even though the
> join rule may potentially generate multiple plans worth exploring:
>
> RelNode optimizedLogicalNode = doJoinPlanning(logicalNode);
> RelNode physicalNode = doPhysicalPlanning(optimizedLogicalNode);
>
> Now back to the main question. I re-read your thread about on-demand trait
> propagation [1] carefully. I have to admit that when I was reading it for
> the first time, about a month ago, I failed to understand some details due
> to poor knowledge of different optimizer architectures. Now I understand it
> much better, and we are definitely concerned with exactly the same problem.
> I feel that trait pull-up might be a step in the right direction; however,
> it seems to me that it is not the complete solution. Let me try to explain
> why I think so.
>
> An efficient optimizer should try to save CPU as much as possible, because
> that allows us to explore more plans in a sensible amount of time. To
> achieve this we should avoid redundant operations, and detect and prune
> inefficient paths aggressively. As far as I understand the idea of trait
> pull-up, we essentially explore the space of possible physical properties
> of children nodes without forcing their implementation. But after that,
> Calcite will explore those nodes again, this time to execute implementation
> rules. I.e. we will do two dives - one to enumerate the nodes (trait
> pull-up API), and the other to implement them (implementation rules),
> while in Cascades one dive should be sufficient, since exploration invokes
> the implementation rules as it goes. This is the first issue I see.
>
> The second one is more important - how do we prune inefficient plans?
> Currently, nodes are implemented independently, and the lack of context
> doesn't allow us to estimate children's costs when implementing the parent,
> hence branch-and-bound is not possible. Can the trait pull-up API
> "List<RelTraitSet> deriveTraitSets(RelNode, RelMetadataQuery)" help us with
> this? If the children nodes are not implemented before the pull-up, all we
> know is their collations, but not their costs. And without costs, pruning
> is not possible. Please let me know if I missed something from the proposal.
>
> The possible architecture I had in mind after reading multiple papers,
> which may answer all our questions, could look like this:
> 1) We have a queue of nodes requiring optimization, not a queue of rules.
> The initial queue state is formed from the initial tree, top-down.
> 2) A node is popped from the queue, and we enter the
> "node.optimize(maxCost)" call. It checks for matching rules, prioritizes
> them, and starts their execution one by one. Execution of rules may
> re-insert the current node into the queue, in which case this step is
> repeated, possibly many times.
> 3) Logical-logical rules (transformations) produce new logical nodes and
> put them into the queue for further optimization
> 4) Logical-physical rules (implementation) do the following:
> 4.1) The costs of logical children are estimated. The cost of a logical
> node should be less than the cost of any possible physical node that may be
> produced out of it. If the logical cost exceeds "maxCost", we stop and
> return. The whole logical subspace is pruned even before exploration.
> 4.2) Recursively call the "childNode.optimize(maxCost - currentLogicalCost)"
> method, which returns a set of possible physical implementations of a
> child. The returned physical children are already registered in the proper
> set/subset, but are not used for any pattern-matching, and don't trigger
> more rule calls!
> 4.3) Implementation rule checks the cost of the physical child. If it is
> greater than any other already observed child with the same traits, or
> exceeds the "maxCost", it is discarded. Otherwise, the physical
> implementation of the current node is produced and registered in the
> optimizer.
>
> The pseudocode for the physical implementation flow for a join (two inputs):
>
> List<RelNode> optimizePhysical(Cost maxCost) {
>     // Estimated minimal self-cost. Any physical implementation of this
>     // node should have a greater self-cost.
>     Cost logicalSelfCost = optimizer.getCost(this);
>
>     // *Pruning #1*: whatever children we implement, the total cost will
>     // be greater than the passed maxCost, so do not explore further.
>     Cost maxChildCost = maxCost - logicalSelfCost;
>
>     Cost logicalLeftCost = optimizer.getCost(leftLogicalNode);
>     Cost logicalRightCost = optimizer.getCost(rightLogicalNode);
>
>     if (logicalLeftCost + logicalRightCost > maxChildCost) {
>         return Collections.emptyList();
>     }
>
>     // This is our equivalence set.
>     RelSet equivalenceSet = this.getSet();
>
>     // Get promising physical implementations of child nodes recursively.
>     List<RelNode> leftPhysicalNodes =
>         leftLogicalNode.optimizePhysical(maxChildCost);
>     List<RelNode> rightPhysicalNodes =
>         rightLogicalNode.optimizePhysical(maxChildCost);
>
>     for (RelNode leftPhysicalNode : leftPhysicalNodes) {
>         for (RelNode rightPhysicalNode : rightPhysicalNodes) {
>             // *Pruning #2*: the combination of physical input costs is
>             // already too expensive, give up.
>             Cost physicalLeftCost = optimizer.getCost(leftPhysicalNode);
>             Cost physicalRightCost = optimizer.getCost(rightPhysicalNode);
>
>             if (physicalLeftCost + physicalRightCost > maxChildCost) {
>                 continue;
>             }
>
>             // Implement possible physical nodes for the given pair of
>             // inputs (maybe more than one).
>             List<RelNode> physicalJoins =
>                 implement(leftPhysicalNode, rightPhysicalNode);
>
>             for (RelNode physicalJoin : physicalJoins) {
>                 // *Pruning #3*: do not consider an implementation if we
>                 // already have one with the same trait set and a smaller cost.
>                 Cost physicalJoinCost = optimizer.getCost(physicalJoin);
>                 Cost bestCostForTraitSet =
>                     equivalenceSet.getBestCost(physicalJoin.getTraitSet());
>
>                 if (physicalJoinCost > bestCostForTraitSet) {
>                     continue;
>                 }
>
>                 // This is a good implementation. Register it in the set,
>                 // updating per-traitset best costs.
>                 equivalenceSet.register(physicalJoin);
>             }
>         }
>     }
>
>     // Return the best registered expressions with different trait sets
>     // from the current set.
>     return equivalenceSet.getBestExps();
> }
>
> This is very rough pseudo-code, only to demonstrate the basic idea of
> how proper bottom-up propagation not only helps us set proper traits for
> the new physical node, but also ensures that suboptimal plans are pruned
> as early as possible. A real implementation should be better abstracted
> and accept enforcers as well.
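For concreteness, the rough pseudo-code above can be condensed into a tiny self-contained toy. Everything here is invented for illustration (a bare `double` cost, made-up `Input`/`Physical` types and numbers) and is not Calcite's or Orca's API; it only shows the three prunings on a two-input join:

```java
import java.util.*;

public class BranchAndBoundToy {
    // A candidate physical implementation of an input: its trait and cost.
    record Physical(String trait, double cost) {}
    // A "logical" input: a lower-bound cost plus its physical candidates.
    record Input(double logicalLowerBound, List<Physical> candidates) {}

    static int implementationsTried = 0;

    // Returns the best total cost per trait combination for a two-input
    // join, pruning with logical lower bounds before touching candidates.
    static Map<String, Double> optimizeJoin(Input left, Input right,
                                            double maxCost, double joinSelfCost) {
        Map<String, Double> bestPerTraits = new HashMap<>();
        double maxChildCost = maxCost - joinSelfCost;
        // Pruning #1: even the cheapest possible children are too expensive.
        if (left.logicalLowerBound() + right.logicalLowerBound() > maxChildCost) {
            return bestPerTraits;
        }
        for (Physical l : left.candidates()) {
            for (Physical r : right.candidates()) {
                // Pruning #2: this concrete pair of inputs is too expensive.
                if (l.cost() + r.cost() > maxChildCost) continue;
                implementationsTried++;
                double total = joinSelfCost + l.cost() + r.cost();
                String traits = l.trait() + "/" + r.trait();
                // Pruning #3: keep only the best cost per trait set.
                bestPerTraits.merge(traits, total, Math::min);
            }
        }
        return bestPerTraits;
    }

    public static void main(String[] args) {
        Input left = new Input(10,
            List.of(new Physical("HASH", 12), new Physical("SORTED", 40)));
        Input right = new Input(5, List.of(new Physical("HASH", 6)));
        Map<String, Double> best = optimizeJoin(left, right, 30, 3);
        System.out.println(best);                 // {HASH/HASH=21.0}
        System.out.println(implementationsTried); // 1
    }
}
```

With maxCost = 30 and a join self-cost of 3, the (SORTED, HASH) pair is rejected by pruning #2 before any join implementation is attempted, which is exactly the saving branch-and-bound is after.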
>
> Also, please notice that the pseudo-code doesn't show when logical rules
> are fired. This is a separate question. One possible straightforward way is
> to add the aforementioned physical routine to the normal Volcano flow:
> 1) Fire logical rule on a node and register new nodes
> 2) Fire physical optimization as shown above, then invoke
> "call.transformTo()" for every returned physical rel
> 3) Re-trigger the process for newly created nodes and their parents
>
> A better approach is to interleave logical and physical optimizations, so
> they trigger each other recursively. But this would require a serious
> redesign of the "rule queue" concept.
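The straightforward three-step flow above can be sketched as a worklist loop. All the types here are toy stand-ins (strings instead of RelNodes, trivial rules), not Calcite's RuleQueue or VolcanoPlanner API:

```java
import java.util.*;

public class VolcanoFlowSketch {
    static List<String> log = new ArrayList<>();

    public static void main(String[] args) {
        // The worklist holds nodes, not rules.
        Deque<String> queue = new ArrayDeque<>(List.of("LogicalJoin"));
        Set<String> seen = new HashSet<>(queue);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            // Step 1: logical rules may register new logical nodes,
            // which are enqueued for further optimization.
            for (String produced : logicalRules(node)) {
                if (seen.add(produced)) queue.add(produced);
            }
            // Step 2: physical optimization; each returned physical rel
            // would be handed to "call.transformTo()" in the real planner.
            for (String phys : optimizePhysical(node)) {
                log.add(node + " -> " + phys);
            }
        }
        // Expected order: LogicalJoin first, then the commuted alternative.
        log.forEach(System.out::println);
    }

    // Toy logical rule: commute the join once.
    static List<String> logicalRules(String node) {
        return node.equals("LogicalJoin")
            ? List.of("LogicalJoinCommuted") : List.of();
    }

    // Toy implementation rule: one physical alternative per logical node.
    static List<String> optimizePhysical(String node) {
        return List.of(node.replace("Logical", "Physical"));
    }
}
```

Interleaving, as suggested, would mean letting optimizePhysical itself re-enqueue logical work instead of running the two steps strictly in sequence.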
>
> Does it have any common points with your proposal?
>
> Regards,
> Vladimir.
>
> [1]
> https://ponymail-vm.apache.org/_GUI_/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E
>
>
> On Fri, Dec 6, 2019 at 05:41, Haisheng Yuan <[email protected]> wrote:
>
>> Oh, I forgot to mention that the join planning/reordering is not a big
>> problem. Calcite just uses a rule to generate a single alternative plan,
>> which is not ideal. But we can't say Calcite is doing it wrong.
>>
>> Ideally, we expect it to generate multiple (neither all, nor a single one)
>> bipartite graphs. The join reordering rule will cut each part into
>> bipartite graphs recursively and apply the JoinCommutativity rule to
>> generate more alternatives for each RelSet. It is just a different
>> strategy. We can modify the rule, or create a new join reordering rule, to
>> generate multiple plan alternatives.
>>
>> - Haisheng
>>
>> ------------------------------------------------------------------
>> From: Haisheng Yuan <[email protected]>
>> Date: December 6, 2019, 09:07:43
>> To: Vladimir Ozerov <[email protected]>; [email protected] (
>> [email protected]) <[email protected]>
>> Subject: Re: Re: Volcano's problem with trait propagation: current state
>> and future
>>
>> Generally agree with what Vladimir said. I think what Calcite has is
>> logical optimization, or exploration; there is almost no physical
>> optimization, which Calcite leaves to third-party implementors. One of my
>> friends at the University of Wisconsin–Madison database research group
>> told me that they gave up the idea of using Calcite in their project for
>> this reason.
>>
>> Currently physical properties are requested in implementation rules, or
>> even logical exploration rules, but each rule is independent; the
>> pattern-matched expression is not aware of what the parent operator
>> wants. Using AbstractConverter doesn't help, and is not promising.
>>
>> >> You shouldn't register all logical rules in the planner
>> simultaneously,... as Drill does.
>> That is because Calcite does too much redundant or duplicate rule
>> matching, like all kinds of transposes (which can't be avoided due to the
>> current design), and matching on physical operators.
>>
>> >> decoupling the logical planning from the physical one looks
>> a bit weird to me because it violates the idea of Cascades framework.
>> The Orca optimizer fully adopted the design principles of the Cascades
>> framework, except that it separates optimization into 3 phases: logical
>> exploration, physical implementation, and optimization (property
>> enforcing). And it might be easier to enable parallel optimization by
>> separating into 3 phases. Orca does branch-and-bound in the optimization
>> phase, before actual property derivation and enforcement, IIRC. It is
>> highly efficient, works pretty well, and is battle-tested by many large
>> financial and insurance companies.
>>
>> In my last thread about on-demand trait request, I gave the high-level
>> general API for physical operators to derive and require physical
>> properties, which is similar to Orca's design. But it seems the proposal
>> of the API change gets no love.
>>
>> - Haisheng
>>
>> ------------------------------------------------------------------
>> From: Vladimir Ozerov <[email protected]>
>> Date: December 5, 2019, 22:22:43
>> To: [email protected] ([email protected]) <
>> [email protected]>
>> Subject: Re: Volcano's problem with trait propagation: current state and future
>>
>> AbstractConverter-s are attractive because they effectively emulate
>> straightforward recursive top-down optimization (Volcano/Cascades). But
>> instead of doing it with a recursive method call, which preserves the
>> context, we do this in Calcite as a sequence of unrelated rule calls, thus
>> losing the context. So with my current understanding, it could be thought
>> of not as a search space explosion, but rather as an inefficient
>> implementation of an otherwise straightforward parent->child->parent
>> navigation, since we achieve this navigation indirectly through the rule
>> queue, rather than through a normal method call. In any case, the net
>> result is wasted CPU. Perhaps this is not exponential waste, but some
>> multiplication of otherwise optimal planning. As I mentioned, in our
>> experiments with TPC-H, we observed a constant factor of 6-9x between
>> the number of joins and the number of join implementation rule
>> invocations.
>> It doesn't grow past 9 even for complex queries, so I hope that this is
>> not an exponent :-)
>>
>> Speaking of logical vs physical optimization, IMO it makes sense in some
>> cases. E.g. when doing predicate pushdown, you do not want to consider
>> intermediate logical tree states for implementation, until the predicate
>> reaches its final position. So running separate logical planning phase
>> with
>> Volcano optimizer makes total sense to me, because it effectively prunes a
>> lot of not optimal logical plans before they reach the physical planning
>> stage. The real problem to me is that we are forced to remove join planning
>> from the physical optimization stage. Because the goal of join planning is
>> not
>> to generate a single optimal plan, like with predicate pushdown, but
>> rather
>> to generate a set of logical plans all of which should be implemented and
>> estimated. And with AbstractConverter-s this is not possible, because
>> their multiplier increases the rate of search space growth, making join
>> planning inapplicable even for the small number of relations. So we have
>> to
>> move them to the logical planning stage and pick only one permutation for
>> physical planning.
>>
>>
>> On Thu, Dec 5, 2019 at 15:35, Roman Kondakov <[email protected]> wrote:
>>
>> > Vladimir,
>> >
>> > thank you for bringing it up. We are facing the same problems in Apache
>> > Ignite project
>> > and it would be great if the Apache Calcite community could propose a
>> > solution for this
>> > issue.
>> >
>> > From my point of view an approach with abstract converters looks more
>> > promising, but as
>> > you mentioned it suffers from polluting the search space. The latter can
>> > be mitigated by
>> > splitting the planning stage into several phases: you shouldn't
>> > register all logical rules in the planner simultaneously - it looks like
>> > it is better to have several iterations of planning stage with different
>> > sets of rules, as Drill does.
>> >
>> > Also I'd like to mention that decoupling the logical planning from the
>> > physical one looks
>> > a bit weird to me because it violates the idea of Cascades framework.
>> > Possibly this decoupling is the consequence of some performance issues.
>> >
>> >
>> > --
>> > Kind Regards
>> > Roman Kondakov
>> >
>> > On 05.12.2019 14:24, Vladimir Ozerov wrote:
>> > > Hi,
>> > >
>> > > As I mentioned before, we are building a distributed SQL engine that
>> uses
>> > > Apache Calcite for query optimization. The key problem we faced is the
>> > > inability to pull the physical traits of child relations efficiently.
>> I'd
>> > > like to outline my understanding of the problem (I guess it was
>> already
>> > > discussed multiple times) and ask the community to prove or disprove
>> the
>> > > existence of that problem and its severity for the products which use
>> > > Apache Calcite and ask for ideas on how it could be improved in the
>> > future.
>> > >
>> > > I'll start with a simplified problem description and mention more
>> > > complex use cases later. Consider that we have a logical tree and a
>> set of
>> > > implementation rules. Our goal is to find the optimal physical tree by
>> > > applying these rules. The classical Cascades-based approach directs
>> the
>> > > optimization process from the top to the bottom (hence "top-down").
>> > > However, the actual implementation of tree nodes still happens
>> bottom-up.
>> > > For the tree L1 <- L2, we enter "optimize(L1)", which recursively
>> > delegates
>> > > to "optimize(L2)". We then implement children nodes L1 <- [P2', P2''],
>> > and
>> > > return back to the parent, which is now able to pick promising
>> > > implementations of the children nodes and reject bad ones with the
>> > > branch-and-bound approach. AFAIK Pivotal's Orca works this way.
>> > >
>> > > Apache Calcite is very different because it doesn't allow such
>> > recursion
>> > > so that we lose the context on which node requested the child
>> > > transformation. This loss of context leads to the following problems:
>> > > 1) The parent node cannot deduce its physical properties during the
>> > > execution of the implementation rule, because Calcite expects the
>> > > transformation to be applied before children nodes are implemented.
>> That
>> > is
>> > > if we are optimizing LogicalProject <- LogicalScan, we cannot set
>> proper
>> > > distribution and collation for the to be created PhysicalProject,
>> because
>> > > it depends on the distribution and collation of the children which is
>> yet
>> > > to be resolved.
>> > > 2) The branch-and-bound cannot be used because it requires at least
>> one
>> > > fully-built physical subtree.
>> > >
>> > > As a result of this limitation, products which rely on Apache Calcite
>> for
>> > > query optimization, use one or several workarounds:
>> > >
>> > > *1) Guess the physical properties of parent nodes before logical
>> children
>> > > are implemented*
>> > > *Apache Flink* uses this strategy. The strategy is bad because the
>> > > number of combinations of traits grows exponentially with the number
>> of
>> > > attributes in the given RelNode, so you either explode the search
>> space
>> > or
>> > > give up optimization opportunities. Consider the following tree:
>> > > LogicalSort[a ASC] <- LogicalFilter <- LogicalScan
>> > > The optimal implementation of the LogicalFilter is
>> > PhysicalFilter[collation=a
>> > > ASC] because it may eliminate the parent sort. But such optimization
>> > should
>> > > happen only if we know that there is a physical implementation of scan
>> > > allowing for this sort order, e.g. PhysicalIndexScan[collation=a ASC].
>> > I.e.
>> > > we need to know the child physical properties first. Otherwise we
>> > fallback
>> > > to speculative approaches. With the *optimistic* approach, we emit all
>> > > possible combinations of physical properties, with the hope that the
>> > child
>> > > will satisfy some of them, thus expanding the search space
>> exponentially.
>> > > With the *pessimistic* approach, we just miss this optimization
>> > opportunity
>> > > even if the index exists. Apache Flink uses the pessimistic approach.
>> > >
>> > > *2) Use AbstractConverters*
>> > > *Apache Drill* uses this strategy. The idea is to "glue" logical and
>> > > physical operators, so that implementation of a physical child
>> > re-triggers
>> > > implementation rule of a logical parent. The flow is as follows:
>> > > - Invoke the parent implementation rule - it either doesn't produce new
>> > > physical nodes or produces non-optimized physical nodes (like in the
>> > Apache
>> > > Flink case)
>> > > - Invoke children implementation rules and create physical children
>> > > - Then converters kick-in and re-trigger parent implementation rule
>> > through
>> > > the creation of an abstract converter
>> > > - Finally, the parent implementation rule is fired again and now it
>> > > produces optimized node(s) since at least some of the physical
>> > > distributions of children nodes are implemented.
>> > >
>> > > Note that this is essentially a hack to simulate the Cascades flow!
>> The
>> > > problem is that AbstractConverters increase the complexity of planning
>> > > because they do not have any context, so parent rules are just
>> > re-triggered
>> > > blindly. Consider the optimization of the following tree:
>> > > LogicalJoin <- [LogicalScan1, LogicalScan2]
>> > > With the converter approach, the join implementation rule will be
>> fired
>> > at
>> > > least 3 times, while in reality, one call should be sufficient. In our
>> > > experiments with TPC-H queries, the join rule implemented that way is
>> > > typically called 6-9 times more often than expected.
>> > >
>> > > *3) Transformations (i.e. logical optimization) are decoupled from
>> > > implementation (i.e. physical optimization)*
>> > > Normally, you would like to mix both logical and physical rules in a
>> > single
>> > > optimization program, because it is required for proper planning. That
>> > is,
>> > > you should consider both (Ax(BxC)) and ((AxB)xC) join ordering during
>> > > physical optimization, because you do not know which one will produce
>> the
>> > > better plan in advance.
>> > > But in some practical implementations of Calcite-based optimizers,
>> this
>> > is
>> > > not the case, and join planning is performed as a separate HEP stage.
>> > > Examples are Apache Drill and Apache Flink.
>> > > I believe that lack of Cascades-style flow and branch-and-bound are
>> among
>> > > the main reasons for this. At the very least for Apache Drill, since
>> it
>> > > uses converters, so additional logical permutations will exponentially
>> > > multiply the number of fired rules, which is already very big.
>> > >
>> > > Given all these problems I'd like to ask the community to share
>> current
>> > > thoughts and ideas on the future improvement of the Calcite optimizer.
>> > One
>> > > of the ideas being discussed in the community is "Pull-up Traits",
>> which
>> > > should allow the parent node to get physical properties of the
>> children
>> > > nodes. But in order to do this, you effectively need to implement
>> > children,
>> > > which IMO makes this process indistinguishable from the classical
>> > recursive
>> > > Cascades algorithm.
>> > >
>> > > Have you considered recursive transformations as an alternative
>> solution
>> > to
>> > > that problem? I.e. instead of trying to guess or pull the physical
>> > > properties of non-existent physical nodes, go ahead and actually
>> > implement
>> > > them directly from within the parent rule? This may resolve the
>> problem
>> > > with trait pull-up, as well as allow for branch-and-bound
>> optimization.
>> > >
>> > > I would appreciate your feedback or any hints for future research.
>> > >
>> > > Regards,
>> > > Vladimir.
>> > >
>> >
>>
>>
>>
