Thanks, Vladimir, for this great summary. It is really helpful to know how the different projects use the optimizer, and it certainly helps to identify limitations in our implementation.
I cannot provide any valuable feedback at the moment since I have to find some time to read your analysis more carefully. In the meantime, I know that Hive has also been using Calcite for quite some time now, so maybe you can get some new ideas (or complete your background study) by looking at their code.

@Haisheng: I think many people did appreciate the discussion on pull-up traits, so I wouldn't say that we abandoned it. I had the impression that we were waiting for a design doc. In general, it may not be feasible to cover all use cases with a single optimizer. I wouldn't find it bad to introduce another planner if there are enough reasons to do so.

Best,
Stamatis

On Fri, Dec 6, 2019, 11:00 AM Vladimir Ozerov <[email protected]> wrote:

> "all we know is their *collations*" -> "all we know is their *traits*"
>
> Fri, Dec 6, 2019 at 12:57, Vladimir Ozerov <[email protected]>:
>
> > Hi Haisheng,
> >
> > Thank you for your response. Let me elaborate on my note on join planning first - what I was trying to say is not that the rules on their own have some deficiencies. What I meant is that with the current planner implementation, users tend to separate join planning from the core optimization process, like in the pseudo-code below. As a result, only one join permutation is considered during physical planning, even though the join rule may potentially generate multiple plans worth exploring:
> >
> >     RelNode optimizedLogicalNode = doJoinPlanning(logicalNode);
> >     RelNode physicalNode = doPhysicalPlanning(optimizedLogicalNode);
> >
> > Now back to the main question. I re-read your thread about on-demand trait propagation [1] carefully. I'd like to admit that when I was reading it for the first time about a month ago, I failed to understand some details due to poor knowledge of different optimizer architectures. Now I understand it much better, and we are definitely concerned with exactly the same problem. I feel that trait pull-up might be a step in the right direction; however, it seems to me that it is not the complete solution. Let me try to explain why I think so.
> >
> > An efficient optimizer should try to save as much CPU as possible, because that allows us to explore more plans in a sensible amount of time. To achieve that we should avoid redundant operations, and detect and prune inefficient paths aggressively. As far as I understand the idea of trait pull-up, we essentially explore the space of possible physical properties of children nodes without forcing their implementation. But after that, Calcite will explore those nodes again, now in order to execute implementation rules. I.e. we will do two dives - one to enumerate the nodes (trait pull-up API), and the other one to implement them (implementation rules), while in Cascades one dive should be sufficient, since exploration invokes the implementation rules as it goes. This is the first issue I see.
> >
> > The second one is more important - how to prune inefficient plans? Currently, nodes are implemented independently, and the lack of context doesn't allow us to estimate children's costs when implementing the parent, hence branch-and-bound is not possible. Can the trait pull-up API "List<RelTraitSet> deriveTraitSets(RelNode, RelMetadataQuery)" help us with this? If the children nodes are not implemented before the pull-up, all we know is their collations, but not their costs. And without costs, pruning is not possible. Please let me know if I missed something from the proposal.
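Illustrative sketch (not part of the original thread): to make the pruning concern above concrete, this is a minimal, hypothetical rendering of what the pull-up contract gives a parent versus what branch-and-bound would additionally need. The interface and the costLowerBound method are assumptions for illustration, not Calcite's API or the actual proposal; only the referenced types (RelNode, RelTraitSet, RelMetadataQuery, RelOptCost) are real Calcite classes.

    import java.util.List;
    import org.apache.calcite.plan.RelOptCost;
    import org.apache.calcite.plan.RelTraitSet;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.metadata.RelMetadataQuery;

    /** Hypothetical shape of the trait pull-up contract discussed above. */
    interface TraitPullUp {
      /**
       * What the on-demand trait proposal exposes: candidate physical trait sets a
       * child could be implemented with. No costs are attached, so the parent cannot
       * discard expensive candidates at this point.
       */
      List<RelTraitSet> deriveTraitSets(RelNode child, RelMetadataQuery mq);

      /**
       * What branch-and-bound would additionally need (hypothetical extension): a
       * lower bound on the cost of implementing the child with the given trait set,
       * so the parent can prune before the child is actually implemented.
       */
      RelOptCost costLowerBound(RelNode child, RelTraitSet traitSet, RelMetadataQuery mq);
    }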
> >
> > The possible architecture I had in mind after reading multiple papers, which may answer all our questions, could look like this:
> > 1) We have a queue of nodes requiring optimization, not a queue of rules. The initial queue state is formed from the initial tree, top-down.
> > 2) A node is popped from the queue, and we enter the "node.optimize(maxCost)" call. It checks for matching rules, prioritizes them, and starts their execution one by one. Execution of rules may re-insert the current node into the queue, in which case this step is repeated, possibly many times.
> > 3) Logical-logical rules (transformations) produce new logical nodes and put them into the queue for further optimization.
> > 4) Logical-physical rules (implementation) do the following:
> > 4.1) The costs of the logical children are estimated. The cost of a logical node should be less than the cost of any possible physical node that may be produced out of it. If the logical cost exceeds "maxCost", we stop and return. The whole logical subspace is pruned even before exploration.
> > 4.2) Recursively call the "childNode.optimize(maxCost - currentLogicalCost)" method, which returns a set of possible physical implementations of the child. The returned physical children are already registered in the proper set/subset, but are not used for any pattern-matching and don't trigger more rule calls!
> > 4.3) The implementation rule checks the cost of the physical child. If it is greater than that of any other already observed child with the same traits, or exceeds the "maxCost", it is discarded. Otherwise, the physical implementation of the current node is produced and registered in the optimizer.
> >
> > The pseudo-code for the physical implementation flow for a join (two inputs):
> >
> >     Collection<RelNode> optimizePhysical(Cost maxCost) {
> >         // Estimated minimal self-cost. Any physical implementation of this node should have a greater self-cost.
> >         Cost logicalSelfCost = optimizer.getCost(this);
> >
> >         // *Pruning #1*: whatever children we implement, the total cost will be greater than the passed maxCost, so do not explore further.
> >         Cost maxChildCost = maxCost - logicalSelfCost;
> >
> >         Cost logicalLeftCost = optimizer.getCost(leftLogicalNode);
> >         Cost logicalRightCost = optimizer.getCost(rightLogicalNode);
> >
> >         if (logicalLeftCost + logicalRightCost > maxChildCost) {
> >             return Collections.emptyList();
> >         }
> >
> >         // This is our equivalence set.
> >         RelSet equivalenceSet = this.getSet();
> >
> >         // Get promising physical implementations of the child nodes recursively.
> >         List<RelNode> leftPhysicalNodes = leftLogicalNode.optimizePhysical(maxChildCost);
> >         List<RelNode> rightPhysicalNodes = rightLogicalNode.optimizePhysical(maxChildCost);
> >
> >         for (RelNode leftPhysicalNode : leftPhysicalNodes) {
> >             for (RelNode rightPhysicalNode : rightPhysicalNodes) {
> >                 // *Pruning #2*: the combination of physical input costs is already too expensive, give up.
> >                 Cost physicalLeftCost = optimizer.getCost(leftPhysicalNode);
> >                 Cost physicalRightCost = optimizer.getCost(rightPhysicalNode);
> >
> >                 if (physicalLeftCost + physicalRightCost > maxChildCost) {
> >                     continue;
> >                 }
> >
> >                 // Implement possible physical nodes for the given pair of inputs (maybe more than one).
> >                 List<RelNode> physicalJoins = implement(leftPhysicalNode, rightPhysicalNode);
> >
> >                 for (RelNode physicalJoin : physicalJoins) {
> >                     // *Pruning #3*: do not consider the implementation if we already have another one with the same trait set and a smaller cost.
> >                     Cost physicalJoinCost = optimizer.getCost(physicalJoin);
> >                     Cost bestCostForTraitSet = equivalenceSet.getBestCost(physicalJoin.getTraitSet());
> >
> >                     if (physicalJoinCost > bestCostForTraitSet) {
> >                         continue;
> >                     }
> >
> >                     // This is a good implementation. Register it in the set, updating per-traitset best costs.
> >                     equivalenceSet.register(physicalJoin);
> >                 }
> >             }
> >         }
> >
> >         // Return the best registered expressions with different trait sets from the current set.
> >         return equivalenceSet.getBestExps();
> >     }
> >
> > This is very rough pseudo-code, only to demonstrate the basic idea of how proper bottom-up propagation not only helps us set the proper traits for the new physical node, but also ensures that non-optimal plans are pruned as early as possible. A real implementation should be better abstracted and accept enforcers as well.
> >
> > Also, please notice that the pseudo-code doesn't show when logical rules are fired. This is a separate question. One possible straightforward way is to add the aforementioned physical routine to the normal Volcano flow:
> > 1) Fire a logical rule on a node and register the new nodes
> > 2) Fire physical optimization as shown above, then invoke "call.transformTo()" for every returned physical rel
> > 3) Re-trigger the process for newly created nodes and their parents
> >
> > A better approach is to interleave logical and physical optimizations, so that they trigger each other recursively. But this would require a serious redesign of the "rule queue" concept.
> >
> > Does it have any common points with your proposal?
> >
> > Regards,
> > Vladimir.
> >
> > [1] https://ponymail-vm.apache.org/_GUI_/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E
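Illustrative sketch (not part of the original thread): one possible shape of the outer driver described in steps 1-3 of the message above, written in the same rough pseudo-code style as the message itself. The Cost type, the optimizePhysical() routine, and the helpers (initialTreeTopDown, matchingLogicalRules, applyRule, register, bestKnownCost) are placeholders in the spirit of that pseudo-code, not real Calcite APIs.

    // A queue of nodes requiring optimization, not a queue of rules (step 1).
    Deque<RelNode> queue = new ArrayDeque<>(initialTreeTopDown(root));

    while (!queue.isEmpty()) {
        RelNode node = queue.poll();

        // Steps 2-3: fire matching transformation (logical-logical) rules; every new
        // logical node is registered in the same equivalence set and enqueued for
        // further exploration, which may also re-enqueue 'node' itself.
        for (RelOptRule rule : matchingLogicalRules(node)) {
            for (RelNode transformed : applyRule(rule, node)) {
                register(transformed, node);
                queue.add(transformed);
            }
        }

        // Step 4: implement the node physically with the routine from the message
        // above, bounded by the best complete cost already known for its equivalence
        // set (infinite on the first visit), which enables branch-and-bound pruning.
        Cost budget = bestKnownCost(node.getSet());
        node.optimizePhysical(budget);
    }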
> > Fri, Dec 6, 2019 at 05:41, Haisheng Yuan <[email protected]>:
> >
> >> Oh, I forgot to mention that the join planning/reordering is not a big problem. Calcite just uses the rule to generate a single alternative plan, which is not ideal. But we can't say Calcite is doing anything wrong.
> >>
> >> Ideally, we would expect it to generate multiple (neither all, nor a single one) bipartite graphs. The join reordering rule would recursively cut each part into bipartite graphs and apply the JoinCommutativity rule to generate more alternatives for each RelSet. It is just a different strategy. We can modify the rule, or create a new join reordering rule, to generate multiple plan alternatives.
> >>
> >> - Haisheng
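Illustrative sketch (not part of the original thread): one low-effort way to get more than a single ordering into the search space with the existing rules is simply to register several of Calcite's join-reordering rules, so that each contributes its own alternative to the same RelSet. The rule constants below are the static fields as of roughly the Calcite version contemporary with this thread (~1.21); newer releases expose the same rules through CoreRules. This does not implement the multiple-bipartite-cut strategy described above; a modified or new rule would still be needed for that.

    import org.apache.calcite.plan.RelOptPlanner;
    import org.apache.calcite.rel.rules.JoinCommuteRule;
    import org.apache.calcite.rel.rules.JoinToMultiJoinRule;
    import org.apache.calcite.rel.rules.LoptOptimizeJoinRule;
    import org.apache.calcite.rel.rules.MultiJoinOptimizeBushyRule;

    final class JoinReorderingRules {
      /** Registers several reordering rules so the planner sees more than one join order. */
      static void register(RelOptPlanner planner) {
        planner.addRule(JoinToMultiJoinRule.INSTANCE);        // collapse a join tree into a MultiJoin
        planner.addRule(LoptOptimizeJoinRule.INSTANCE);       // emits a single ordering chosen heuristically
        planner.addRule(MultiJoinOptimizeBushyRule.INSTANCE); // emits a single, possibly bushy, ordering
        planner.addRule(JoinCommuteRule.INSTANCE);            // adds commuted variants of two-way joins
      }
    }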
> >> ------------------------------------------------------------------
> >> From: Haisheng Yuan <[email protected]>
> >> Date: 2019-12-06 09:07:43
> >> To: Vladimir Ozerov <[email protected]>; [email protected] ([email protected]) <[email protected]>
> >> Subject: Re: Re: Volcano's problem with trait propagation: current state and future
> >>
> >> Generally agree with what Vladimir said. I think what Calcite has is logical optimization or exploration; there is almost no physical optimization, Calcite leaves it to third-party implementors. One of my friends in the University of Wisconsin-Madison database research group told me that they gave up the idea of using Calcite in their project for this reason.
> >>
> >> Currently, physical properties are requested in implementation rules, or even logical exploration rules. But each rule is independent, and the pattern-matched expression is not aware of what the parent operator wants. Using AbstractConverter doesn't help, and is not promising.
> >>
> >> >> You shouldn't register all logical rules in the planner simultaneously,... as Drill does.
> >> That is because Calcite does too much redundant or duplicate rule matching, like all kinds of transpose rules (which can't be avoided due to the current design), and matching of physical operators.
> >>
> >> >> decoupling the logical planning from the physical one looks a bit weird to me because it violates the idea of Cascades framework.
> >> The Orca optimizer fully adopted the design principles of the Cascades framework, except that it separates the work into 3 phases: logical exploration, physical implementation, and optimization (property enforcing). And it might be easier to enable parallel optimization by separating into 3 phases. Orca does branch-and-bound in the optimization phase, before actual property derivation and enforcement, IIRC. It is highly efficient, works pretty well, and is battle-tested by many large financial and insurance companies.
> >>
> >> In my last thread about on-demand trait requests, I gave a high-level general API for physical operators to derive and require physical properties, which is similar to Orca's design. But it seems like the proposal for the API change gets no love.
> >>
> >> - Haisheng
> >>
> >> ------------------------------------------------------------------
> >> From: Vladimir Ozerov <[email protected]>
> >> Date: 2019-12-05 22:22:43
> >> To: [email protected] ([email protected]) <[email protected]>
> >> Subject: Re: Volcano's problem with trait propagation: current state and future
> >>
> >> AbstractConverter-s are attractive because they effectively emulate straightforward recursive top-down optimization (Volcano/Cascades). But instead of doing it with a recursive method call, which preserves the context, we do this in Calcite as a sequence of unrelated rule calls, thus losing the context. So with my current understanding, it could be thought of not as a search space explosion, but rather as an inefficient implementation of an otherwise straightforward parent->child->parent navigation, since we achieve this navigation indirectly through the rule queue rather than through a normal method call. In any case, the net result is wasted CPU. Perhaps this is not exponential waste, but some multiplication of otherwise optimal planning. As I mentioned, in our experiments with TPC-H we observed a constant factor of 6-9x between the number of joins and the number of join implementation rule invocations. It doesn't grow past 9 even for complex queries, so I hope that this is not an exponent :-)
> >>
> >> Speaking of logical vs physical optimization, IMO it makes sense in some cases. E.g. when doing predicate pushdown, you do not want to consider intermediate logical tree states for implementation until the predicate reaches its final position. So running a separate logical planning phase with the Volcano optimizer makes total sense to me, because it effectively prunes a lot of non-optimal logical plans before they reach the physical planning stage. The real problem to me is that we are forced to remove join planning from the physical optimization stage. The goal of join planning is not to generate a single optimal plan, as with predicate pushdown, but rather to generate a set of logical plans, all of which should be implemented and estimated. And with AbstractConverter-s this is not possible, because their multiplier increases the rate of search space growth, making join planning inapplicable even for a small number of relations. So we have to move join planning to the logical planning stage and pick only one permutation for physical planning.
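Illustrative sketch (not part of the original thread): a separate logical phase of the kind described above is typically wired up in Calcite with a HepPlanner that runs a restricted set of logical rules before the cost-based (Volcano) phase, which is essentially the multi-phase approach Drill takes. The specific rule constants below are illustrative; their exact names vary across Calcite versions.

    import org.apache.calcite.plan.hep.HepPlanner;
    import org.apache.calcite.plan.hep.HepProgram;
    import org.apache.calcite.plan.hep.HepProgramBuilder;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.rules.FilterJoinRule;
    import org.apache.calcite.rel.rules.FilterProjectTransposeRule;

    final class LogicalPhase {
      /** Heuristic pre-phase: push predicates down before any physical planning. */
      static RelNode pushDownPredicates(RelNode logicalRoot) {
        HepProgram program = new HepProgramBuilder()
            .addRuleInstance(FilterJoinRule.FILTER_ON_JOIN)        // rule field names as of ~Calcite 1.21
            .addRuleInstance(FilterProjectTransposeRule.INSTANCE)
            .build();
        HepPlanner hepPlanner = new HepPlanner(program);
        hepPlanner.setRoot(logicalRoot);
        // Only the final tree, with predicates already pushed down, is handed to the
        // physical (Volcano) stage; intermediate logical states are never implemented.
        return hepPlanner.findBestExp();
      }
    }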
> >> Thu, Dec 5, 2019 at 15:35, Roman Kondakov <[email protected]>:
> >>
> >> > Vladimir,
> >> >
> >> > thank you for bringing it up. We are facing the same problems in the Apache Ignite project, and it would be great if the Apache Calcite community could propose a solution for this issue.
> >> >
> >> > From my point of view an approach with abstract converters looks more promising, but as you mentioned it suffers from polluting the search space. The latter can be mitigated by splitting the planning stage into several phases: you shouldn't register all logical rules in the planner simultaneously - it looks like it is better to have several iterations of the planning stage with different sets of rules, as Drill does.
> >> >
> >> > Also I'd like to mention that decoupling the logical planning from the physical one looks a bit weird to me because it violates the idea of the Cascades framework. Possibly this decoupling is the consequence of some performance issues.
> >> >
> >> > --
> >> > Kind Regards
> >> > Roman Kondakov
> >> >
> >> > On 05.12.2019 14:24, Vladimir Ozerov wrote:
> >> > > Hi,
> >> > >
> >> > > As I mentioned before, we are building a distributed SQL engine that uses Apache Calcite for query optimization. The key problem we faced is the inability to pull the physical traits of child relations efficiently. I'd like to outline my understanding of the problem (I guess it was already discussed multiple times), ask the community to prove or disprove the existence of that problem and its severity for the products which use Apache Calcite, and ask for ideas on how it could be improved in the future.
> >> > >
> >> > > I'll start with a simplified problem description and mention more complex use cases afterwards. Consider that we have a logical tree and a set of implementation rules. Our goal is to find the optimal physical tree by applying these rules. The classical Cascades-based approach directs the optimization process from the top to the bottom (hence "top-down"). However, the actual implementation of tree nodes still happens bottom-up. For the tree L1 <- L2, we enter "optimize(L1)", which recursively delegates to "optimize(L2)".
> >> > > We then implement the children nodes, L1 <- [P2', P2''], and return back to the parent, which is now able to pick promising implementations of the children nodes and reject bad ones with the branch-and-bound approach. AFAIK Pivotal's Orca works this way.
> >> > >
> >> > > Apache Calcite is very different because it doesn't allow such recursion, so we lose the context of which node requested the child transformation. This loss of context leads to the following problems:
> >> > > 1) The parent node cannot deduce its physical properties during the execution of the implementation rule, because Calcite expects the transformation to be applied before the children nodes are implemented. That is, if we are optimizing LogicalProject <- LogicalScan, we cannot set the proper distribution and collation for the to-be-created PhysicalProject, because it depends on the distribution and collation of the children, which are yet to be resolved.
> >> > > 2) Branch-and-bound cannot be used because it requires at least one fully-built physical subtree.
> >> > >
> >> > > As a result of this limitation, products which rely on Apache Calcite for query optimization use one or several workarounds:
> >> > >
> >> > > *1) Guess the physical properties of parent nodes before logical children are implemented*
> >> > > *Apache Flink* uses this strategy. The strategy is bad because the number of combinations of traits grows exponentially with the number of attributes in the given RelNode, so you either explode the search space or give up optimization opportunities. Consider the following tree:
> >> > > LogicalSort[a ASC] <- LogicalFilter <- LogicalScan
> >> > > The optimal implementation of the LogicalFilter is PhysicalFilter[collation=a ASC] because it may eliminate the parent sort. But such an optimization should happen only if we know that there is a physical implementation of the scan allowing for this sort order, e.g. PhysicalIndexScan[collation=a ASC]. I.e. we need to know the child's physical properties first. Otherwise we fall back to speculative approaches. With the *optimistic* approach, we emit all possible combinations of physical properties, with the hope that the child will satisfy some of them, thus expanding the search space exponentially. With the *pessimistic* approach, we just miss this optimization opportunity even if the index exists. Apache Flink uses the pessimistic approach.
> >> > >
> >> > > *2) Use AbstractConverters*
> >> > > *Apache Drill* uses this strategy. The idea is to "glue" logical and physical operators, so that the implementation of a physical child re-triggers the implementation rule of the logical parent.
> >> > > The flow is as follows:
> >> > > - Invoke the parent implementation rule - it either doesn't produce new physical nodes or produces non-optimized physical nodes (like in the Apache Flink case)
> >> > > - Invoke the children implementation rules and create physical children
> >> > > - Then converters kick in and re-trigger the parent implementation rule through the creation of an abstract converter
> >> > > - Finally, the parent implementation rule is fired again, and now it produces optimized node(s) since at least some of the physical distributions of the children nodes are implemented.
> >> > >
> >> > > Note that this is essentially a hack to simulate the Cascades flow! The problem is that AbstractConverters increase the complexity of planning because they do not have any context, so parent rules are just re-triggered blindly. Consider the optimization of the following tree:
> >> > > LogicalJoin <- [LogicalScan1, LogicalScan2]
> >> > > With the converter approach, the join implementation rule will be fired at least 3 times, while in reality one call should be sufficient. In our experiments with TPC-H queries, the join rule implemented that way is typically called 6-9 times more often than expected.
> >> > >
> >> > > *3) Transformations (i.e. logical optimization) are decoupled from implementation (i.e. physical optimization)*
> >> > > Normally, you would like to mix both logical and physical rules in a single optimization program, because it is required for proper planning. That is, you should consider both (Ax(BxC)) and ((AxB)xC) join orderings during physical optimization, because you do not know in advance which one will produce the better plan. But in some practical implementations of Calcite-based optimizers this is not the case, and join planning is performed as a separate HEP stage. Examples are Apache Drill and Apache Flink. I believe that the lack of a Cascades-style flow and of branch-and-bound is among the main reasons for this. At the very least for Apache Drill, since it uses converters, additional logical permutations would exponentially multiply the number of fired rules, which is already very big.
> >> > >
> >> > > Given all these problems, I'd like to ask the community to share current thoughts and ideas on the future improvement of the Calcite optimizer. One of the ideas being discussed in the community is "Pull-up Traits", which should allow the parent node to get the physical properties of the children nodes. But in order to do this, you effectively need to implement the children, which IMO makes this process indistinguishable from the classical recursive Cascades algorithm.
> >> > >
> >> > > Have you considered recursive transformations as an alternative solution to that problem? I.e. instead of trying to guess or pull the physical properties of non-existent physical nodes, go ahead and actually implement them directly from within the parent rule? This may resolve the problem with trait pull-up, as well as allow for branch-and-bound optimization.
> >> > >
> >> > > I would appreciate your feedback or any hints for future research.
> >> > > Regards,
> >> > > Vladimir.
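Illustrative sketch (not part of the original thread): the size of the guess space behind workaround 1 above ("guess the physical properties") can be seen by counting candidate collations alone. The snippet is plain Java with no Calcite dependency; real trait sets are larger still, since each collation can also vary in sort direction and is combined with distribution choices.

    final class CollationGuessCount {
      /**
       * Number of distinct non-empty collations (ordered lists of distinct fields,
       * ignoring sort direction) over n fields: sum over k = 1..n of n! / (n - k)!.
       */
      static long collations(int n) {
        long total = 0;
        for (int k = 1; k <= n; k++) {
          long ordered = 1;
          for (int i = 0; i < k; i++) {
            ordered *= (n - i);
          }
          total += ordered;
        }
        return total;
      }

      public static void main(String[] args) {
        // Prints 1, 4, 15, 64, 325, ... : the guess space explodes well before 8 fields.
        for (int n = 1; n <= 8; n++) {
          System.out.println(n + " fields -> " + collations(n) + " candidate collations");
        }
      }
    }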
