"all we know is their *collations*" -> "all we know is their *traits*"
пт, 6 дек. 2019 г. в 12:57, Vladimir Ozerov <[email protected]>: > Hi Haisheng, > > Thank you for your response. Let me elaborate my note on join planning > first - what I was trying to say is not that rules on their own have some > deficiencies. What I meant is that with current planner implementation, > users tend to separate join planning from the core optimization process > like this in the pseudo-code below. As a result, only one join permutation > is considered during physical planning, even though join rule may > potentially generate multiple plans worth exploring: > > RelNode optimizedLogicalNode = doJoinPlanning(logicalNode); > RelNode physicalNode = doPhysicalPlanning(optimizedLogicalNode); > > Now back to the main question. I re-read your thread about on-demand trait > propagation [1] carefully. I'd like to admit that when I was reading it for > the first time about a month ago, I failed to understand some details due > to poor knowledge of different optimizer architectures. Now I understand it > much better, and we definitely concerned with exactly the same problem. I > feel that trait pull-up might be a step in the right direction, however, it > seems to me that it is not the complete solution. Let me try to explain why > I think so. > > The efficient optimizer should try to save CPU as much as possible because > it allows us to explore more plans in a sensible amount of time. To achieve > that we should avoid redundant operations, and detect and prune inefficient > paths aggressively. As far as I understand the idea of trait pull-up, we > essentially explore the space of possible physical properties of children > nodes without forcing their implementation. But after that, the Calcite > will explore that nodes again, now in order to execute implementation > rules. I.e. we will do two dives - one to enumerate the nodes (trait > pull-up API), and the other one to implement them (implementation rules), > while in Cascades one dive should be sufficient since exploration invokes > the implementation rules as it goes. This is the first issue I see. > > The second one is more important - how to prune inefficient plans? > Currently, nodes are implemented independently and lack of context doesn't > allow us to estimates children's costs when implementing the parent, hence > branch-and-bound is not possible. Can trait pull-up API "List<RelTraitSet> > deriveTraitSets(RelNode, RelMetadataQuery)" help us with this? If the > children nodes are not implemented before the pull-up, all we know is their > collations, but not their costs. And without costs, pruning is not > possible. Please let me know if I missed something from the proposal. > > The possible architecture I had in mind after reading multiple papers, > which may answer all our questions, could look like this: > 1) We have a queue of nodes requiring optimization. Not a queue of rules. > initial queue state is formed from the initial tree, top-down. > 2) The node is popped from the queue, and we enter > "node.optimize(maxCost)" call. It checks for matching rules, prioritizes > them, and start their execution on by one. Execution of rules may re-insert > the current node into the queue, in which case this step is repeated, > possibly many times > 3) Logical-logical rules (transformations) produce new logical nodes and > put them into the queue for further optimization > 4) Logical-physical rules (implementation) do the following: > 4.1) Costs of logical children are estimated. The cost of a logical node > should be less than any cost of a possible physical node that may be > produced out of it. If the logical cost exceeds "maxCost", we stop and > return. The whole logical subspace is pruned even before exploration. > 4.2) Recursively call "childNode.optimize(maxCost - currentLogicalCost)" > method, which returns a set of possible physical implementations of a > child. Returned physical children are already registered in proper > set/subset, but are not used for any pattern-matching, and doesn't trigger > more rule calls! > 4.3) Implementation rule checks the cost of the physical child. If it is > greater than any other already observed child with the same traits, or > exceeds the "maxCost", it is discarded. Otherwise, the physical > implementation of the current node is produced and registered in the > optimizer. > > The pseudocode for physical implementation flow for join (two inputs): > > Collection<RelNode> optimizePhysical(Cost maxCost) { > // Estimated minimal self-cost. Any physical implementation of this > node should have greater self-cost > Cost logicalSelfCost = optimizer.getCost(this); > > // *Pruning #1*: whatever children we implement, the total cost will > be greater than the passed maxCost, so do not explore further > Cost maxChildCost = maxCost - logicalSelfCost; > > Cost logicalILeftCost = optimizer.getCost(leftLogicalNode); > Cost logicalRightCost = optimizer.getCost(rightLogicalNode); > > if (logicalLeftCost + logicalRightCost > maxChildCost) { > return; > } > > // This is our equivalence set. > RelSet equivalenceSet = this.getSet(); > > // Get promising physical implementations of child nodes recursively > List<RelNode> leftPhysicalNodes = > leftLogicalNode.optimizePhysical(maxChildCost); > List<RelNode> rightPhysicalNodes = > rightLigicalNode.optimizePhysical(maxChildCost); > > for (RelNode leftPhysicalNode : leftPhysicalNodes) { > for (RelNode rightPhysicalNode : rightPhysicalNodes) { > // *Pruning #2*: Combination of physical input costs is > already too expensive, give up > Cost physicalLeftCost = optimizer.getCost(leftPhysicalNode); > Cost physicalRightCost = optimizer.getCost(rightPhysicalNode); > > if (logicalILeftCost + logicalRightCost > maxChildCost) { > continue. > } > > // Implement possible physical nodes for the given pair of > inputs (maybe more than one) > List<RelNode> physicalJoins = implement(leftPhysicalNode, > rightPhysicalNode); > > for (RelOptRule physicalJoin : physicalJoins) { > // *Pruning #3*: Do not consider implementation if we have > another one with the same trait set and smaller cost) > Cost physicalJoinCost = optimizer.getCost(physicalJoin); > Cost bestCostForTraitSet = > equivalenceSet.getBestCost(physicalJoin.getTraitSet()); > > if (physicalJoinCost > bestCostForTraitSet) { > continue. > } > > // This is a good implementation. Register it in the set, > updating per-traitset best costs. > equivalenceSet.register(physicalJoin); > } > } > } > > // Return the best registered expressions with different traitsets > from the current set. > return equivalenceSet.getBestExps(); > } > > This is a very rough pseudo-code, only to demonstrate the basic idea on > how proper bottom-up propagation not only helps us set proper traits for > the new physical node but also ensures that not optimal plans are pruned as > early as possible. Real implementation should be better abstracted and > accept enforcers as well. > > Also, please notice that the pseudo-code doesn't show when logical rules > are fired. This is a separate question. One possible straightforward way is > to add the aforementioned physical routine to normal Volcano flow: > 1) Fire logical rule on a node and register new nodes > 2) Fire physical optimization as shown above, then invoke > "call.transformTo()" for every returned physical rel > 3) Re-trigger the process for newly created nodes and their parents > > A better approach is to interleave logical and physical optimizations, so > they trigger each other recursively. But this would require a serious > redesign of a "rule queue" concept. > > Does it have any common points with your proposal? > > Regards, > Vladimir. > > [1] > https://ponymail-vm.apache.org/_GUI_/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E > > > пт, 6 дек. 2019 г. в 05:41, Haisheng Yuan <[email protected]>: > >> Oh, I forgot to mention that the join planning/reordering is not a big >> problem. Calcite just use the rule to generate a single alternative plan, >> which is not ideal. But we can't say Calcite is doing wrong. >> >> Ideally, we expect it generates multiple (neither all, nor single) >> bipartie graphs. The join reordering rule will cut each part into bipartie >> recursively and apply JoinCommutativity rule to generate more alternatives >> for each RelSet. It is just a different strategy. We can modify the rule, >> or create new join reordering rule to generate multiple plan >> alternatives. >> >> - Haisheng >> >> ------------------------------------------------------------------ >> 发件人:Haisheng Yuan<[email protected]> >> 日 期:2019年12月06日 09:07:43 >> 收件人:Vladimir Ozerov<[email protected]>; [email protected] ( >> [email protected])<[email protected]> >> 主 题:Re: Re: Volcano's problem with trait propagation: current state and >> future >> >> Generally agree with what Vladimir said. I think what Calcite has is >> logical optimization or exploration, there are almost no physical >> optimization, Calcite leaves it to third party implementators. One of my >> friends at University of Wisconsin–Madison database research group told me >> that they gave up the idea of using Calcite in their project due to this >> reason. >> >> Currently physical properties are requested in implementation rules, or >> even logical exploration rules, But each rule is independent, the >> pattern-matched expression is not aware of what does the parent operator >> want. Using AbstractConverter doesn't help, and is not promising. >> >> >> You shouldn't regiester all logical rules in the planner >> simultaneously,... as Drill does. >> That is because Calcite does too many redundant or duplicate rule >> matching, like all kinds of transpose (can't be avoided due to current >> design), matching physical operators. >> >> >> decoupling the logical planning from the physical one looks >> a bit weird to me because it violates the idea of Cascades framework. >> Orca optimizer fully adopted the design principle of Cascades framework >> except that it separates into 3 phases: logical exploration, physical >> implementation, and optimization (property enforcing). And it might be >> easier if we want to enable parallel optimization by seperating into 3 >> phases. Orca does branch-and-bound in optimization phase, before actual >> property derivation and enforcement, IIRC. It is highly efficient, works >> pretty well, and battlefield-tested by many large financial and insurance >> companies. >> >> In my last thread about on-demand trait request, I gave the high-level >> general API for physical operators to derive and require physical >> properties, which is similar to Orca's design. But seems like the proposal >> of API change gets no love. >> >> - Haisheng >> >> ------------------------------------------------------------------ >> 发件人:Vladimir Ozerov<[email protected]> >> 日 期:2019年12月05日 22:22:43 >> 收件人:[email protected] ([email protected])< >> [email protected]> >> 主 题:Re: Volcano's problem with trait propagation: current state and future >> >> AbstractConverter-s are attractive because they effectively emulate >> straightforward recursive top-down optimization (Volcano/Cascades). But >> instead of doing it with a recursive method call, which preserves the >> context, we do this in Calcite as a sequence of unrelated rule calls, thus >> losing the context. So with my current understanding, it could be thought >> of not as a search space explosion, but rather than the inefficient >> implementation of an otherwise straightforward parent->child->parent >> navigation, since we achieve this navigation indirectly through the rule >> queue, rather than through a normal method call. In any case, the net >> result is wasted CPU. Perhaps this is not exponential waste, but some >> multiplication of otherwise optimal planning. As I mentioned, in our >> experiments with TPC-H, we observed a constant factor between 6-9x between >> the number of joins and the number of join implementation rule >> invocations. >> It doesn't growth past 9 even for complex queries, so I hope that this is >> not an exponent :-) >> >> Speaking of logical vs physical optimization, IMO it makes sense in some >> cases. E.g. when doing predicate pushdown, you do not want to consider >> intermediate logical tree states for implementation, until the predicate >> reaches its final position. So running separate logical planning phase >> with >> Volcano optimizer makes total sense to me, because it effectively prunes a >> lot of not optimal logical plans before they reach the physical planning >> stage. The real problem to me is that we forced to remove join planning >> from the physical optimization stage. Because the goal of join planning >> not >> to generate a single optimal plan, like with predicate pushdown, but >> rather >> to generate a set of logical plans all of which should be implemented and >> estimated. And with AbstractConverter-s this is not possible because of >> their multiplicator increases the rate of search space growth, making join >> planning inapplicable even for the small number of relations. So we have >> to >> move them to the logical planning stage and pick only one permutation for >> physical planning. >> >> >> чт, 5 дек. 2019 г. в 15:35, Roman Kondakov <[email protected]>: >> >> > Vladimir, >> > >> > thank you for bringing it up. We are facing the same problems in Apache >> > Ignite project >> > and it would be great if Apache Calcite community will propose a >> > solution for this >> > issue. >> > >> > From my point of view an approach with abstract converters looks more >> > promising, but as >> > you mentioned it suffers from polluting the search space. The latter can >> > be mitigated by >> > splitting a planning stage into the several phases: you shouldn't >> > register all logical rules in the planner simultaneously - it looks like >> > it is better to have several iterations of planning stage with different >> > sets of rules, as Drill does. >> > >> > Also I'd like to mention that decoupling the logical planning from the >> > physical one looks >> > a bit weird to me because it violates the idea of Cascades framework. >> > Possibly this decoupling is the consequence of some performance issues. >> > >> > >> > -- >> > Kind Regards >> > Roman Kondakov >> > >> > On 05.12.2019 14:24, Vladimir Ozerov wrote: >> > > Hi, >> > > >> > > As I mentioned before, we are building a distributed SQL engine that >> uses >> > > Apache Calcite for query optimization. The key problem we faced is the >> > > inability to pull the physical traits of child relations efficiently. >> I'd >> > > like to outline my understanding of the problem (I guess it was >> already >> > > discussed multiple times) and ask the community to prove or disprove >> the >> > > existence of that problem and its severity for the products which uses >> > > Apache Calcite and ask for ideas on how it could be improved in the >> > future. >> > > >> > > I'll start with the simplified problem description and mentioned more >> > > complex use cases then. Consider that we have a logical tree and a >> set of >> > > implementation rules. Our goal is to find the optimal physical tree by >> > > applying these rules. The classical Cascades-based approach directs >> the >> > > optimization process from the top to the bottom (hence "top-down"). >> > > However, the actual implementation of tree nodes still happens >> bottom-up. >> > > For the tree L1 <- L2, we enter "optimize(L1)", which recursively >> > delegates >> > > to "optimize(L2)". We then implement children nodes L1 <- [P2', P2''], >> > and >> > > return back to the parent, which is now able to pick promising >> > > implementations of the children nodes and reject bad ones with the >> > > branch-and-bound approach. AFAIK Pivotal's Orca works this way. >> > > >> > > The Apache Calcite is very different because it doesn't allow the >> > recursion >> > > so that we lose the context on which node requested the child >> > > transformation. This loss of context leads to the following problems: >> > > 1) The parent node cannot deduce it's physical properties during the >> > > execution of the implementation rule, because Calcite expects the >> > > transformation to be applied before children nodes are implemented. >> That >> > is >> > > if we are optimizing LogicalProject <- LogicalScan, we cannot set >> proper >> > > distribution and collation for the to be created PhysicalProject, >> because >> > > it depends on the distribution and collation of the children which is >> yet >> > > to be resolved. >> > > 2) The branch-and-bound cannot be used because it requires at least >> one >> > > fully-built physical subtree. >> > > >> > > As a result of this limitation, products which rely on Apache Calcite >> for >> > > query optimization, use one or several workarounds: >> > > >> > > *1) Guess the physical properties of parent nodes before logical >> children >> > > are implemented* >> > > *Apache Flink* uses this strategy. The strategy is bad because of the >> > > number of combinations of traits growth exponentially with the number >> of >> > > attributes in the given RelNode, so you either explode the search >> space >> > or >> > > give up optimization opportunities. Consider the following tree: >> > > LogicalSort[a ASC] <- LogicalFilter <- LogicalScan >> > > The optimal implementation of the LogicalFilter is >> > PhysicalFilter[collation=a >> > > ASC] because it may eliminate the parent sort. But such optimization >> > should >> > > happen only if we know that there is a physical implementation of scan >> > > allowing for this sort order, e.g. PhysicalIndexScan[collation=a ASC]. >> > I.e. >> > > we need to know the child physical properties first. Otherwise we >> > fallback >> > > to speculative approaches. With the *optimistic* approach, we emit all >> > > possible combinations of physical properties, with the hope that the >> > child >> > > will satisfy some of them, thus expanding the search space >> exponentially. >> > > With the *pessimistic* approach, we just miss this optimization >> > opportunity >> > > even if the index exists. Apache Flink uses the pessimistic approach. >> > > >> > > *2) Use AbstractConverters* >> > > *Apache Drill* uses this strategy. The idea is to "glue" logical and >> > > physical operators, so that implementation of a physical child >> > re-triggers >> > > implementation rule of a logical parent. The flow is as follows: >> > > - Invoke parent implementation rule - it either doesn't produce new >> > > physical nodes or produce not optimized physical nodes (like in the >> > Apache >> > > Flink case) >> > > - Invoke children implementation rules and create physical children >> > > - Then converters kick-in and re-trigger parent implementation rule >> > through >> > > the creation of an abstract converter >> > > - Finally, the parent implementation rule is fired again and now it >> > > produces optimized node(s) since at least some of the physical >> > > distributions of children nodes are implemented. >> > > >> > > Note that this is essentially a hack to simulate the Cascades flow! >> The >> > > problem is that AbstractConverters increase the complexity of planning >> > > because they do not have any context, so parent rules are just >> > re-triggered >> > > blindly. Consider the optimization of the following tree: >> > > LogicalJoin <- [LogicalScan1, LogicalScan2] >> > > With the converter approach, the join implementation rule will be >> fired >> > at >> > > least 3 times, while in reality, one call should be sufficient. In our >> > > experiments with TPC-H queries, the join rule implemented that way is >> > > typically called 6-9 times more often than expected. >> > > >> > > *3) Transformations (i.e. logical optimization) are decoupled from >> > > implementation (i.e. physical optimization)* >> > > Normally, you would like to mix both logical and physical rules in a >> > single >> > > optimization program, because it is required for proper planning. That >> > is, >> > > you should consider both (Ax(BxC)) and ((AxB)xC) join ordering during >> > > physical optimization, because you do not know which one will produce >> the >> > > better plan in advance. >> > > But in some practical implementations of Calcite-based optimizers, >> this >> > is >> > > not the case, and join planning is performed as a separate HEP stage. >> > > Examples are Apache Drill and Apache Flink. >> > > I believe that lack of Cascades-style flow and branch-and-bound are >> among >> > > the main reasons for this. At the very least for Apache Drill, since >> it >> > > uses converters, so additional logical permutations will exponentially >> > > multiply the number of fired rules, which is already very big. >> > > >> > > Given all these problems I'd like to ask the community to share >> current >> > > thoughts and ideas on the future improvement of the Calcite optimizer. >> > One >> > > of the ideas being discussed in the community is "Pull-up Traits", >> which >> > > should allow the parent node to get physical properties of the >> children >> > > nodes. But in order to do this, you effectively need to implement >> > children, >> > > which IMO makes this process indistinguishable from the classical >> > recursive >> > > Cascades algorithm. >> > > >> > > Have you considered recursive transformations as an alternative >> solution >> > to >> > > that problem? I.e. instead of trying to guess or pull the physical >> > > properties of non-existent physical nodes, go ahead and actually >> > implement >> > > them directly from within the parent rule? This may resolve the >> problem >> > > with trait pull-up, as well as allow for branch-and-bound >> optimization. >> > > >> > > I would appreciate your feedback or any hints for future research. >> > > >> > > Regards, >> > > Vladimir. >> > > >> > >> >> >>
