Hi Juri, Regarding your solution, I think it will fail for you in case of queries that must have cross join, such as "select * from t1 join t2", so I will be cautious with that. Perhaps you can use "makeHugeCost()" instead of "makeInfiniteCost()".
On Tue, Apr 1, 2025 at 5:00 PM Juri Petersen <j...@apache.org> wrote: > For anyone reading this in the future, here is our approach to a > "solution": > > If you want to discourage the optimizer from generating plans with > cartesian products and multi-conditional joins, a trivial way is to set a > high cost in your custom Join node when matched: > > if (condition.isAlwaysTrue()) { > RelOptCostFactory costFactory = planner.getCostFactory(); > > return costFactory.makeInfiniteCost(); > } > > if (condition.isA(SqlKind.AND)) { > RelOptCostFactory costFactory = planner.getCostFactory(); > > return costFactory.makeInfiniteCost(); > } > > This resolved the problems for us. > > Best, > Juri > > > On 2025/03/31 11:00:58 Alessandro Solimando wrote: > > Hi Zoi, > > in Calcite rules transform a valid (sub-)plan into one or more other > valid > > plans, there are no rules "forbidding" anything that is a legal plan, > they > > just generate alternatives for the AND-OR graph. > > > > As I suggested a few emails before, your best shot to understand what's > > missing to get rid of the last cartesian product, is to enable extra > > debugging and see at what step the sought transformation should/could > > happen, and debug to see why it's not the case. > > > > Extra logging helps to get a hold of the specific rule application number > > or id of the object, as rules get applied recursively many times, so you > > need a conditional breakpoint on one of the two aforementioned conditions > > to be able to reach the relevant rule application. > > > > Best regards, > > Alessandro > > > > On Mon, 31 Mar 2025 at 12:50, Zoi Kaoudi <zkao...@apache.org> wrote: > > > > > Hi all, > > > > > > jumping into the conversation, isn't there a calcite rule that forbids > > > cartesian products? > > > > > > Best > > > -- > > > Zoi > > > > > > On 2025/03/31 07:46:26 Juri Petersen wrote: > > > > Hi, > > > > I tried applying the following rules for my example query: > > > > > > > > final RuleSet wayangRules = RuleSets.ofList( > > > > CoreRules.FILTER_INTO_JOIN, > > > > CoreRules.MULTI_JOIN_OPTIMIZE_BUSHY, > > > > CoreRules.JOIN_COMMUTE, > > > > CoreRules.JOIN_ASSOCIATE > > > > ); > > > > > > > > The input tree looks like this: > > > > > > > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], > > > russian_movie=[MIN($1)]) > > > > LogicalProject(name=[$1], title=[$31]) > > > > LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11, > > > '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), > =($30, > > > $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), > =($21, > > > $26))]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > LogicalTableScan(table=[[postgres, char_name]]) > > > > LogicalTableScan(table=[[postgres, cast_info]]) > > > > LogicalTableScan(table=[[postgres, company_name]]) > > > > LogicalTableScan(table=[[postgres, company_type]]) > > > > LogicalTableScan(table=[[postgres, movie_companies]]) > > > > LogicalTableScan(table=[[postgres, role_type]]) > > > > LogicalTableScan(table=[[postgres, title]]) > > > > > > > > The resulting converted tree is close to what I desire, however one > > > multi-condition join can't be pushed down, leading to a tree with on > > > cartesian product remaining: > > > > > > > > WayangAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], > > > russian_movie=[MIN($1)]) > > > > WayangProject(name=[$1], title=[$31]) > > > > WayangJoin(condition=[AND(=($24, $30), =($28, $13))], > > > joinType=[inner]) > > > > WayangJoin(condition=[=($9, $24)], joinType=[inner]) > > > > WayangJoin(condition=[=($0, $10)], joinType=[inner]) > > > > WayangTableScan(table=[[postgres, char_name]]) > > > > WayangFilter(condition=[AND(LIKE($4, '%(voice)%'), LIKE($4, > > > '%(uncredited)%'))]) > > > > WayangTableScan(table=[[postgres, cast_info]]) > > > > WayangJoin(condition=[=($0, $11)], joinType=[inner]) > > > > WayangFilter(condition=[=($2, '[ru]')]) > > > > WayangTableScan(table=[[postgres, company_name]]) > > > > WayangJoin(condition=[=($0, $5)], joinType=[inner]) > > > > WayangTableScan(table=[[postgres, company_type]]) > > > > WayangTableScan(table=[[postgres, movie_companies]]) > > > > WayangJoin(condition=[true], joinType=[inner]) > > > > WayangFilter(condition=[=($1, 'actor')]) > > > > WayangTableScan(table=[[postgres, role_type]]) > > > > WayangFilter(condition=[>($4, 2005)]) > > > > WayangTableScan(table=[[postgres, title]]) > > > > > > > > > > > > Looking at my input query, the role_type table has a specified join > > > condition on cast_info. > > > > Am I missing a detail that prevents me from being able to deconstruct > > > the multi-conditional join here? > > > > > > > > Any help would be greatly appreciated. > > > > > > > > Best, > > > > Juri > > > > > > > > On 2025/03/27 10:57:26 Dong Silun wrote: > > > > > Hi Juri, > > > > > As Alessandro said, the Join order prevents the predicates from > being > > > pushed down to the ideal position. > > > > > You can try to use the two rules CoreRules.JOIN_COMMUTE and > > > CoreRules.JOIN_ASSOCIATE instead of the heuristic/dp join reorder > > > algorithm. In the case of all inner joins, CoreRules.JOIN_COMMUTE and > > > CoreRules.JOIN_ASSOCIATE will generate all join order possibilities > (when > > > using VolcanoPlanner), so as to get the join order that can smoothly > push > > > all predicates down to the ideal position (combined with the > FilterIntoJoin > > > rule). > > > > > However, the optimization process may be time-consuming because > there > > > are a total of 7 tables involved in join and the commutative and > > > associative rules are used to enumerate every possibility. > > > > > I didn't actually run your example, I just provided an idea, I > hope it > > > can help you. > > > > > > > > > > Best, > > > > > Silun > > > > > > > > > > ________________________________ > > > > > 发件人: Juri Petersen <j...@apache.org> > > > > > 发送时间: 2025年3月27日 16:40 > > > > > 收件人: dev@calcite.apache.org <dev@calcite.apache.org> > > > > > 主题: Re: FIlterIntoJoinRule applied without complete result > > > > > > > > > > Hi, > > > > > Thank you for your answer! > > > > > I see your point about join ordering, thats also why I tried using > the > > > MULTI_JOIN_OPTIMZE CoreRule before. > > > > > I tried it again just now, and these rules still don't resolve my > > > problem: > > > > > > > > > > final RuleSet rules = RuleSets.ofList( > > > > > CoreRules.FILTER_INTO_JOIN, > > > > > CoreRules.MULTI_JOIN_OPTIMIZE > > > > > ); > > > > > > > > > > I tried both the smart and dumb FILTER_INTO_JOIN and also the bushy > > > version of MULTI_JOIN_OPTIMIZE. > > > > > > > > > > The message I get when trying to optimize the plan is the > following: > > > > > > > > > > org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There > are > > > not enough rules to produce a node with desired properties: > > > convention=NONE. All the inputs have relevant nodes, however the cost > is > > > still infinite. > > > > > Root: rel#55:RelSubset#15.NONE > > > > > Original rel: > > > > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], > > > russian_movie=[MIN($1)]): rowcount = 1.0, cumulative cost = > > > 1.0101010125097225E14, id = 30 > > > > > LogicalProject(name=[$1], title=[$31]): rowcount = > > > 120135.49804687499, cumulative cost = 1.01010101250971E14, id = 29 > > > > > LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11, > > > '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), > =($30, > > > $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), > =($21, > > > $26))]): rowcount = 120135.49804687499, cumulative cost = > > > 1.010101011308355E14, id = 26 > > > > > LogicalJoin(condition=[true], joinType=[inner]): rowcount = > > > 1.0E14, cumulative cost = 1.010101010107E14, id = 25 > > > > > LogicalJoin(condition=[true], joinType=[inner]): rowcount = > > > 1.0E12, cumulative cost = 1.0101010106E12, id = 21 > > > > > LogicalJoin(condition=[true], joinType=[inner]): > rowcount = > > > 1.0E10, cumulative cost = 1.01010105E10, id = 17 > > > > > LogicalJoin(condition=[true], joinType=[inner]): > rowcount > > > = 1.0E8, cumulative cost = 1.010104E8, id = 13 > > > > > LogicalJoin(condition=[true], joinType=[inner]): > > > rowcount = 1000000.0, cumulative cost = 1010300.0, id = 9 > > > > > LogicalJoin(condition=[true], joinType=[inner]): > > > rowcount = 10000.0, cumulative cost = 10200.0, id = 5 > > > > > LogicalTableScan(table=[[postgres, char_name]]): > > > rowcount = 100.0, cumulative cost = 100.0, id = 1 > > > > > LogicalTableScan(table=[[postgres, cast_info]]): > > > rowcount = 100.0, cumulative cost = 100.0, id = 3 > > > > > LogicalTableScan(table=[[postgres, company_name]]): > > > rowcount = 100.0, cumulative cost = 100.0, id = 7 > > > > > LogicalTableScan(table=[[postgres, company_type]]): > > > rowcount = 100.0, cumulative cost = 100.0, id = 11 > > > > > LogicalTableScan(table=[[postgres, movie_companies]]): > > > rowcount = 100.0, cumulative cost = 100.0, id = 15 > > > > > LogicalTableScan(table=[[postgres, role_type]]): > rowcount = > > > 100.0, cumulative cost = 100.0, id = 19 > > > > > LogicalTableScan(table=[[postgres, title]]): rowcount = > 100.0, > > > cumulative cost = 100.0, id = 23 > > > > > > > > > > > > > > > I hope this specifies my problem a bit more. > > > > > > > > > > Best, > > > > > Juri > > > > > > > > > > On 2025/03/26 13:54:33 Alessandro Solimando wrote: > > > > > > Hi Juri, > > > > > > it's true that the tables in the joins are fully connected via > the > > > > > > predicates, but order matters and the concrete order I see can't > do > > > without > > > > > > cartesian products: it's joining "company_type" with other tables > > > before > > > > > > joining with "movie_companies", but the only predicate in the > where > > > clause > > > > > > around "company_type" is "ct.id = mc.company_type_id", which > can't > > > be used > > > > > > in that subtree as "movie_companies" hasn't been joined yet, so > > > basically > > > > > > it's a join ordering "issue" (which could not be an issue at all > > > based on > > > > > > the size of the tables, selectivity of the predicates etc.). > > > > > > > > > > > > Are you using rules for join ordering like LoptOptimizeJoinRule > > > > > > < > > > > https://github.com/apache/calcite/blob/bfbe8930f4ed7ba8da530e862e212a057191cfa3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java > > > > > > > > > > in your program (the set of rules you use could help people > provide a > > > > > > better answer)? If you are using 1.39.0 there is a new join > ordering > > > > > > algorithm, you can refer to CALCITE-6846 > > > > > > <https://issues.apache.org/jira/browse/CALCITE-6846> and > related PR > > > for > > > > > > more details which should be exhaustive. > > > > > > > > > > > > If you think you have added all the rules and you can't still > get a > > > sense > > > > > > of why you end up with a particular plan, you can activate the > > > extended > > > > > > logs around rule applications and transformations to be able to > then > > > put > > > > > > breakpoints in the involved rules at the specific step which is > > > generally > > > > > > tricky as rules are called multiple times. You can refer to these > > > slides > > > > > > > > > > https://www.slideshare.net/StamatisZampetakis/debugging-planning-issues-using-calcites-builtin-loggers > > > > > > (there is also the full video and other links at > > > > > > https://calcite.apache.org/community/, the talk is "Debugging > > > planning > > > > > > issues using Calcite’s built in loggers"). > > > > > > > > > > > > Best regards, > > > > > > Alessandro > > > > > > > > > > > > On Wed, 26 Mar 2025 at 11:10, Juri Petersen <j...@itu.dk.invalid > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > As mentioned by Mads in a previous mail, we are working on a > > > SQL-API in > > > > > > > Apache Wayang. > > > > > > > We are trying to set up experiments with the JOB Benchmark and > see > > > that we > > > > > > > have to rewrite queries to explicit INNER JOINS for them to be > > > parsed > > > > > > > correctly. > > > > > > > Since we are planning to do other benchmarks with thousands of > > > queries, > > > > > > > rewriting is not feasible. > > > > > > > > > > > > > > Given this (not-rewritten) query from JOB: > > > > > > > > > > > > > > SELECT MIN(chn.name) AS uncredited_voiced_character, > > > > > > > MIN(t.title) AS russian_movie > > > > > > > FROM postgres.char_name AS chn, > > > > > > > postgres.cast_info AS ci, > > > > > > > postgres.company_name AS cn, > > > > > > > postgres.company_type AS ct, > > > > > > > postgres.movie_companies AS mc, > > > > > > > postgres.role_type AS rt, > > > > > > > postgres.title AS t > > > > > > > WHERE ci.note LIKE '%(voice)%' > > > > > > > AND ci.note LIKE '%(uncredited)%' > > > > > > > AND cn.country_code = '[ru]' > > > > > > > AND rt.role = 'actor' > > > > > > > AND t.production_year > 2005 > > > > > > > AND t.id = mc.movie_id > > > > > > > AND t.id = ci.movie_id > > > > > > > AND ci.movie_id = mc.movie_id > > > > > > > AND chn.id = ci.person_role_id > > > > > > > AND rt.id = ci.role_id > > > > > > > AND cn.id = mc.company_id > > > > > > > AND ct.id = mc.company_type_id; > > > > > > > > > > > > > > We use calcite to get the following tree: > > > > > > > > > > > > > > LogicalAggregate(group=[{}], > uncredited_voiced_character=[MIN($0)], > > > > > > > russian_movie=[MIN($1)]) > > > > > > > LogicalProject(name=[$1], title=[$31]) > > > > > > > LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), > LIKE($11, > > > > > > > '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, > 2005), > > > =($30, > > > > > > > $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, > > > $25), =($21, > > > > > > > $26))]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalTableScan(table=[[postgres, > char_name]]) > > > > > > > LogicalTableScan(table=[[postgres, > cast_info]]) > > > > > > > LogicalTableScan(table=[[postgres, > company_name]]) > > > > > > > LogicalTableScan(table=[[postgres, > company_type]]) > > > > > > > LogicalTableScan(table=[[postgres, > movie_companies]]) > > > > > > > LogicalTableScan(table=[[postgres, role_type]]) > > > > > > > LogicalTableScan(table=[[postgres, title]]) > > > > > > > > > > > > > > > > > > > > > I then try to apply the CoreRules.FILTER_INTO_JOIN (tried smart > > > and dumb > > > > > > > version), in order to avoid the cartesian products, hoping to > push > > > the join > > > > > > > conditions into the respective LogicalJoins. > > > > > > > Heres the resulting tree: > > > > > > > > > > > > > > LogicalAggregate(group=[{}], > uncredited_voiced_character=[MIN($0)], > > > > > > > russian_movie=[MIN($1)]) > > > > > > > LogicalProject(name=[$1], title=[$31]) > > > > > > > LogicalJoin(condition=[=($24, $30)], joinType=[inner]) > > > > > > > LogicalJoin(condition=[=($28, $13)], joinType=[inner]) > > > > > > > LogicalJoin(condition=[AND(=($9, $24), =($14, $25), > =($21, > > > $26))], > > > > > > > joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[true], joinType=[inner]) > > > > > > > LogicalJoin(condition=[=($0, $10)], > joinType=[inner]) > > > > > > > LogicalTableScan(table=[[postgres, char_name]]) > > > > > > > LogicalFilter(condition=[AND(LIKE($4, > '%(voice)%'), > > > > > > > LIKE($4, '%(uncredited)%'))]) > > > > > > > LogicalTableScan(table=[[postgres, > cast_info]]) > > > > > > > LogicalFilter(condition=[=($2, '[ru]')]) > > > > > > > LogicalTableScan(table=[[postgres, > company_name]]) > > > > > > > LogicalTableScan(table=[[postgres, company_type]]) > > > > > > > LogicalTableScan(table=[[postgres, movie_companies]]) > > > > > > > LogicalFilter(condition=[=($1, 'actor')]) > > > > > > > LogicalTableScan(table=[[postgres, role_type]]) > > > > > > > LogicalFilter(condition=[>($4, 2005)]) > > > > > > > LogicalTableScan(table=[[postgres, title]]) > > > > > > > > > > > > > > Some of the conditions are pushed down, but we still have > remaining > > > > > > > cartesian products and a multi-condition join. > > > > > > > Looking at the input query, I would expect every Join to have a > > > condition, > > > > > > > as there are no unspecified joins, right? > > > > > > > What am I missing or what can we do to deconstruct the > > > multi-conditional > > > > > > > join and avoid cartesian products? > > > > > > > > > > > > > > Thanks in advance for any help! > > > > > > > > > > > > > > Best, > > > > > > > Juri > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >