jayzhan211 commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769722147
The projection required to be in the group expression. I think the query of these 2 are equivalent but the subquery one group by `e2.b` and the join query group by `e1.b`. Not sure if this rewrite could be general enough 🤔 ``` query IT select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1; ---- 0 a 2 a query TT explain select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1; ---- logical_plan 01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END 02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS Int64) 03)----SubqueryAlias: e1 04)------TableScan: t1 projection=[b] 05)----SubqueryAlias: __scalar_sq_1 06)------Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b, Boolean(true) AS __always_true 07)--------Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]] 08)----------SubqueryAlias: e2 09)------------TableScan: t2 projection=[a, b] physical_plan 01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END] 02)--CoalesceBatchesExec: target_batch_size=8192 03)----HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4] 04)------CoalesceBatchesExec: target_batch_size=8192 05)--------RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), input_partitions=1 06)----------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)] 07)------------DataSourceExec: partitions=1, partition_sizes=[1] 08)------CoalesceBatchesExec: target_batch_size=8192 09)--------RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS Int64)@3], 4), input_partitions=4 10)----------ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS Int64)] 11)------------AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)] 12)--------------CoalesceBatchesExec: target_batch_size=8192 13)----------------RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4 14)------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 15)--------------------AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)] 16)----------------------DataSourceExec: partitions=1, partition_sizes=[1] query IT SELECT e1.b, CASE WHEN MAX(e2.a) > 10 THEN 'a' ELSE 'b' END AS result FROM t2 e2 LEFT JOIN t1 e1 ON e2.b = e1.b + 1 GROUP BY e1.b; ---- 2 a 0 a query TT explain SELECT e1.b, CASE WHEN MAX(e2.a) > 10 THEN 'a' ELSE 'b' END AS result FROM t2 e2 LEFT JOIN t1 e1 ON e2.b = e1.b + 1 GROUP BY e1.b; ---- logical_plan 01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS result 02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]] 03)----Projection: e2.a, e1.b 04)------Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1) 05)--------SubqueryAlias: e2 06)----------TableScan: t2 projection=[a, b] 07)--------SubqueryAlias: e1 08)----------TableScan: t1 projection=[b] physical_plan 01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as result] 02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)] 03)----CoalesceBatchesExec: target_batch_size=8192 04)------RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4 05)--------AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)] 06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 07)------------ProjectionExec: expr=[a@1 as a, b@0 as b] 08)--------------CoalesceBatchesExec: target_batch_size=8192 09)----------------HashJoinExec: mode=Partitioned, join_type=Right, on=[(e1.b + Int64(1)@1, CAST(e2.b AS Int64)@2)], projection=[b@0, a@2] 10)------------------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)] 11)--------------------DataSourceExec: partitions=1, partition_sizes=[1] 12)------------------ProjectionExec: expr=[a@0 as a, b@1 as b, CAST(b@1 AS Int64) as CAST(e2.b AS Int64)] 13)--------------------DataSourceExec: partitions=1, partition_sizes=[1] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org