[ https://issues.apache.org/jira/browse/SPARK-51885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Qi updated SPARK-51885:
-----------------------------
Description:

h2. The task includes:
* Add support for queries containing nested correlations in the multi-pass analyzer.
** Change AnalysisContext.outerPlan from a single LogicalPlan to a collection of LogicalPlans, containing all the outer plans that outer references might refer to (see the first sketch after the High Level Design section below).
** Change the logic that updates AnalysisContext in ResolveSubquery.
** Change ResolveSubquery to update NestedOuterAttrs when subqueries are resolved.
** Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries in the HAVING clause.
** Change UpdateOuterReferences to update NestedOuterAttrs as well.
* Add new error types and check analysis methods.
** Add a new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}} which prompts users to turn on the {{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} config for queries containing nested correlations.
** Add new check analysis methods to verify that the config is turned on for queries containing nested correlations.
** Add new check analysis methods to ensure the main query does not contain subqueries with nested outer attrs (see the second sketch below). NestedOuterAttrs.nonEmpty means the subquery contains outer references that cannot be resolved in the subquery or in its immediately containing query, but might be resolved in a query nested further out. This is not allowed for the main query, since it is the outermost query.

h2. Why is the change needed?

Currently the config is set to false by default, as the optimizer changes will come in later PRs. The behavior of lateral subqueries is not changed; we do not allow nested correlations in lateral subqueries for now.

Spark currently supports only one layer of correlation and does not support nested correlation. For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == MAX(t1.col2)
)
GROUP BY col1;

is supported, and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == ( SELECT MAX(t1.col2) )
)
GROUP BY col1;

is not supported. The reason Spark does not support this is that the Analyzer and Optimizer resolve and plan subqueries recursively. This task adds Analyzer support for queries containing nested correlations.

h2. High Level Design

[https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]
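To make the intended AnalysisContext change concrete, here is a minimal, self-contained Scala sketch. It does not use Spark's actual classes; the stand-in types, field names, the helper method, and the innermost-first ordering are simplified assumptions for illustration only.

{code:scala}
// Simplified stand-ins for illustration; the real classes live in Spark's Catalyst analyzer.
trait LogicalPlan { def name: String }
case class Relation(name: String) extends LogicalPlan

// Before (roughly): at most one outer plan is visible while resolving a subquery.
case class AnalysisContextBefore(outerPlan: Option[LogicalPlan] = None)

// After (sketch): keep every enclosing query block, innermost first, so an outer
// reference that the immediate outer query cannot resolve can still be matched
// against a plan further out.
case class AnalysisContextAfter(outerPlans: Seq[LogicalPlan] = Nil) {
  // Entering one more level of subquery nesting pushes the new outer plan.
  def withOuterPlan(p: LogicalPlan): AnalysisContextAfter =
    copy(outerPlans = p +: outerPlans)
}

object OuterPlanSketch {
  def main(args: Array[String]): Unit = {
    val ctx = AnalysisContextAfter()
      .withOuterPlan(Relation("t1")) // outermost query block
      .withOuterPlan(Relation("t2")) // intermediate subquery
    // While resolving the innermost subquery, both t2 and t1 are candidate outer plans.
    println(ctx.outerPlans.map(_.name)) // List(t2, t1)
  }
}
{code}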
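Likewise, a hedged sketch of the proposed check-analysis behavior. The error name and config key are taken from this issue; everything else (types, helper names, the use of IllegalArgumentException) is a hypothetical simplification, not Spark's actual CheckAnalysis code.

{code:scala}
// Hypothetical, simplified model of a resolved subquery expression: it records the
// outer references that could not be resolved by the subquery itself or by its
// immediately containing query (the "nested" outer attributes).
case class SubqueryInfo(nestedOuterAttrs: Seq[String])

object NestedCorrelationChecks {
  // Config key as named in this issue; the plumbing around it here is invented.
  val confKey = "spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled"

  def check(
      subqueries: Seq[SubqueryInfo],
      nestedCorrelationEnabled: Boolean,
      isMainQuery: Boolean): Unit = {
    val hasNestedRefs = subqueries.exists(_.nestedOuterAttrs.nonEmpty)
    if (hasNestedRefs && !nestedCorrelationEnabled) {
      // Mirrors the proposed NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED error.
      throw new IllegalArgumentException(
        s"NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED: set $confKey to true")
    }
    if (isMainQuery && hasNestedRefs) {
      // The outermost query has no further enclosing query that could resolve the
      // leftover outer references, so this must always be rejected.
      throw new IllegalArgumentException(
        "The main query must not contain subqueries with nested outer attributes")
    }
  }

  def main(args: Array[String]): Unit = {
    // A subquery that still needs t1.col2 from a query two levels out:
    val nested = SubqueryInfo(nestedOuterAttrs = Seq("t1.col2"))
    // Allowed: an intermediate query block may carry nested outer attrs when the config is on.
    check(Seq(nested), nestedCorrelationEnabled = true, isMainQuery = false)
    // Rejected (would throw): the main query can never carry them.
    // check(Seq(nested), nestedCorrelationEnabled = true, isMainQuery = true)
  }
}
{code}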
> Add analyzer support for nested correlated subqueries
> -----------------------------------------------------
>
>                 Key: SPARK-51885
>                 URL: https://issues.apache.org/jira/browse/SPARK-51885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major