[ https://issues.apache.org/jira/browse/SPARK-51885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-51885:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add analyzer support for nested correlated subqueries
> -----------------------------------------------------
>
>                 Key: SPARK-51885
>                 URL: https://issues.apache.org/jira/browse/SPARK-51885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major
>              Labels: pull-request-available
>
> h2. The task includes:
>  * Add support for queries containing nested correlations in the multi-pass analyzer.
>  ** Change AnalysisContext.outerPlan from a single LogicalPlan to multiple LogicalPlans, containing all the outer plans that outer references might refer to.
>  ** Change the logic in ResolveSubquery that updates AnalysisContext.
>  ** Change ResolveSubquery to update NestedOuterAttrs when subqueries are resolved.
>  ** Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries in the HAVING clause.
>  ** Change UpdateOuterReferences to update NestedOuterAttrs as well.
>  * Add new error types and check-analysis methods.
>  ** Add a new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}}, which prompts users to turn on the {{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} config for queries containing nested correlations.
>  ** Add new check-analysis methods to check that the config is turned on for queries containing nested correlations.
>  ** Add new check-analysis methods to ensure that the main query does not contain subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means that the subquery contains outer references that can't be resolved in the subquery or in its containing query, but might be resolved in queries further out. This is not allowed for the main query, as it is the outermost query.)
>
> h2. Why is the change needed?
> Currently the config is set to false by default, as the optimizer changes will come in later PRs.
> The behavior of lateralSubquery is not changed. We don't allow nested correlations in lateralSubquery for now.
> Spark currently supports only one layer of correlation and does not support nested correlation.
> For example,
>
> SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
>   SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == MAX(t1.col2)
> ) GROUP BY col1;
>
> is supported and
>
> SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
>   SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == (
>     SELECT MAX(t1.col2)
>   )
> ) GROUP BY col1;
>
> is not supported.
> Spark does not support it because the Analyzer and Optimizer resolve and plan subqueries recursively.
> This task adds Analyzer support for queries containing nested correlations.
>
> h2. High Level Design
> [https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
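
For illustration only, the nested outer attribute bookkeeping described in the ticket can be pictured with a small toy model. This is a sketch in Python, not Spark code: `Query`, `outer_references`, `nested_outer_attrs`, and `check_analysis` are hypothetical names invented here, and columns are modelled as plain qualified strings. Only the error name and the meaning of a non-empty NestedOuterAttrs are taken from the ticket text.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Toy query block (hypothetical, not a Spark class)."""
    name: str
    columns: frozenset                    # qualified columns this block produces
    references: frozenset = frozenset()   # qualified columns this block uses
    subqueries: list = field(default_factory=list)


def outer_references(query):
    """Columns needed by `query` or its subqueries that `query` cannot supply."""
    needed = set(query.references)
    for sub in query.subqueries:
        needed |= outer_references(sub)
    return needed - query.columns


def nested_outer_attrs(query):
    """Map each subquery name to the references that neither the subquery nor
    its containing query can resolve (toy analogue of NestedOuterAttrs)."""
    found = {}
    for sub in query.subqueries:
        nested = outer_references(sub) - query.columns
        if nested:
            found[sub.name] = nested
        found.update(nested_outer_attrs(sub))
    return found


def check_analysis(main, nested_enabled=False):
    """Toy version of the two checks listed in the ticket."""
    nested = nested_outer_attrs(main)
    # The main query is outermost, so its direct subqueries may not carry
    # nested outer attrs: there is no further-out query to resolve them.
    top_level = [s.name for s in main.subqueries if s.name in nested]
    if top_level:
        raise ValueError(f"unresolvable outer references in {top_level}")
    # Nested correlations elsewhere are allowed only when the flag is on.
    if nested and not nested_enabled:
        raise ValueError("NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED")
    return nested


T1 = frozenset({"t1.col1", "t1.col2"})
T2 = frozenset({"t2.col1", "t2.col2"})

# First query in the ticket: the EXISTS subquery references t1 directly.
single_level = Query("main", T1, subqueries=[
    Query("exists", T2, references=frozenset({"t2.col1", "t2.col2", "t1.col2"})),
])

# Second query: t1.col2 is referenced two levels down, in a scalar subquery.
two_level = Query("main", T1, subqueries=[
    Query("exists", T2, references=frozenset({"t2.col1", "t2.col2"}), subqueries=[
        Query("scalar", frozenset(), references=frozenset({"t1.col2"})),
    ]),
])
```

In this model, `check_analysis(single_level)` passes with the flag off, while `check_analysis(two_level)` raises unless `nested_enabled=True`, because the scalar subquery refers to a column that only the query two levels out can supply.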