Avery Qi created SPARK-51885:
--------------------------------

             Summary: Add analyzer support for nested correlated subqueries
                 Key: SPARK-51885
                 URL: https://issues.apache.org/jira/browse/SPARK-51885
             Project: Spark
          Issue Type: Sub-task
          Components: Optimizer, SQL
    Affects Versions: 4.1.0
            Reporter: Avery Qi

* Add support for queries containing nested correlations in the multi-pass analyzer.
** Change AnalysisContext.outerPlan from LogicalPlan to LogicalPlans, containing all the outer plans that outer references might refer to.
** Change the logic in ResolveSubquery that updates AnalysisContext.
** Change ResolveSubquery to update NestedOuterAttrs when subqueries are resolved.
** Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries in the HAVING clause.
** Change UpdateOuterReferences to update NestedOuterAttrs as well.
* Add new error types and check-analysis methods.
** Add a new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}}, which prompts users to turn on the {{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} config for queries containing nested correlations.
** Add check-analysis methods that verify the config is turned on for queries containing nested correlations.
** Add check-analysis methods that ensure the main query does not contain subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means that the subquery contains outer references that cannot be resolved in the subquery or in its containing query, but might be resolved in queries further out. This is not allowed for the main query because it is the outermost query.)

Currently the config defaults to false because the optimizer changes will come in later PRs. The behavior of lateralSubquery is unchanged: nested correlations are not allowed in lateralSubquery for now.

Spark currently supports only one level of correlation and does not support nested correlation. For example,

    SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
    WHERE EXISTS (
      SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
      WHERE t2.col2 == MAX(t1.col2)
    )
    GROUP BY col1;

is supported, and

    SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
    WHERE EXISTS (
      SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
      WHERE t2.col2 == (
        SELECT MAX(t1.col2)
      )
    )
    GROUP BY col1;

is not supported.

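To make the difference between the two queries concrete, here is a toy Python model of the checks described above. It is only a sketch and not Spark's Analyzer: the names Query, unresolved_refs, check_subqueries and the TOY_UNRESOLVED_NESTED_OUTER_REFERENCE label are hypothetical, and columns are modelled as plain qualified strings. Only the error type NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED and the config key come from this issue.

```python
from dataclasses import dataclass, field

# Config key named in this issue; everything else below is a toy assumption.
CONF_KEY = "spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled"


class AnalysisError(Exception):
    """Toy stand-in for an analysis failure carrying an error class."""

    def __init__(self, error_class, message):
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class


@dataclass
class Query:
    """Toy query block: columns it produces, columns it references, subqueries."""
    name: str
    local_columns: set
    references: set
    subqueries: list = field(default_factory=list)


def unresolved_refs(query):
    """References that `query`, including its subqueries, cannot resolve itself."""
    pending = set(query.references)
    for sub in query.subqueries:
        pending |= unresolved_refs(sub)
    return pending - query.local_columns


def check_subqueries(query, nested_enabled, is_main=True, found=None):
    """Return {subquery name: nested outer attrs}, raising on disallowed cases."""
    found = {} if found is None else found
    for sub in query.subqueries:
        # Nested outer attrs: not resolvable in the subquery or its containing query.
        nested = unresolved_refs(sub) - query.local_columns
        found[sub.name] = nested
        if nested and is_main:
            # The main query is the outermost query, so nothing further out can help.
            raise AnalysisError("TOY_UNRESOLVED_NESTED_OUTER_REFERENCE",
                                f"{sorted(nested)} cannot be resolved")
        if nested and not nested_enabled:
            raise AnalysisError("NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED",
                                f"set {CONF_KEY} to true")
        check_subqueries(sub, nested_enabled, is_main=False, found=found)
    return found


T1 = {"t1.col1", "t1.col2"}
T2 = {"t2.col1", "t2.col2"}

# First query: the EXISTS subquery refers directly to t1.col2 (one level).
supported = Query("main", T1, {"t1.col1"}, [
    Query("exists", T2, {"t2.col1", "t2.col2", "t1.col2"}),
])

# Second query: t1.col2 is referenced from a scalar subquery inside EXISTS.
nested = Query("main", T1, {"t1.col1"}, [
    Query("exists", T2, {"t2.col1", "t2.col2"}, [
        Query("scalar", set(), {"t1.col2"}),
    ]),
])
```

In this model, check_subqueries(supported, nested_enabled=False) succeeds because the only outer reference resolves in the containing query. check_subqueries(nested, nested_enabled=False) raises the NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED error, and with nested_enabled=True it reports t1.col2 as a nested outer attr of the innermost subquery.
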
The reason Spark does not support nested correlation is that the Analyzer and Optimizer resolve and plan subqueries recursively. This task adds Analyzer support for queries containing nested correlations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)