[ 
https://issues.apache.org/jira/browse/SPARK-51885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51885:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add analyzer support for nested correlated subqueries
> -----------------------------------------------------
>
>                 Key: SPARK-51885
>                 URL: https://issues.apache.org/jira/browse/SPARK-51885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major
>              Labels: pull-request-available
>
> h2. The task includes:
>  * Add support for queries containing nested correlations to the multi-pass 
> analyzer.
>  ** Change AnalysisContext.outerPlan from a single LogicalPlan to multiple 
> LogicalPlans, containing all the outer plans that outer references might 
> refer to.
>  ** Change the logic in ResolveSubquery that updates AnalysisContext.
>  ** Change ResolveSubquery to update NestedOuterAttrs when subqueries are 
> resolved.
>  ** Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries 
> in the HAVING clause.
>  ** Change UpdateOuterReferences to update NestedOuterAttrs as well.
>  * Add new error types and check analysis methods.
>  ** Add a new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}}, 
> which prompts users to turn on the 
> {{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} config for 
> queries containing nested correlations.
>  ** Add new check analysis methods that verify the config is turned on for 
> queries containing nested correlations.
>  ** Add new check analysis methods that ensure the main query does not 
> contain subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means 
> that the subquery contains outer references that can't be resolved in the 
> subquery or in its containing query, but might be resolved in queries 
> further out. This is not allowed for the main query, as it is the outermost 
> query.)
> h2. Why is the change needed?
> Currently the config is set to false by default, as the optimizer changes 
> will come in later PRs.
> The behavior of lateralSubquery is unchanged: nested correlations are not 
> allowed in lateralSubquery for now.
> Spark currently supports only one layer of correlation and does not support 
> nested correlation.
> For example,
> SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
> WHERE EXISTS (
>   SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
>   WHERE t2.col2 == MAX(t1.col2)
> )
> GROUP BY col1;
>  
> is supported and
> SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
> WHERE EXISTS (
>   SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
>   WHERE t2.col2 == (
>     SELECT MAX(t1.col2)
>   )
> )
> GROUP BY col1;
>  
> is not supported.
> Spark does not support it because the Analyzer and Optimizer resolve and 
> plan subqueries recursively.
> This task adds Analyzer support for queries containing nested 
> correlations.
> h2. High Level Design
> [https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]
>  
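
The description above distinguishes outer references resolved in the directly enclosing query from those that can only be resolved further out. As a rough illustration, the toy Python model below mimics that bookkeeping with a chain of outer scopes. It is only a sketch: Query, resolve, check, AnalysisError and the nested_enabled flag are hypothetical names invented for this example and are not Spark's API. Only the error name NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED and the idea of nested outer attrs come from the issue text.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


class AnalysisError(Exception):
    """Toy stand-in for an analyzer error; not Spark's exception class."""


@dataclass
class Query:
    """Hypothetical query block used only for this illustration."""
    name: str
    columns: Set[str]      # columns this query block produces
    references: Set[str]   # columns this query block uses
    subqueries: List["Query"] = field(default_factory=list)
    outer_attrs: Set[str] = field(default_factory=set)         # resolved one level out
    nested_outer_attrs: Set[str] = field(default_factory=set)  # resolved further out


def resolve(query: Query, outer: Tuple[Query, ...] = ()) -> None:
    """Resolve references; outer[0] is the directly enclosing query,
    later entries are queries further out (the 'multiple outer plans')."""
    pending = set(query.references)
    for sub in query.subqueries:
        resolve(sub, (query,) + outer)
        # Whatever the subquery deferred past `query` must be handled here.
        pending |= sub.nested_outer_attrs
    for ref in sorted(pending):
        if ref in query.columns:
            continue  # local reference
        if outer and ref in outer[0].columns:
            query.outer_attrs.add(ref)
        elif any(ref in scope.columns for scope in outer[1:]):
            query.nested_outer_attrs.add(ref)
        else:
            raise AnalysisError(f"cannot resolve {ref} in {query.name}")


def check(main: Query, nested_enabled: bool) -> None:
    """Toy version of the two checks described in the issue."""
    # The main query is outermost, so its subqueries may not defer anything.
    # (In this toy model resolve() already raises for such references.)
    if any(sub.nested_outer_attrs for sub in main.subqueries):
        raise AnalysisError("nested outer reference left unresolved in main query")
    stack = list(main.subqueries)
    while stack:
        q = stack.pop()
        if q.nested_outer_attrs and not nested_enabled:
            raise AnalysisError("NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED")
        stack.extend(q.subqueries)


# Shape of the unsupported query from the issue: the innermost scalar
# subquery refers to t1.col2, which belongs to the main query two levels out.
inner = Query("scalar", columns=set(), references={"t1.col2"})
exists = Query("exists", columns={"t2.col1", "t2.col2"},
               references={"t2.col1", "t2.col2"}, subqueries=[inner])
main = Query("main", columns={"t1.col1", "t1.col2"},
             references={"t1.col1"}, subqueries=[exists])

resolve(main)
check(main, nested_enabled=True)  # passes only when the toy flag is on
```

In this model the innermost subquery records t1.col2 as a nested outer attr, the EXISTS subquery records it as an ordinary outer attr, and check(main, nested_enabled=False) raises the error named in the issue.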



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

