[ https://issues.apache.org/jira/browse/SPARK-51885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Qi updated SPARK-51885:
-----------------------------
Description:

h2. The task includes:
* Add support for queries containing nested correlations in the multi-pass analyzer.
** Change AnalysisContext.outerPlan from a single LogicalPlan to a collection of LogicalPlans, containing all the outer plans that outer references might refer to (see the first sketch after the High Level Design section below).
** Change the logic that updates AnalysisContext in ResolveSubquery.
** Change ResolveSubquery to update NestedOuterAttrs when subqueries are resolved.
** Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries in the HAVING clause.
** Change UpdateOuterReferences to update NestedOuterAttrs as well.
* Add new error types and check analysis methods.
** Add a new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}} which prompts users to turn on the {{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} config for queries containing nested correlations.
** Add new check analysis methods to verify that the config is turned on for queries containing nested correlations.
** Add new check analysis methods to ensure the main query does not contain subqueries with nested outer attrs (see the second sketch below). NestedOuterAttrs.nonEmpty means the subquery contains outer references that cannot be resolved in the subquery or in its immediately containing query, but might be resolved in a query nested further out. This is not allowed for the main query, since it is the outermost query.

h2. Why is the change needed?

Currently the config is set to false by default, as the optimizer changes will come in later PRs. The behavior of lateral subqueries is not changed; we do not allow nested correlations in lateral subqueries for now.

Spark currently supports only one layer of correlation and does not support nested correlation. For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == MAX(t1.col2)
)
GROUP BY col1;

is supported, and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == ( SELECT MAX(t1.col2) )
)
GROUP BY col1;

is not supported. The reason Spark does not support this is that the Analyzer and Optimizer resolve and plan subqueries recursively. This task adds Analyzer support for queries containing nested correlations.

h2. High Level Design

[https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]
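To make the intended AnalysisContext change concrete, here is a minimal, self-contained Scala sketch. It does not use Spark's actual classes; the stand-in types, field names, the helper method, and the innermost-first ordering are simplified assumptions for illustration only.

{code:scala}
// Simplified stand-ins for illustration; the real classes live in Spark's Catalyst analyzer.
trait LogicalPlan { def name: String }
case class Relation(name: String) extends LogicalPlan

// Before (roughly): at most one outer plan is visible while resolving a subquery.
case class AnalysisContextBefore(outerPlan: Option[LogicalPlan] = None)

// After (sketch): keep every enclosing query block, innermost first, so an outer
// reference that the immediate outer query cannot resolve can still be matched
// against a plan further out.
case class AnalysisContextAfter(outerPlans: Seq[LogicalPlan] = Nil) {
  // Entering one more level of subquery nesting pushes the new outer plan.
  def withOuterPlan(p: LogicalPlan): AnalysisContextAfter =
    copy(outerPlans = p +: outerPlans)
}

object OuterPlanSketch {
  def main(args: Array[String]): Unit = {
    val ctx = AnalysisContextAfter()
      .withOuterPlan(Relation("t1")) // outermost query block
      .withOuterPlan(Relation("t2")) // intermediate subquery
    // While resolving the innermost subquery, both t2 and t1 are candidate outer plans.
    println(ctx.outerPlans.map(_.name)) // List(t2, t1)
  }
}
{code}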
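Likewise, a hedged sketch of the proposed check-analysis behavior. The error name and config key are taken from this issue; everything else (types, helper names, the use of IllegalArgumentException) is a hypothetical simplification, not Spark's actual CheckAnalysis code.

{code:scala}
// Hypothetical, simplified model of a resolved subquery expression: it records the
// outer references that could not be resolved by the subquery itself or by its
// immediately containing query (the "nested" outer attributes).
case class SubqueryInfo(nestedOuterAttrs: Seq[String])

object NestedCorrelationChecks {
  // Config key as named in this issue; the plumbing around it here is invented.
  val confKey = "spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled"

  def check(
      subqueries: Seq[SubqueryInfo],
      nestedCorrelationEnabled: Boolean,
      isMainQuery: Boolean): Unit = {
    val hasNestedRefs = subqueries.exists(_.nestedOuterAttrs.nonEmpty)
    if (hasNestedRefs && !nestedCorrelationEnabled) {
      // Mirrors the proposed NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED error.
      throw new IllegalArgumentException(
        s"NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED: set $confKey to true")
    }
    if (isMainQuery && hasNestedRefs) {
      // The outermost query has no further enclosing query that could resolve the
      // leftover outer references, so this must always be rejected.
      throw new IllegalArgumentException(
        "The main query must not contain subqueries with nested outer attributes")
    }
  }

  def main(args: Array[String]): Unit = {
    // A subquery that still needs t1.col2 from a query two levels out:
    val nested = SubqueryInfo(nestedOuterAttrs = Seq("t1.col2"))
    // Allowed: an intermediate query block may carry nested outer attrs when the config is on.
    check(Seq(nested), nestedCorrelationEnabled = true, isMainQuery = false)
    // Rejected (would throw): the main query can never carry them.
    // check(Seq(nested), nestedCorrelationEnabled = true, isMainQuery = true)
  }
}
{code}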
> Add analyzer support for nested correlated subqueries
> -----------------------------------------------------
>
>                 Key: SPARK-51885
>                 URL: https://issues.apache.org/jira/browse/SPARK-51885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major