[jira] [Updated] (SPARK-51884) Subquery Definition Changes For Adding Support For Nested Correlated Subqueries

Avery Qi (Jira) Wed, 23 Apr 2025 11:53:50 -0700


     [ https://issues.apache.org/jira/browse/SPARK-51884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Avery Qi updated SPARK-51884:
-----------------------------
    Description: 
A new argument is added to SubqueryExpression: `OuterScopeAttrs`.

`OuterScopeAttrs` contains the attributes that cannot be resolved in the 
subquery itself or in its containing query, but can be resolved in the whole 
query plan.
h2. The task design includes:
 * Add `OuterScopeAttrs` and related getter and setter methods for 
SubqueryExpression
 * All attributes in `OuterScopeAttrs` must be contained in the `OuterAttrs` 
AttributeSet of SubqueryExpression
 * Update all usages of SubqueryExpression and of the classes that extend 
SubqueryExpression
 * Update SubqueryExpression.references to be `AttributeSet(outerAttrs) -- 
AttributeSet(outerScopeAttrs)`
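
The bullets above can be illustrated with a small toy model. This is only a sketch in Python, not Spark's Scala implementation: the class `SubqueryExpr`, its field names, and the method `with_outer_scope_attrs` are hypothetical, and attributes are represented as plain strings. It assumes that, for the innermost subquery of a nested correlated query, the deeply correlated attribute appears in both sets and therefore drops out of `references`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubqueryExpr:
    """Toy stand-in for a subquery expression (hypothetical, not Spark's class)."""
    outer_attrs: frozenset
    outer_scope_attrs: frozenset = frozenset()

    def __post_init__(self):
        # Invariant from the task list: every outer-scope attribute
        # must also be contained in the outer attributes.
        if not self.outer_scope_attrs <= self.outer_attrs:
            raise ValueError("outer_scope_attrs must be a subset of outer_attrs")

    @property
    def references(self):
        # Mirrors AttributeSet(outerAttrs) -- AttributeSet(outerScopeAttrs).
        return self.outer_attrs - self.outer_scope_attrs

    def with_outer_scope_attrs(self, attrs):
        # "Setter" that returns a new immutable copy.
        return SubqueryExpr(self.outer_attrs, frozenset(attrs))


# Assumed reading: t1.col2 is resolvable in the query containing this subquery.
one_level = SubqueryExpr(frozenset({"t1.col2"}))
# Assumed reading: t1.col2 is only resolvable further out in the plan.
nested = one_level.with_outer_scope_attrs({"t1.col2"})

print(sorted(one_level.references))  # ['t1.col2']
print(sorted(nested.references))     # []
```

Under this reading, an expression only reports as references the attributes that its containing query can actually supply.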

h2. Why do we need the change?

Spark currently supports only one level of correlation and does not support 
nested correlation.
For example,

SELECT col1
FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1
  FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == MAX(t1.col2)
)
GROUP BY col1;

is supported, and

SELECT col1
FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1
  FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == (
    SELECT MAX(t1.col2)
  )
)
GROUP BY col1;

is not supported.

Spark does not support this because the Analyzer and Optimizer resolve and 
plan subqueries recursively.
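
To make the limitation concrete, here is a toy model of name lookup across nested query blocks. It is a sketch only and does not reflect Spark's Analyzer code: `Scope`, `resolve_one_level`, and `resolve_any_level` are hypothetical names, and the assumption is that one-level correlation means a subquery can see its own columns plus those of its immediate containing query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    """One query block: the columns it exposes and its enclosing block."""
    name: str
    columns: frozenset
    parent: Optional["Scope"] = None


def resolve_one_level(attr, scope):
    """Look in the block itself, then only in its immediate parent."""
    if attr in scope.columns:
        return scope.name
    if scope.parent is not None and attr in scope.parent.columns:
        return scope.parent.name
    return None


def resolve_any_level(attr, scope):
    """Walk every enclosing block until the attribute is found."""
    current = scope
    while current is not None:
        if attr in current.columns:
            return current.name
        current = current.parent
    return None


# Query blocks of the second (unsupported) example query.
t1 = Scope("t1", frozenset({"t1.col1", "t1.col2"}))
t2 = Scope("t2", frozenset({"t2.col1", "t2.col2"}), parent=t1)
innermost = Scope("scalar", frozenset(), parent=t2)

print(resolve_one_level("t1.col2", t2))         # t1
print(resolve_one_level("t1.col2", innermost))  # None
print(resolve_any_level("t1.col2", innermost))  # t1
```

In this model, `t1.col2` inside the innermost scalar subquery is exactly the kind of attribute the description assigns to `OuterScopeAttrs`: not resolvable in the subquery or its containing query, but resolvable in the whole plan.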

The definition change adds the `OuterScopeAttrs` metadata to 
SubqueryExpression, which supports later rewrites in the Analyzer and Optimizer.
h2. High Level Design

[https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]
 



> Subquery Definition Changes For Adding Support For Nested Correlated 
> Subqueries
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-51884
>                 URL: https://issues.apache.org/jira/browse/SPARK-51884
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major



