[ https://issues.apache.org/jira/browse/SPARK-51884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Qi updated SPARK-51884:
-----------------------------
    Description:
Newly added argument for SubqueryExpression: OuterScopeAttrs.

`OuterScopeAttrs` contains attributes that cannot be resolved in the subquery itself or in its containing query, but can be resolved in the whole query plan.

h2. The task design includes:
* Add `OuterScopeAttrs` and the related getter and setter methods to SubqueryExpression
* All attributes in `OuterScopeAttrs` must be contained in the `OuterAttrs` AttributeSet of SubqueryExpression
* Update the usages of SubqueryExpression and of the classes extending SubqueryExpression
* Update SubqueryExpression.references to be `AttributeSet(outerAttrs) -- AttributeSet(outerScopeAttrs)`

h2. Why do we need the change?
Spark currently supports only one layer of correlation and does not support nested correlation.
For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == MAX(t1.col2)
)
GROUP BY col1;

is supported and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == ( SELECT MAX(t1.col2) )
)
GROUP BY col1;

is not supported.

Spark does not support it because the Analyzer and Optimizer resolve and plan subqueries recursively.
The definition change for SubqueryExpression adds the metadata OuterScopeAttrs, which helps later rewrites in the Analyzer and Optimizer.

h2. High Level Design
[https://docs.google.com/document/d/1EGB48ArLQ04OZvb-zx_VVTJ8roIoCPwSRw4vTVuDY7o/edit?usp=sharing]

  was:
Newly added argument for SubqueryExpression: OuterScopeAttrs.

`OuterScopeAttrs` contains attributes that cannot be resolved in the subquery itself or in its containing query, but can be resolved in the whole query plan.

h2. The task design includes:
* Add `OuterScopeAttrs` and the related getter and setter methods to SubqueryExpression
* All attributes in `OuterScopeAttrs` must be contained in the `OuterAttrs` AttributeSet of SubqueryExpression
* Update the usages of SubqueryExpression and of the classes extending SubqueryExpression
* Update SubqueryExpression.references to be `AttributeSet(outerAttrs) -- AttributeSet(outerScopeAttrs)`

h2. Why do we need the change?
Spark currently supports only one layer of correlation and does not support nested correlation.
For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == MAX(t1.col2)
)
GROUP BY col1;

is supported and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2)
WHERE EXISTS (
  SELECT col1 FROM VALUES (1, 2) t2 (col1, col2)
  WHERE t2.col2 == ( SELECT MAX(t1.col2) )
)
GROUP BY col1;

is not supported.

Spark does not support it because the Analyzer and Optimizer resolve and plan subqueries recursively.
The definition change for SubqueryExpression adds the metadata OuterScopeAttrs, which helps later rewrites in the Analyzer and Optimizer.

> Subquery Definition Changes For Adding Support For Nested Correlated Subqueries
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-51884
>                 URL: https://issues.apache.org/jira/browse/SPARK-51884
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.0
>            Reporter: Avery Qi
>            Priority: Major
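
The subset invariant and the `references` rule from the task design can be illustrated with plain set arithmetic. The sketch below is a toy Python model, not Spark code: `SubqueryModel` and its field names are hypothetical stand-ins for SubqueryExpression, and the assignment of `t1.col2` to each subquery is one reading of the description rather than output from the Analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubqueryModel:
    """Hypothetical toy stand-in for SubqueryExpression (not Spark's class)."""

    name: str
    outer_attrs: frozenset
    outer_scope_attrs: frozenset = frozenset()

    def __post_init__(self):
        # Invariant from the task design: every attribute in
        # outerScopeAttrs must also be contained in outerAttrs.
        if not self.outer_scope_attrs <= self.outer_attrs:
            raise ValueError("outer_scope_attrs must be a subset of outer_attrs")

    @property
    def references(self):
        # Models AttributeSet(outerAttrs) -- AttributeSet(outerScopeAttrs).
        return self.outer_attrs - self.outer_scope_attrs


# Assumed reading of the unsupported query: the innermost scalar subquery
# uses t1.col2, which its containing query (over t2) cannot resolve.
scalar = SubqueryModel(
    "SELECT MAX(t1.col2)",
    outer_attrs=frozenset({"t1.col2"}),
    outer_scope_attrs=frozenset({"t1.col2"}),
)

# The EXISTS subquery's containing query (over t1) can resolve t1.col2,
# so in this model it is an ordinary outer attribute.
exists = SubqueryModel("EXISTS (...)", outer_attrs=frozenset({"t1.col2"}))

print(sorted(scalar.references))  # []
print(sorted(exists.references))  # ['t1.col2']
```

In this model the innermost subquery reports no references to its containing query, while the enclosing EXISTS subquery still reports `t1.col2` to the outermost query, which is the level that can actually resolve it.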