Avery Qi created SPARK-51885:
--------------------------------

             Summary: Add analyzer support for nested correlated subqueries
                 Key: SPARK-51885
                 URL: https://issues.apache.org/jira/browse/SPARK-51885
             Project: Spark
          Issue Type: Sub-task
          Components: Optimizer, SQL
    Affects Versions: 4.1.0
            Reporter: Avery Qi


* Add support for queries containing nested correlations in multi-pass analyzer.
 ** Change the AnalysisContext.outerPlan from LogicalPlan to LogicalPlans, 
containing all the outer plans outer references might refer to.
 ** Change the update AnalysisContext logic in ResolveSubquery.
 ** Change ResolveSubquery to update NestedOuterAttrs when subquery are 
resolved.
 ** Change ResolveAggregateFunction to update NestedOuterAttrs for subquery in 
the having clause.
 ** Change UpdateOuterReferences to update NestedOuterAttrs as well.
 * Add new error types and check analysis methods.
 ** Add new error type {{NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED}} which 
prompts users to turn on 
{{spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled}} configs for 
queries containing nested correlations.
 ** Add new check analysis methods to check if the config is turned on for 
queries containing nested correlations.
 ** Add new check analysis methods to ensure main query does not contain 
subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means that 
subquery contains outer references can't be resolved in the subquery or the 
containing query of the subquery, but might be resolved in nested outer 
queries. This is not allowed for the main query as it is the outer most query.)

Currently the config is set to false by default as the optimizer changes would 
be in later prs.
And the behavior of lateralSubquery is not changed. We don't allow nested 
correlations in lateralSubquery for now.



Spark only supports one layer of correlation now and does not support nested 
correlation.
For example,
SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS ( SELECT col1 FROM 
VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == MAX(t1.col2)
)GROUP BY col1;
 
is supported and
SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS ( SELECT col1 FROM 
VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == (   SELECT MAX(t1.col2)
 )
)GROUP BY col1;
 
is not supported.

The reason spark does not support it is because the Analyzer and Optimizer 
resolves and plans Subquery in a recursive way.

This task is for adding Analyzer support for queries containing nested 
correlations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to