Re: [PR] [SPARK-50983][SQL]Part 1.a Add outer scope attributes for SubqueryExpression [spark]

via GitHub Tue, 22 Apr 2025 11:07:53 -0700


AveryQi115 commented on code in PR #50285:
URL: https://github.com/apache/spark/pull/50285#discussion_r2054604668



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/FunctionTableSubqueryArgumentExpression.scala:
##########
@@ -46,6 +46,10 @@ import org.apache.spark.sql.types.DataType
  *             relation or as a more complex logical plan in the event of a table subquery.
  * @param outerAttrs outer references of this subquery plan, generally empty since these table
  *                   arguments do not allow correlated references currently
+ * @param outerScopeAttrs outer references of the subquery plan that cannot be resolved by the

Review Comment:
   Two reasons:
   1. `SubqueryExpression.references` is defined as `outerAttrs`, and these 
references are used in many places in the Spark planner/optimizer. We check 
whether the references can be resolved from the input of the operator that 
contains the subquery. If not, the operator/subquery becomes unresolved. 
`outerScopeAttrs` need to be removed from these references because they cannot 
be resolved from the operator's input. So we need this metadata, and we change 
the references of `SubqueryExpression` to `AttributeSet(outerAttrs) -- 
AttributeSet(nestedOuterAttrs)`. That change is made in the part 1.b PR.
   
   2. To add nested-correlation support to the optimizer safely.
   This is motivated by safety concerns and by some legacy aspects of the 
optimizer design.
   
   The decorrelation framework in the optimizer currently supports one layer 
of decorrelation and is not designed for nested correlations. Changing it to 
support nested correlations would be hard, but completely removing it and 
replacing it with the nested-correlation handling framework might affect 
current Spark users.
   
   To add this new feature safely, we want to maintain two decorrelation 
frameworks for now. They are fairly similar, so the maintenance work should be 
easy. Whether the subquery contains `outerScopeAttrs` tells the optimizer which 
decorrelation framework to use.
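
   A minimal sketch of that dispatch, using simplified stand-in types rather 
than Spark's actual classes. `OuterAttr`, `SubqueryMeta`, `Decorrelation`, and 
`DecorrelationChooser` are hypothetical names made up for illustration; only 
the idea of branching on whether `outerScopeAttrs` is non-empty comes from the 
comment above.

```scala
// Hypothetical stand-ins for illustration; not Spark's real classes or rules.
final case class OuterAttr(name: String, exprId: Long)

// Subquery metadata as recorded by the analyzer (simplified).
final case class SubqueryMeta(
    outerAttrs: Seq[OuterAttr],
    outerScopeAttrs: Seq[OuterAttr])

sealed trait Decorrelation
// The existing framework that handles one layer of correlation.
case object SingleLevelDecorrelation extends Decorrelation
// The new framework that handles nested correlations.
case object NestedDecorrelation extends Decorrelation

object DecorrelationChooser {
  // A subquery with any outer-scope attributes goes to the nested framework;
  // everything else keeps using the existing single-level one.
  def choose(sub: SubqueryMeta): Decorrelation =
    if (sub.outerScopeAttrs.nonEmpty) NestedDecorrelation
    else SingleLevelDecorrelation
}
```

   Under this assumption, subqueries without outer-scope references never 
reach the new code path, which is what keeps existing queries unaffected.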
   
   In the optimizer it is very hard to determine whether an outer reference 
can be resolved in the containing query or is an outer-scope outer reference. 
Due to some existing bugs in DeduplicateRelations and InlineCTE, we might have 
duplicated exprIds across subquery plans in the optimizer, so the optimizer 
cannot get correct information about where an outer reference comes from. That 
is why we need to record this metadata in the analyzer phase.
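
   To illustrate reason 1, here is a rough sketch of the reference 
computation using plain Scala sets instead of Catalyst's `AttributeSet`. 
`AttrRef`, `SubquerySketch`, and `ReferencesSketch` are hypothetical names, 
and the sketch assumes that `outerScopeAttrs` is a subset of `outerAttrs`.

```scala
// Simplified stand-ins; the real code uses Catalyst's AttributeSet.
final case class AttrRef(name: String, exprId: Long)

final case class SubquerySketch(
    outerAttrs: Set[AttrRef],
    outerScopeAttrs: Set[AttrRef]) {
  // Only the references that the containing operator must resolve from its
  // own input: all outer references minus the outer-scope ones.
  def references: Set[AttrRef] = outerAttrs -- outerScopeAttrs
}

object ReferencesSketch {
  val a: AttrRef = AttrRef("t1.a", 1L) // comes from an outer scope
  val b: AttrRef = AttrRef("t2.b", 2L) // comes from the containing operator

  val sub: SubquerySketch =
    SubquerySketch(outerAttrs = Set(a, b), outerScopeAttrs = Set(a))

  // Attributes produced by the containing operator's input.
  val operatorInput: Set[AttrRef] = Set(b)

  // With the outer-scope attribute excluded, the resolution check passes.
  val resolved: Boolean = sub.references.subsetOf(operatorInput)

  // Without the exclusion, the same check would fail on t1.a.
  val resolvedWithoutExclusion: Boolean = sub.outerAttrs.subsetOf(operatorInput)
}
```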



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

