Re: [PR] [SPARK-51964][SQL] Correctly resolve attributes from hidden output in ORDER BY and HAVING on top of an Aggregate in single-pass Analyzer [spark]

via GitHub Wed, 30 Apr 2025 06:58:55 -0700


mihailoale-db commented on code in PR #50769:
URL: https://github.com/apache/spark/pull/50769#discussion_r2068621838



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala:
##########
@@ -329,4 +332,17 @@ class AggregateResolver(operatorResolver: Resolver, 
expressionResolver: Expressi
       .candidates
       .isEmpty
   }
+
+  private def getGroupungAttributeIds(aggregate: Aggregate): HashSet[ExprId] = 
{
+    val groupingAttributeIds = new 
HashSet[ExprId](aggregate.groupingExpressions.size)
+    aggregate.groupingExpressions.foreach { rootExpression =>
+      rootExpression.foreach {
+        case attribute: AttributeReference =>
+          groupingAttributeIds.add(attribute.exprId)
+        case _ =>
+      }
+    }
+
+    groupingAttributeIds
+  }

Review Comment:
   Is there a reason why we introduce another pass to collect those attributes?
   This could easily be done in single-pass manner (by adding resolved 
attributes from `ExpressionResolver` to some HashSet if 
`resolvingGroupingExpressions` flag is set in current 
`ExpressionResolutionContext`).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -403,6 +414,22 @@ class NameScope(
    *
    * {{ SELECT col1 + col2 AS a FROM VALUES (1, 2) GROUP BY a; }}}
    *
+   * In case we are resolving names in expression trees from HAVING or ORDER 
BY on top of
+   * [[Aggregate]], we are able to resolve hidden attributes only if theose 
are present in

Review Comment:
   either 
   ```suggestion
      * In case we are resolving names in expression trees from [[Having]] or 
[[Sort]] on top of
      * [[Aggregate]], we are able to resolve hidden attributes only if theose 
are present in
   ```
   or
   ```suggestion
      * In case we are resolving names in expression trees from HAVING or ORDER 
BY on top of
      * GROUP BY, we are able to resolve hidden attributes only if theose are 
present in
   ```
   sound better (for consistency).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -712,13 +745,21 @@ class NameScopeStack extends SQLConfHelper {
    *  was nullable in either old hidden output or new output, it must stay 
nullable in new hidden
    *  output as well.
    */
-  def overwriteOutputAndExtendHiddenOutput(output: Seq[Attribute]): Unit = {
+  def overwriteOutputAndExtendHiddenOutput(

Review Comment:
   Please update scala doc here (instead of `updates nullabilities` it should 
be `updates properties` and a bit more than that).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -403,6 +414,22 @@ class NameScope(
    *
    * {{ SELECT col1 + col2 AS a FROM VALUES (1, 2) GROUP BY a; }}}
    *
+   * In case we are resolving names in expression trees from HAVING or ORDER 
BY on top of
+   * [[Aggregate]], we are able to resolve hidden attributes only if theose 
are present in

Review Comment:
   typo
   ```suggestion
      * In case we are resolving names in expression trees from HAVING or ORDER 
BY on top of
      * [[Aggregate]], we are able to resolve hidden attributes only if those 
are present in
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -848,12 +889,14 @@ class NameScopeStack extends SQLConfHelper {
       multipartName: Seq[String],
       canLaterallyReferenceColumn: Boolean = true,
       canReferenceAggregateExpressionAliases: Boolean = false,
-      canResolveNameByHiddenOutput: Boolean = false): NameTarget = {
+      canResolveNameByHiddenOutput: Boolean = false,
+      canReferenceAggregatedAccessOnlyAttributes: Boolean = false): NameTarget 
= {

Review Comment:
   Could you add a small example here that demonstrates aggregated access only 
behavior?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -403,6 +414,22 @@ class NameScope(
    *
    * {{ SELECT col1 + col2 AS a FROM VALUES (1, 2) GROUP BY a; }}}
    *
+   * In case we are resolving names in expression trees from HAVING or ORDER 
BY on top of
+   * [[Aggregate]], we are able to resolve hidden attributes only if theose 
are present in
+   * grouping expressions, or if the reference itself is under an 
[[AggregateExpression]].

Review Comment:
   Please add a query example for the case where sort attribute is present in 
grouping expressions e.g
   ```
   --this passes
   SELECT COUNT(col1) FROM t1 GROUP BY col1, col2 ORDER BY col2;
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -918,20 +962,45 @@ class NameScopeStack extends SQLConfHelper {
   }
 
   /**
-   * When the scope gets the new output, we need to refresh nullabilities in 
its `hiddenOutput`. If
-   * an attribute is nullable in either old hidden output or new output, it 
must remain nullable in
-   * new hidden output as well.
+   * Update attribute properties when overwriting the current outputs.
+   *
+   * 1. When the scope gets the new output, we need to refresh nullabilities 
in its `hiddenOutput`.
+   * If an attribute is nullable in either old hidden output or new output, it 
must remain nullable
+   * in new hidden output as well.
+   *
+   * 2. If we are updating the hidden output on top of an [[Aggregate]], 
HAVING and ORDER BY clauses

Review Comment:
   Same as above



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/FunctionResolver.scala:
##########
@@ -77,9 +77,14 @@ class FunctionResolver(
 
   private val typeCoercionResolver: TypeCoercionResolver =
     new TypeCoercionResolver(timezoneAwareExpressionResolver)
+  private val expressionResolutionContextStack =
+    expressionResolver.getExpressionResolutionContextStack
 
   /**
    * Main method used to resolve an [[UnresolvedFunction]]. It resolves it in 
the following steps:
+   *  - Check if the `unresolvedFunction` is an aggregate expression. Set

Review Comment:
   ```suggestion
      *  - Check if the `unresolvedFunction` is an [[AggregateExpression]]. Set
   ```
   Otherwise it's easy to confuse `AggregateExpression` with one of the 
`Aggregate.aggregateExpressions`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51964][SQL] Correctly resolve attributes from hidden output in ORDER BY and HAVING on top of an Aggregate in single-pass Analyzer [spark]

Reply via email to