mihailotim-db opened a new pull request, #54641:
URL: https://github.com/apache/spark/pull/54641
### What changes were proposed in this pull request?
When `Expand` is created for `ROLLUP`/`CUBE`/`GROUPING SETS`, its output
contains duplicate-named attributes: the original pass-through child attribute
(e.g., `a#0`) and a new grouping instance created via `newInstance()` (e.g.,
`a#5`). Both share the same name, which causes `AMBIGUOUS_REFERENCE` errors
when any operator performs name-based resolution against the `Expand` output.
This PR tags pass-through child attributes with `__is_duplicate` metadata in
`Expand.apply()`, so that `AttributeSeq.getCandidatesForResolution`
deprioritizes them when multiple candidates match by name. This is the same
mechanism already used by `DeduplicateUnionChildOutput` for Union operators.
Only attributes whose `ExprId` matches a simple `Attribute` child of a
`groupByAlias` are tagged. Complex grouping expressions (e.g., `c1 + 1`)
produce aliases whose names differ from every child column, so no name conflict
arises. ExprId-based resolution (used for already-resolved expressions like
aggregate functions) is unaffected.
The fix is guarded behind a new internal config
`spark.sql.analyzer.expandTagPassthroughDuplicates` (default `true`).
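The tagging rule described above can be sketched with a small standalone model. This is not the code in this PR and does not use Catalyst: `Attr`, `Alias`, `Add`, `Lit`, and `tagPassThrough` are hypothetical stand-ins, the `isDuplicate` field stands in for the `__is_duplicate` metadata key, and the `enabled` parameter stands in for the new config flag.

```scala
object ExpandTaggingSketch {
  // Toy stand-ins for Catalyst expressions (assumptions, not Spark's classes).
  sealed trait Expr
  final case class Attr(name: String, exprId: Long, isDuplicate: Boolean = false) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String)

  /** Tags child attributes that are grouped on directly; `enabled` models the config flag. */
  def tagPassThrough(
      childOutput: Seq[Attr],
      groupByAliases: Seq[Alias],
      enabled: Boolean = true): Seq[Attr] = {
    if (!enabled) {
      childOutput
    } else {
      // Collect ExprIds of grouping expressions that are plain attributes.
      // Complex expressions such as `a + 1` contribute nothing here.
      val simpleIds = groupByAliases.collect { case Alias(a: Attr, _) => a.exprId }.toSet
      childOutput.map { a =>
        if (simpleIds.contains(a.exprId)) a.copy(isDuplicate = true) else a
      }
    }
  }
}
```

In this model, `ROLLUP(a)` tags only `a`, `ROLLUP(a + 1)` tags nothing, and passing `enabled = false` returns the child output unchanged. Tagging copies the attribute, so its name and `ExprId` are preserved.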
### Why are the changes needed?
The `Expand` operator for `ROLLUP`/`CUBE`/`GROUPING SETS` produces an output
like `[a#0, b#1, c#2, a#5, gid#3]` where `a#0` is the pass-through child
attribute and `a#5` is the new grouping attribute. Both have the name `"a"`.
When any operator above the `Expand` resolves the reference `"a"` by name
(e.g., a `Filter` or `Project` inserted by a custom analysis rule, or a
correlated subquery whose outer reference resolves against the `Expand`'s
output), `getCandidatesForResolution` returns two candidates, and `resolve()`
throws an `AMBIGUOUS_REFERENCE` error.
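The failure mode, and the effect of deprioritizing tagged attributes, can be illustrated with a simplified model of name-based resolution. This is a sketch under assumptions, not Spark's `AttributeSeq` implementation: `Attr`, `candidates`, and `resolve` are hypothetical, `isDuplicate` stands in for the `__is_duplicate` metadata, and names are matched by exact equality for simplicity.

```scala
object NameResolutionSketch {
  // Toy attribute: a name, an ExprId, and a flag modelling the duplicate tag.
  final case class Attr(name: String, exprId: Long, isDuplicate: Boolean = false)

  /** Name-matching candidates; tagged attributes are dropped only if untagged matches exist. */
  def candidates(output: Seq[Attr], name: String): Seq[Attr] = {
    val byName = output.filter(_.name == name)
    val preferred = byName.filterNot(_.isDuplicate)
    if (byName.size > 1 && preferred.nonEmpty) preferred else byName
  }

  /** Resolves a name to at most one attribute, or reports ambiguity. */
  def resolve(output: Seq[Attr], name: String): Either[String, Option[Attr]] =
    candidates(output, name) match {
      case Seq()       => Right(None)
      case Seq(single) => Right(Some(single))
      case _           => Left("AMBIGUOUS_REFERENCE")
    }
}
```

With the example output `[a#0, b#1, c#2, a#5, gid#3]`, resolving `"a"` in this model is ambiguous when nothing is tagged, and picks `a#5` once `a#0` carries the tag. Names with a single match, such as `"b"`, resolve the same way in both cases.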
### Does this PR introduce _any_ user-facing change?
No. The fix prevents a latent `AMBIGUOUS_REFERENCE` error in name-based
resolution against `Expand` output. Standard SQL queries are not affected
because the `Aggregate` above the `Expand` already shields upper operators from
seeing the duplicate names. The fix is defensive and makes the `Expand` output
safe for any future feature or custom rule that may resolve names against it.
### How was this patch tested?
7 new unit tests in `ResolveGroupingAnalyticsSuite`:
- **Tagging behavior (flag enabled, default):**
  - Tags pass-through attribute for simple single-column grouping
    (`ROLLUP(a)`)
  - Does not tag for complex grouping expressions (`ROLLUP(a + 1)`)
  - Tags multiple pass-through attributes for multi-column grouping
    (`ROLLUP(a, b)`)
  - Preserves `ExprId` and name on tagged attributes
  - Demonstrates that `resolve("a")` succeeds with tagging and throws
    `AMBIGUOUS_REFERENCE` without tagging
- **Flag disabled behavior:**
  - No tagging for single-column grouping; `resolve("a")` throws
    `AMBIGUOUS_REFERENCE`
  - No tagging for multi-column grouping; `resolve("a")` and `resolve("b")`
    both throw `AMBIGUOUS_REFERENCE`
All 9 pre-existing tests in `ResolveGroupingAnalyticsSuite` continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude claude-4.6-opus-high-thinking (Cursor)