ashutoshcipher opened a new issue, #10359:
URL: https://github.com/apache/incubator-gluten/issues/10359
### Description
## Overview
Native write paths in Spark 3.2/3.3 and the ClickHouse MergeTree writer
recompute the `__bucket_value__` expression even when a precomputed
attribute already exists. This adds unnecessary overhead and complicates
downstream projections.
## Steps to Reproduce
- Write bucketed data using the native writer.
- Inspect the execution plan; the bucket ID is projected multiple times
instead of reusing a single attribute.
## Expected Behavior
The bucket ID should be computed once (e.g., in an initial `ProjectExec`)
and reused by subsequent stages.
## Actual Behavior
Each affected write stage re-evaluates the bucket expression instead of
reading the existing `__bucket_value__` attribute, which produces redundant
projections and extra CPU work.
## Impact
- Increased CPU time for bucketed writes.
- Harder-to-read execution plans with repetitive projections.
## Proposed Fix
- Guard projections in Spark 3.2/3.3 shims and ClickHouse MergeTree writer
so they append `__bucket_value__` only when missing.
- Store the computed bucket ID in an attribute and reuse it downstream.
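
The guard described in the proposed fix can be sketched in a few lines. This is a dependency-free Python model, not Gluten or Spark code: the `Column` class, the `ensure_bucket_column` helper, and the textual bucket expression are all hypothetical names introduced for illustration, and the real shims operate on Catalyst attributes in Scala. It only shows the intended behaviour: append `__bucket_value__` when it is missing, otherwise reuse the existing attribute.

```python
from dataclasses import dataclass
from typing import List, Tuple

BUCKET_COL = "__bucket_value__"


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a plan output attribute."""
    name: str
    expr: str  # textual placeholder for the expression that produces the column


def ensure_bucket_column(
    output: List[Column], bucket_expr: str
) -> Tuple[List[Column], bool]:
    """Return (projection, appended).

    The bucket column is appended only when no column named
    __bucket_value__ is already present in the child output.
    """
    if any(c.name == BUCKET_COL for c in output):
        # Already computed upstream: pass the columns through unchanged.
        return list(output), False
    return list(output) + [Column(BUCKET_COL, bucket_expr)], True


# Illustrative expression text only; the real bucket expression is built by Spark.
expr = "pmod(hash(id), 8)"

child = [Column("id", "id"), Column("name", "name")]
first, appended_first = ensure_bucket_column(child, expr)
second, appended_second = ensure_bucket_column(first, expr)
```

In this model the first call appends the bucket column and the second call is a no-op, which corresponds to later stages reading the stored attribute rather than projecting the expression again.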
### Gluten version
None