[I] Avoid duplicate bucket ID projection in native write paths [incubator-gluten]

via GitHub Mon, 04 Aug 2025 19:25:46 -0700


ashutoshcipher opened a new issue, #10359:
URL: https://github.com/apache/incubator-gluten/issues/10359


   ### Description
   
   ## Overview
   Native write paths in Spark 3.2/3.3 and the ClickHouse MergeTree writer
   recompute the `__bucket_value__` expression even when a precomputed
   attribute already exists. This adds unnecessary overhead and complicates
   downstream projections.
   
   ## Steps to Reproduce
   - Write bucketed data (for example, via `DataFrameWriter.bucketBy`) using
     the native writer on Spark 3.2 or 3.3.
   - Inspect the execution plan; the bucket ID is projected multiple times
     instead of reusing a single attribute.
   
   ## Expected Behavior
   The bucket ID should be computed once (e.g., in an initial `ProjectExec`)
   and reused by subsequent stages.
   
   ## Actual Behavior
   Each affected write path re-evaluates the bucket expression even when
   `__bucket_value__` is already present in its input, producing redundant
   projections and extra evaluation cost.
   
   ## Impact
   - Increased CPU time for bucketed writes.
   - Harder-to-read execution plans with repetitive projections.
   
   ## Proposed Fix
   - Guard projections in the Spark 3.2/3.3 shims and the ClickHouse MergeTree
     writer so they append `__bucket_value__` only when it is missing.
   - Store the computed bucket ID in an attribute and reuse it downstream.
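
The guard described above can be sketched roughly as follows. This is a minimal, framework-free illustration in Python, not Gluten or Spark code: `Column`, `ensure_bucket_column`, and `make_bucket_column` are hypothetical names introduced only for this example, and the only identifier taken from the issue is `__bucket_value__`.

```python
from dataclasses import dataclass
from typing import Callable, List

# Attribute name taken from the issue; everything else here is hypothetical.
BUCKET_COLUMN = "__bucket_value__"


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a named output attribute of a plan node."""
    name: str


def ensure_bucket_column(
    output: List[Column],
    make_bucket_column: Callable[[], Column],
) -> List[Column]:
    """Append the bucket column only when the input does not already carry it."""
    if any(col.name == BUCKET_COLUMN for col in output):
        # A precomputed attribute exists: reuse it instead of projecting again.
        return list(output)
    return list(output) + [make_bucket_column()]


# Count how often the (assumed expensive) bucket expression gets built.
build_calls = []


def make_bucket_column() -> Column:
    build_calls.append(1)
    return Column(BUCKET_COLUMN)


# The first stage adds the column; the second stage sees it and reuses it.
first_stage = ensure_bucket_column([Column("id"), Column("name")], make_bucket_column)
second_stage = ensure_bucket_column(first_stage, make_bucket_column)
```

In this sketch the bucket expression is built once and `second_stage` has the same columns as `first_stage`. A real fix would presumably apply the same check to the child plan's output attributes, but the exact shim and writer code is not shown in this issue.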
   
   
   ### Gluten version
   
   None

