Re: [PR] [SPARK-51064][SQL] Enable `spark.sql.sources.v2.bucketing.enabled` by default [spark]

via GitHub Mon, 07 Apr 2025 15:29:36 -0700


guangyu-yang-rokt commented on PR #49766:
URL: https://github.com/apache/spark/pull/49766#issuecomment-2784772338


   Thanks @szehon-ho! One follow-up question: in our query, we only filter on 
the timestamp column, but the join key is something different (we join on 
non-partition keys). I have checked that BatchScanExec reports 
`groupedBy=[timestamp_day]` in the query plan. I'm not too familiar with the 
Spark codebase, but my guess is that filter pushdown to Iceberg also tells 
BatchScanExec to group by the partition key whenever there is a filter on it. 
With `spark.sql.sources.v2.bucketing.enabled` set to true, this slows down 
BatchScan for joins that are not on partition keys. (We have a 
self-implemented feature store that spins up multiple joins to gather 
features, so I need to enable all SPJ-related configs globally.)
   
   This doesn't quite make sense to me: since I'm not joining on timestamp, I 
would expect SPJ not to kick in. Alternatively, I could imagine a configuration 
like `spark.sql.sources.v2.ignoreFiltering` that tells BatchScanExec not to 
group by the partition key when it is used only in a filter and not as a join 
key.
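
   To make the expected behavior concrete, here is a minimal sketch of the rule 
I have in mind. It is plain Python, not Spark code, and it is not how Spark 
currently decides. The function name and the `user_id` join column are 
hypothetical, used only for illustration.

```python
from typing import Iterable


def scan_should_group_by_partition(
    partition_keys: Iterable[str], join_keys: Iterable[str]
) -> bool:
    """Hypothetical rule: report a grouped scan only when at least one
    partition key is also used as a join key."""
    return bool(set(partition_keys) & set(join_keys))


# Filter on timestamp_day, join on a different (assumed) column:
# under this rule the scan would not be grouped.
print(scan_should_group_by_partition(["timestamp_day"], ["user_id"]))  # False

# Join on the partition key itself: grouping would still apply.
print(scan_should_group_by_partition(["timestamp_day"], ["timestamp_day"]))  # True
```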
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org
