guangyu-yang-rokt commented on PR #49766: URL: https://github.com/apache/spark/pull/49766#issuecomment-2784772338
Thanks @szehon-ho! One follow-up question: in our query we only filter on the timestamp column, but the join key is something different (we join on non-partition keys). I have checked that BatchScanExec reports `groupedBy=[timestamp_day]` in the query plan. I'm not too familiar with the Spark codebase, but my guess is that filter pushdown to Iceberg also tells BatchScanExec to group by the partition key whenever there is a filter on that key. With `spark.sql.sources.v2.bucketing.enabled` set to true, this slows down the batch scan for joins that do not join on partition keys. (We have a self-implemented feature store that spins up multiple joins to gather features, so I need to enable all SPJ-related configs globally.)

This doesn't quite make sense to me: since I'm not joining on timestamp, I would expect that SPJ shouldn't kick in. Alternatively, I could imagine a configuration like `spark.sql.sources.v2.ignoreFiltering` that tells BatchScanExec not to group by the partition key when that key appears only in a filter and not as a join key.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org
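
As a rough illustration of the per-join opt-out described in the comment, here is a minimal Python sketch of a user-side workaround: toggle the flag per join instead of enabling it globally. `spj_conf_for_join` is a hypothetical helper, not a Spark or Iceberg API. The only real config key used is the one quoted in the comment, and the column names are illustrative assumptions.

```python
# Config key quoted in the comment; everything else here is hypothetical.
SPJ_FLAG = "spark.sql.sources.v2.bucketing.enabled"


def spj_conf_for_join(join_keys, partition_keys):
    """Return session config overrides for a single join.

    Hypothetical helper: enables the flag only when at least one join key
    is also a partition key, i.e. when a storage-partitioned join could
    plausibly apply. A filter on a partition key alone does not count.
    """
    uses_partition_key = bool(set(join_keys) & set(partition_keys))
    return {SPJ_FLAG: "true" if uses_partition_key else "false"}


# Join on a non-partition key while the table is partitioned by day.
overrides = spj_conf_for_join(["user_id"], ["timestamp_day"])

# With a live SparkSession named `spark`, the overrides could be applied
# before running the join:
# for key, value in overrides.items():
#     spark.conf.set(key, value)
```

This only moves the decision into the calling code. It does not change how BatchScanExec reports its grouping, which is what the proposed configuration would address.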