guangyu-yang-rokt commented on PR #49766: URL: https://github.com/apache/spark/pull/49766#issuecomment-2783546044
Hi @dongjoon-hyun, sorry if this question is a bit unrelated to this PR.

Context: we are currently introducing SPJ to our production environment. Our Iceberg table is partitioned by timestamp with the day transform, and our ML processing job reads the past 30 days of data with a filter on the timestamp column that is pushed down to Iceberg, so Iceberg reports 30 partitions. I have observed that with `spark.sql.sources.v2.bucketing.enabled`, Spark then generates one task per partition, which in our case means only 30 tasks during the batch scan. This leads to cluster resource under-utilization, since we have 40 executors with 15 cores each (so up to 600 tasks in parallel), and it hurts the batch scan performance a lot: the same stage went from 2.4 minutes to 10+ minutes.

Have you encountered the same issue in your use case? I must be missing something here. Any insights would be much appreciated!
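For reference, here is a minimal sketch of the setup described above, assuming Spark 3.4+ with the Iceberg runtime on the classpath and an Iceberg catalog configured; the catalog, table, and column names (`demo.db.events`, `ts`, `payload`) are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spj-scan-underutilization")
  // Enable storage-partitioned join support over V2 data sources.
  .config("spark.sql.sources.v2.bucketing.enabled", "true")
  .getOrCreate()

// Iceberg table partitioned with the day transform, as in the report.
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (
    ts      TIMESTAMP,
    payload STRING
  ) USING iceberg
  PARTITIONED BY (days(ts))
""")

// Read the last 30 days; the ts predicate is pushed down to Iceberg,
// which reports 30 partitions. With v2 bucketing enabled, the scan
// keeps one task per reported partition (~30 tasks), far fewer than
// the 600 task slots available on the cluster.
val last30Days = spark.table("demo.db.events")
  .where("ts >= current_timestamp() - INTERVAL 30 DAYS")

last30Days.groupBy("payload").count().show()
```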