guangyu-yang-rokt commented on PR #49766: URL: https://github.com/apache/spark/pull/49766#issuecomment-2783546044
Hi @dongjoon-hyun, sorry if this question is a bit unrelated to this PR.

Context: we are currently introducing SPJ to our production environment. Our Iceberg table is partitioned by timestamp with the day transform, and our ML processing job reads the past 30 days of data with a filter on the timestamp column that is pushed down to Iceberg, so Iceberg reports 30 partitions. I have observed that with `spark.sql.sources.v2.bucketing.enabled`, Spark then generates one task per partition, which in our case means only 30 tasks during the batch scan. This leads to cluster resource under-utilization, since we have 40 executors with 15 cores each (so up to 600 tasks in parallel), and it hurts the batch scan performance a lot: the same stage went from 2.4 minutes to 10+ minutes.

Have you encountered the same issue in your use case? I must be missing something here. Any insights would be much appreciated!
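For reference, here is a minimal sketch of the setup described above, assuming Spark 3.4+ with the Iceberg runtime on the classpath and an Iceberg catalog configured; the catalog, table, and column names (`demo.db.events`, `ts`, `payload`) are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spj-scan-underutilization")
  // Enable storage-partitioned join support over V2 data sources.
  .config("spark.sql.sources.v2.bucketing.enabled", "true")
  .getOrCreate()

// Iceberg table partitioned with the day transform, as in the report.
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (
    ts      TIMESTAMP,
    payload STRING
  ) USING iceberg
  PARTITIONED BY (days(ts))
""")

// Read the last 30 days; the ts predicate is pushed down to Iceberg,
// which reports 30 partitions. With v2 bucketing enabled, the scan
// keeps one task per reported partition (~30 tasks), far fewer than
// the 600 task slots available on the cluster.
val last30Days = spark.table("demo.db.events")
  .where("ts >= current_timestamp() - INTERVAL 30 DAYS")

last30Days.groupBy("payload").count().show()
```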