Tom-Newton commented on PR #43760: URL: https://github.com/apache/spark/pull/43760#issuecomment-2479390482
> The fix addresses the issue by disabling coalescing in InMemoryTableScan for shuffles in the final stage.

This PR seems to indicate that there will be a correctness bug if we allow coalescing in the final stage. Is this still true after https://github.com/apache/spark/pull/43435, which seems to fix the same issue? I have not noticed any correctness issues in this area, but I have a use case where this PR causes a major performance regression.

I notice that https://github.com/apache/spark/pull/45054 brings back the option to enable coalescing in InMemoryTableScan for shuffles in the final stage. If I enable it, my performance problem is resolved, but will I be at risk of the correctness bug again?

<details>
<summary>Details on my performance regression case</summary>

The regression appears when `spark.sql.shuffle.partitions` or `spark.sql.adaptive.coalescePartitions.initialPartitionNum` is configured to a large number; we currently use 8192. The following code ends up using 8192 partitions and is really slow as a result.

```python
from pyspark.sql import functions as sf

df = spark.createDataFrame(
    [
        {"group": "group0", "df1_column": 1},
        {"group": "group0", "df1_column": 1},
    ]
)
df = df.groupBy("group").agg(sf.max("df1_column"))
df.cache()
df.explain()
df.show()
```

With this PR: 40 seconds. Without this PR: 2 seconds.

Maybe it's our mistake to use `spark.sql.adaptive.coalescePartitions.initialPartitionNum: 8192`, but I don't see any generic way to configure this appropriately. My strategy has been to make it bigger than we would ever need and rely on AQE to coalesce to something sensible, but my understanding could be lacking.

</details>

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
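For context, the partition-count settings discussed above can be pinned in `spark-defaults.conf`. This is only an illustrative sketch of the configuration described in the comment, with placeholder values; the right numbers depend on the workload:

```
# Illustrative spark-defaults.conf fragment (values are examples, not recommendations).
# Upper bound on partitions before AQE coalescing kicks in:
spark.sql.adaptive.coalescePartitions.initialPartitionNum    8192
# Fallback shuffle parallelism when AQE does not apply:
spark.sql.shuffle.partitions                                 8192
# AQE partition coalescing must be enabled for the strategy above to work:
spark.sql.adaptive.enabled                                   true
spark.sql.adaptive.coalescePartitions.enabled                true
```

The strategy the comment describes is to overshoot `initialPartitionNum` and let AQE shrink the partition count at runtime; this PR's change means that shrinking no longer happens for the final stage feeding an `InMemoryTableScan`, which is why the large initial value becomes the actual value.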