Tom-Newton commented on PR #43760:
URL: https://github.com/apache/spark/pull/43760#issuecomment-2479390482

   > The fix addresses the issue by disabling coalescing in InMemoryTableScan 
for shuffles in the final stage.
   
   This PR seems to indicate that there will be a correctness bug if we allow 
coalescing in the final stage. Is this still true after 
https://github.com/apache/spark/pull/43435 which seems like its fixing the same 
issue? I have not noticed any correctness issues in this area but I have got a 
usecase where this PR causes a major performance regression. 
   
   I notice that https://github.com/apache/spark/pull/45054 brings back the 
option to enable coalescing in InMemoryTableScan for shuffles in the final 
stage. If I do this my performance problem is resolved but will I be at risk of 
the correctness bug again?
   
   <details>
     <summary>Details on my performance regression case</summary>
   
     If you configure `spark.sql.shuffle.partitions` or 
`spark.sql.adaptive.coalescePartitions.initialPartitionNum` to a large number, 
we currently use 8192. The following code ends up using 8192 partitions and is 
really slow as a result.
     ```python
           df = spark.createDataFrame(
               [
                   {
                       "group": "group0",
                       "df1_column": 1,
                   },
                   {
                       "group": "group0",
                       "df1_column": 1,
                   },
               ]
           )
           df = df.groupBy("group").agg(sf.max("df1_column"))
           df.cache()
           df.explain()
           df.show()
     ```
   With this PR: 40 seconds 
   Without this PR: 2 seconds
   
   Maybe its our mistake using 
`spark.sql.adaptive.coalescePartitions.initialPartitionNum: 8192` but I don't 
really see any generic way to configure this appropriately. My strategy has 
just been make it bigger than we would ever need and rely on AQE to coalesce to 
something sensible, but my understanding could be lacking. 
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to