Re: [PR] [SPARK-51256][SQL] Increase parallelism if joining with small bucket table [spark]

via GitHub Sun, 23 Feb 2025 17:47:44 -0800


wangyum commented on PR #50004:
URL: https://github.com/apache/spark/pull/50004#issuecomment-2677280562


   ```scala
   spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
   spark.range(10000000).selectExpr("id", "id + 1 as new_id").write.saveAsTable("t1")
   spark.range(10).selectExpr("id").write.bucketBy(1, "id").saveAsTable("t2")
   spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("cost")
   ```
   Spark 3.2:
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- SortMergeJoin [id#21L], [id#23L], Inner
      :- Sort [id#21L ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#21L, 200), ENSURE_REQUIREMENTS, [plan_id=50]
      :     +- Filter isnotnull(id#21L)
      :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint,new_id:bigint>
      +- Sort [id#23L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#23L, 200), ENSURE_REQUIREMENTS, [plan_id=57]
            +- Filter isnotnull(id#23L)
               +- FileScan parquet default.t2[id#23L] Batched: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   ```
   
   Spark 3.3 and later:
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- SortMergeJoin [id#21L], [id#23L], Inner
      :- Sort [id#21L ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#21L, 1), ENSURE_REQUIREMENTS, [plan_id=51]
      :     +- Filter isnotnull(id#21L)
      :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint,new_id:bigint>
      +- Sort [id#23L ASC NULLS FIRST], false, 0
         +- Filter isnotnull(id#23L)
            +- FileScan parquet default.t2[id#23L] Batched: true, Bucketed: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
