SortMerge Join on partitioned column causes shuffle

lsn24 Tue, 26 Mar 2019 17:29:54 -0700

Hello,

 We got two datasets thats been persisted as follows:


Dataset A:
datasetA.repartition(5,datasetA.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable( database.tableA));

Dataset B:
datasetB.repartition(5,datasetB.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable( database.tableB));


When I do a  join with region and studentId , I see shuffle. If I do join
just with the bucketed column studentId, there is NO shuffle as expected.
Below is the join query.

spark.sql("Select *  from  database.tableA").join(spark.sql("Select *  from  
database.tableB "), Seq("studentId","region")).show(10)

What could be the reason for the shuffle when we include the partitionkey
and how can we mitigate it ?

Thanks






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

SortMerge Join on partitioned column causes shuffle

Reply via email to