Hello, We got two datasets thats been persisted as follows:
Dataset A: datasetA.repartition(5,datasetA.col("region")) .write().mode(saveMode) .format("parquet") .partitionBy("region") .bucketBy(5,"studentId") .sortBy("studentId") .option("path", parquetFilesDirectory) .saveAsTable( database.tableA)); Dataset B: datasetB.repartition(5,datasetB.col("region")) .write().mode(saveMode) .format("parquet") .partitionBy("region") .bucketBy(5,"studentId") .sortBy("studentId") .option("path", parquetFilesDirectory) .saveAsTable( database.tableB)); When I do a join with region and studentId , I see shuffle. If I do join just with the bucketed column studentId, there is NO shuffle as expected. Below is the join query. spark.sql("Select * from database.tableA").join(spark.sql("Select * from database.tableB "), Seq("studentId","region")).show(10) What could be the reason for the shuffle when we include the partitionkey and how can we mitigate it ? Thanks -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org