https://issues.apache.org/jira/browse/SPARK-19256 is an active umbrella ticket for this feature.
But as of Spark 2.2, you can already bucketize DataFrames on write using the Hive-compatible bucketing APIs. For example:

```scala
val bucketCount = 100

df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")

df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")
```

(Note that `bucketBy` and `sortBy` are defined on `DataFrameWriter`, so the `.write` step is required.)

If you then join table_1 to table_2 on "a" and "b", you'll find that the query plan involves no Sort or Exchange, only a SortMergeJoin.
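To verify this yourself, here is a minimal sketch of the join and the plan check, assuming `spark` is an active `SparkSession` and the two bucketed tables were saved as above:

```scala
// Read the bucketed tables back and join on the bucketing columns.
val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

val joined = t1.join(t2, Seq("a", "b"))

// Print the physical plan; with matching bucket counts and sort specs
// it should show a SortMergeJoin with no Exchange (shuffle) node.
joined.explain()
```

The key is that both sides use the same bucket count and the same bucketing/sorting columns; otherwise Spark falls back to shuffling one or both sides.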