https://issues.apache.org/jira/browse/SPARK-19256 is an active umbrella issue
tracking this feature.

But as of Spark 2.2, you can already use the DataFrameWriter API to bucket
DataFrames on write, in the Hive-compatible bucketed table format.

If you invoke

val bucketCount = 100

df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")

df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")

and then join table_1 with table_2 on "a" and "b", you'll find that the query
plan contains no Sort or Exchange, only a SortMergeJoin.
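For completeness, a minimal sketch of the read side, assuming the two bucketed
tables were written as above and that `spark` is an existing SparkSession (the
table and column names are just the ones from the example):

```scala
// Assumes a running SparkSession `spark` and the bucketed tables
// default.table_1 / default.table_2 saved as shown above.
val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

// Join on the bucketing/sort columns.
val joined = t1.join(t2, Seq("a", "b"))

// Inspect the physical plan: with matching bucket counts and sort
// columns on both sides, it should show a SortMergeJoin with no
// Exchange (shuffle) operators.
joined.explain()
```

Note that both tables must use the same bucket count and the join keys must be
the bucketing columns; otherwise Spark falls back to shuffling one or both
sides.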

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
