https://issues.apache.org/jira/browse/SPARK-19256 is an active umbrella
issue for this feature. But as of Spark 2.2, you can already invoke APIs on
DataFrames to bucket the data as it is serialized out to a Hive table.
If you invoke

  import org.apache.spark.sql.functions.col
  val bucketCount = 100
  df1
    .repartition(bucketCount, col("a"), col("b"))
    .write                            // bucketBy is a DataFrameWriter method
    .bucketBy(bucketCount, "a", "b")
    .sortBy("a", "b")
    .saveAsTable("tab1")              // illustrative table name

then the rows are persisted into a fixed number of buckets keyed on (a, b),
and a later join on those columns between two tables written this way can
avoid shuffling either side, even at a billion rows per table.
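To see the payoff (a sketch: it assumes a df2 saved the same way to a second
table, and that the session and table names above are placeholders):

  // Read the bucketed tables back and join on the bucketing columns.
  val t1 = spark.table("tab1")
  val t2 = spark.table("tab2")   // assumes df2 was written the same way

  // Both sides are pre-bucketed into the same bucket count on (a, b), so
  // the physical plan shows a SortMergeJoin with no Exchange (shuffle) on
  // either side of the join.
  t1.join(t2, Seq("a", "b")).explain()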
Hi Cody,

There are currently no concrete plans for adding buckets to Spark SQL, but
that's mostly due to lack of resources / demand for this feature. Adding
full support is probably a fair amount of work, since you'd have to make
changes throughout parsing, optimization, and execution.
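To make the parsing part concrete, this is the shape of Hive's bucketed-table
DDL, which Spark SQL would have to understand end to end (a sketch run through
a HiveContext; the table, columns, and bucket count are made up):

  // Hive's bucketed-table DDL (illustrative names and bucket count):
  hiveContext.sql("""
    CREATE TABLE clicks (user_id INT, url STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS
  """)
  // Hive itself only writes correctly bucketed files when this is set:
  hiveContext.sql("SET hive.enforce.bucketing = true")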
I noticed that the release notes for 1.1.0 said that Spark doesn't support
Hive buckets "yet". I didn't notice any JIRA issues related to adding
support.

Broadly speaking, what would be involved in supporting buckets, especially
the bucketmapjoin and sortedmerge optimizations?
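Roughly, what I mean by those two (a toy sketch in plain Scala, not anything
Hive actually runs):

  // Toy model of a bucket map join: both sides are hashed into the same
  // bucket count on the join key, so bucket i can only match bucket i and
  // each bucket pair is joined independently, with no shuffle and only a
  // per-bucket hash table (the sortedmerge variant replaces even that
  // hash table with a merge of two sorted buckets).
  val bucketCount = 4
  def bucketOf(key: Int): Int = Math.floorMod(key.hashCode, bucketCount)

  val left  = Seq((1, "a"), (2, "b"), (5, "c"))
  val right = Seq((1, "x"), (5, "y"), (6, "z"))

  val leftBuckets  = left.groupBy  { case (k, _) => bucketOf(k) }
  val rightBuckets = right.groupBy { case (k, _) => bucketOf(k) }

  val joined = (0 until bucketCount).flatMap { i =>
    val hash = rightBuckets.getOrElse(i, Seq.empty).groupBy(_._1)
    leftBuckets.getOrElse(i, Seq.empty).flatMap { case (k, v) =>
      hash.getOrElse(k, Seq.empty).map { case (_, w) => (k, v, w) }
    }
  }
  // joined: Vector((1,"a","x"), (5,"c","y"))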