Hi,
Can you try the combination of `repartition` + `sortWithinPartitions` on the
dataset?
E.g., in the spark-shell:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.collect_list

val df = Seq((2, "b c a"), (1, "c a b"), (3, "a c b")).toDF("number", "letters")

// Explode "b c a" into one row per letter; the generated column is named _1.
val df2 = df.explode('letters) {
  case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
}

df2
  .select('number, '_1 as 'letter)
  .repartition('number)                    // put all rows for a key into one partition
  .sortWithinPartitions('number, 'letter)  // sort locally; no global sort needed
  .groupBy('number)
  .agg(collect_list('letter))
  .show()
+------+--------------------+
|number|collect_list(letter)|
+------+--------------------+
| 3| [a, b, c]|
| 1| [a, b, c]|
| 2| [a, b, c]|
+------+--------------------+
I think it should let you aggregate over sorted data per key: repartition
co-locates each key's rows, and since groupBy on the same key can reuse that
partitioning, no further shuffle should reorder them, so collect_list sees
each key's letters in sorted order.
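If you want to double-check that the sort survives into the aggregation, one
thing you can try (a sketch, reusing the df2 from above) is to look at the
physical plan and confirm the planner does not insert another Exchange between
the sort and the aggregate:

df2
  .select('number, '_1 as 'letter)
  .repartition('number)
  .sortWithinPartitions('number, 'letter)
  .groupBy('number)
  .agg(collect_list('letter))
  .explain()   // no Exchange should appear between the Sort and the aggregate

If an extra Exchange does show up above the Sort in the plan, the rows could
be reordered before collect_list sees them.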
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/