good suggestion, td.
and i believe the optimization that jon.burns is referring to - from the
big data mini course - is a step earlier: the sorting mechanism that
produces sortedCounts.
you can use mapPartitions() to get a top k locally on each partition, then
shuffle only (k * # of partitions)
It works perfect, thanks!. I feel like I should have figured that out, I'll
chalk it up to inexperience with Scala. Thanks again.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670p9772.html
Sent from the Apach
Why doesnt something like this work? If you want a continuously updated
reference to the top counts, you can use a global variable.
var topCounts: Array[(String, Int)] = null
sortedCounts.foreachRDD (rdd =>
val currentTopCounts = rdd.take(10)
// print currentTopCounts it or watever
top