Hello everyone,

I'm an undergrad working on a summarization project. I've built a summarizer in regular (batch) Spark and it works great, but now I want to rewrite it for Spark Streaming to extend its functionality. Basically I take in a bunch of text and extract the most popular words as well as the most popular bi-grams (pairs of adjacent words), and I've managed to do this with streaming (and made it stateful, which is great). The next part of my algorithm, however, requires me to take the top 10 words and top 10 bi-grams and store them in a vector-like structure.
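For reference, here is roughly what the stateful streaming part looks like at the moment. This is only a sketch with placeholder names, a placeholder socket source, and a made-up batch interval, but it shows the word and bi-gram counting I mean:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingSummarizer")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("checkpoint")   // required for the stateful counts

    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val words = lines.flatMap(_.split("\\s+"))

    // keep running totals across batches
    val addCounts = (fresh: Seq[Int], running: Option[Int]) =>
      Some(fresh.sum + running.getOrElse(0))

    val wordCounts = words.map(w => (w, 1)).updateStateByKey(addCounts)

    // bi-grams are pairs of adjacent words within a line
    val bigramCounts = lines.flatMap { line =>
      line.split("\\s+").sliding(2).collect { case Array(a, b) => ((a, b), 1) }
    }.updateStateByKey(addCounts)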
With plain Spark I would use code like:

    array_of_words = words.sortByKey().top(50)

Is there a way to mimic this with streaming? I was following along with the AMP Camp tutorial (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html), so I know that you can print the top 10 with:

    sortedCounts.foreach(rdd => println("\nTop 10 hashtags:\n" + rdd.take(10).mkString("\n")))

However, I can't seem to alter this so that it stores the top 10 instead of just printing them. The instructor mentions at the end that "one can get the top 10 hashtags in each partition, collect them together at the driver and then find the top 10 hashtags among them", but that is left as an exercise.

I would appreciate any help :)

Thanks
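P.S. In case it helps to show what I mean, this is roughly what I have been trying, reusing the wordCounts and bigramCounts names from the sketch above (again just placeholders). I'm not sure whether refreshing a driver-side variable from foreachRDD like this is the right way to go:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark)

    // driver-side holders I would like refreshed every batch
    var topWords: Array[(Int, String)] = Array.empty
    var topBigrams: Array[(Int, (String, String))] = Array.empty

    // same shape as the tutorial's sortedCounts: swap to (count, key), sort descending
    val sortedWords = wordCounts.map(_.swap).transform(_.sortByKey(ascending = false))
    val sortedBigrams = bigramCounts.map(_.swap).transform(_.sortByKey(ascending = false))

    sortedWords.foreachRDD { rdd =>
      // take() brings an Array back to the driver, so it can be assigned, not just printed
      topWords = rdd.take(10)
    }
    sortedBigrams.foreachRDD { rdd =>
      topBigrams = rdd.take(10)
    }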