Hello everyone,

I'm an undergrad working on a summarization project. I've written a
summarizer in plain Spark and it works great, but I'd now like to port it
to Spark Streaming to extend its functionality. Basically I take in a
bunch of text and get the most popular words as well as the most popular
bigrams (two words together), and I've managed to do this with streaming
(and made it stateful, which is great). However, the next part of my
algorithm requires me to take the top 10 words and top 10 bigrams and store
them in a vector-like structure. With plain Spark I would use code like:

val array_of_words = words.sortByKey().top(50)

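For context, the streaming side of my counts currently looks roughly like
this (heavily trimmed; the socket source, batch interval and names are just
placeholders for what I actually use):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Summarizer")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("checkpoint")  // required for the stateful counts

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split("\\s+"))

// running word counts accumulated across batches
val updateCount = (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))
val wordCounts = words.map(w => (w, 1)).updateStateByKey[Int](updateCount)

// running bigram counts (pairs of adjacent words within a line)
val bigramCounts = lines.flatMap { line =>
  line.split("\\s+").sliding(2).collect { case Array(a, b) => ((a, b), 1) }.toSeq
}.updateStateByKey[Int](updateCount)
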
Is there a way to mimic this with streaming? I was following along with the
AMP Camp tutorial
<http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html>
so I know that you can print the top 10 by using:

sortedCounts.foreach(rdd =>
      println("\nTop 10 hashtags:\n" + rdd.take(10).mkString("\n")))

However, I can't seem to alter this so that it stores the top 10 rather than
just printing them. The instructor mentions at the end that

"one can get the top 10 hashtags in each partition, collect them together at
the driver and then find the top 10 hashtags among them", but they leave it
as an exercise. I would appreciate any help :)
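
To make this more concrete, here is roughly what I'm imagining, where each
batch's top 10 ends up in a plain variable I can feed into the next stage of
the algorithm (the names are made up, and I haven't been able to confirm
this is actually the right pattern):

// driver-side holders for the latest batch's top 10 (just an idea; as far
// as I understand, foreachRDD runs in the driver, so assigning to a var
// here should be OK)
var topWords: Array[(String, Int)] = Array.empty
var topBigrams: Array[((String, String), Int)] = Array.empty

wordCounts.foreachRDD { rdd =>
  // top() already does the "top 10 per partition, merged at the driver"
  // step the instructor described, and returns a plain Array
  topWords = rdd.top(10)(Ordering.by(_._2))
  println("Top 10 words:\n" + topWords.mkString("\n"))
}

bigramCounts.foreachRDD { rdd =>
  topBigrams = rdd.top(10)(Ordering.by(_._2))
}

Does something along those lines work, or is there a more idiomatic way to
get the top 10 out of a DStream into a normal collection?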

Thanks


