Hello, I'm getting started with Spark. I got JavaNetworkWordCount working on a 3-node cluster, with netcat on port 9999 running an infinite loop that prints random numbers from 0 to 100.
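For reference, my driver is essentially the stock example. Here is a minimal sketch of it (the class name and "localhost" are placeholders, and the lambda signatures assume a recent Spark where FlatMapFunction returns an Iterator):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class NetworkNumberCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("NetworkNumberCount");
    // 1-second batch duration, as described above
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Lines coming from netcat on port 9999
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // The stock word count: split, pair each word with 1, sum by key
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
    JavaPairDStream<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    counts.print(); // prints only the first 10 elements of each batch
    jssc.start();
    jssc.awaitTermination();
  }
}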
With a batch duration of 1 second, I do see a list of (word, count) values every second. The list is limited to 10 values (as per the docs for print()), and the count is ~6,000 per number. Since my input is random numbers from 0 to 100 and I count ~6,000 for each, the distribution being homogeneous, that would mean about 600,000 values are being ingested per batch.

I then switched to a constant number, and now I see between 200,000 and 2,000,000 counts, but the console output is erratic: it's not 1 second anymore; sometimes it's 2 seconds, sometimes more, and sometimes much faster.

What I actually want is 1-to-1 processing (one input value produces one result), so I replaced the flatMap function with a map function and do my calculation there (first sketch below). Now I'd like to know how many events I'm able to process, but it's not clear at all:

If I use print(), it's fast again (1 second per batch), but I only see the first 10 results. I tried adding a counter and realized it only seems to increment by 11 each time. This is very confusing... it looks like the counter is only incremented for the elements touched by the print statement. Does that mean the other values are not even calculated until requested?

If I use .count() on the output RDD (second sketch below), I do see a realistic count, but then it doesn't take 1 second anymore: it's more like 4 to 5 seconds to get 600,000 to 1,000,000 events counted.

I'm not sure where to go from here or how to benchmark the time it takes to actually process the events. Any hint or useful link would be appreciated.

Thanks for your help.
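First sketch: the 1-to-1 version replaces the flatMap/mapToPair/reduceByKey chain from the driver above with a single map; someCalculation() is a placeholder for my actual computation:

// One input value produces exactly one result
// (someCalculation is a stand-in for what I really compute)
JavaDStream<Double> results = lines.map(s -> someCalculation(Double.parseDouble(s.trim())));

// print() only materializes and shows the first 10 results of each batch
results.print();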
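Second sketch: this is roughly how I'm counting and timing each batch; I'm not sure it's the right way to benchmark. Unlike print(), rdd.count() is an action that forces every element of the batch to be computed:

// Count everything in each batch and time how long the full count takes
results.foreachRDD(rdd -> {
  long start = System.currentTimeMillis();
  long n = rdd.count();
  long elapsed = System.currentTimeMillis() - start;
  System.out.println(n + " events counted in " + elapsed + " ms");
});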