Hello, I'm getting started with Spark. I got JavaNetworkWordCount working on a 3-node cluster, with netcat on port 9999 running an infinite loop that prints random numbers from 0 to 100.
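For reference, my driver is essentially the stock example. Here is a minimal sketch of it (the class name and "localhost" are placeholders, and the lambda signatures assume a recent Spark where FlatMapFunction returns an Iterator):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class NetworkNumberCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("NetworkNumberCount");
    // 1-second batch duration, as described above
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Lines coming from netcat on port 9999
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // The stock word count: split, pair each word with 1, sum by key
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
    JavaPairDStream<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    counts.print(); // prints only the first 10 elements of each batch
    jssc.start();
    jssc.awaitTermination();
  }
}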
With a batch duration of 1 second, I do see a list of (word, count) values every second. The list is limited to 10 values (as per the docs for print()), and the count is ~6,000 per number. Since my input is random numbers from 0 to 100 and I count ~6,000 for each, the distribution being homogeneous, that would mean about 600,000 values are being ingested per batch.

I then switched to a constant number, and now I see between 200,000 and 2,000,000 counts, but the console output is erratic: it's not 1 second anymore; sometimes it's 2 seconds, sometimes more, and sometimes much faster.

What I actually want is 1-to-1 processing (one input value produces one result), so I replaced the flatMap function with a map function and do my calculation there (first sketch below). Now I'd like to know how many events I'm able to process, but it's not clear at all:

If I use print(), it's fast again (1 second per batch), but I only see the first 10 results. I tried adding a counter and realized it only seems to increment by 11 each time. This is very confusing... it looks like the counter is only incremented for the elements touched by the print statement. Does that mean the other values are not even calculated until requested?

If I use .count() on the output RDD (second sketch below), I do see a realistic count, but then it doesn't take 1 second anymore: it's more like 4 to 5 seconds to get 600,000 to 1,000,000 events counted.

I'm not sure where to go from here or how to benchmark the time it takes to actually process the events. Any hint or useful link would be appreciated.

Thanks for your help.
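First sketch: the 1-to-1 version replaces the flatMap/mapToPair/reduceByKey chain from the driver above with a single map; someCalculation() is a placeholder for my actual computation:

// One input value produces exactly one result
// (someCalculation is a stand-in for what I really compute)
JavaDStream<Double> results = lines.map(s -> someCalculation(Double.parseDouble(s.trim())));

// print() only materializes and shows the first 10 results of each batch
results.print();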
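Second sketch: this is roughly how I'm counting and timing each batch; I'm not sure it's the right way to benchmark. Unlike print(), rdd.count() is an action that forces every element of the batch to be computed:

// Count everything in each batch and time how long the full count takes
results.foreachRDD(rdd -> {
  long start = System.currentTimeMillis();
  long n = rdd.count();
  long elapsed = System.currentTimeMillis() - start;
  System.out.println(n + " events counted in " + elapsed + " ms");
});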