Yes, I agree that transformations run on a DStream even when no data arrives during a batch interval. My situation is this: Spark receives a Flume stream continuously, and I use updateStateByKey to collect the data for a key across several batches. I then process the collected data after waiting a specified time (which I measure with a counter) from the first batch in which no new data arrived for the key in the updateStateByKey operation. Normally, once the waiting time is up, I have collected all the data for that key. But if the Flume source goes down for a while, and the outage lasts longer than the waiting time, I end up with only partial data for the key. So I need a way to determine whether the current Flume batch contains any data. If it doesn't, the Flume source is presumably broken, and I can skip counting in the updateStateByKey operation until the source reconnects; the counter can then resume, and I get the intact data.
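
To make what I mean concrete, here is a rough sketch of the gating I have in mind. The host, port, batch interval, and waiting time are made up, and the driver-side flag trick relies on transform's body running on the driver once per batch, with the flag's current value captured when that batch's state tasks are serialized (which assumes the default sequential batch execution) -- so please read it as a sketch, not a tested solution:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object EmptyBatchGate {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("EmptyBatchGate")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/spark-checkpoint") // updateStateByKey requires checkpointing

        // Hypothetical Flume receiver host/port.
        val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141)

        // Driver-side flag, refreshed once per batch inside transform,
        // whose body runs on the driver when the batch's jobs are generated.
        var batchHasData = true

        val events = flumeStream.transform { rdd =>
          batchHasData = !rdd.isEmpty() // Spark 1.3+; on older versions use rdd.take(1).nonEmpty
          rdd
        }.map(e => (new String(e.event.getBody.array()), 1))

        // State per key: (collected values, idle-batch counter). The counter
        // only advances when the batch as a whole carried data, so a Flume
        // outage freezes the waiting time instead of consuming it.
        val maxIdleBatches = 6 // hypothetical waiting time, in batches
        val state = events.updateStateByKey[(Seq[Int], Int)] {
          (newValues: Seq[Int], old: Option[(Seq[Int], Int)]) =>
            val (buffer, idle) = old.getOrElse((Seq.empty[Int], 0))
            if (newValues.nonEmpty) Some((buffer ++ newValues, 0)) // fresh data: reset the counter
            else if (batchHasData)  Some((buffer, idle + 1))       // batch had data, just none for this key
            else                    Some((buffer, idle))           // empty batch: source down, don't count
        }

        // Keys whose waiting time is up would be flushed downstream here.
        state.filter { case (_, (_, idle)) => idle >= maxIdleBatches }.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that rdd.isEmpty() inside transform launches a small extra job per batch; that seems acceptable here since it only inspects the first partition with data.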
Another question: why does the count variable have no effect when I update it inside map, while it does take effect inside foreachRDD in my previous code? Thanks :P
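
For clarity, here is a stripped-down version of the pattern I mean (the source and names are hypothetical, not my actual code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CounterClosureDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CounterClosureDemo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        val stream = ssc.socketTextStream("localhost", 9999) // any source will do

        var mapCounter = 0L
        var batchCounter = 0L

        // Has no visible effect: the map closure is shipped to the executors,
        // which increment their own deserialized copy of mapCounter; the
        // driver's variable is never touched and stays 0 there.
        val mapped = stream.map { line =>
          mapCounter += 1
          line
        }

        // Takes effect: the outer body of foreachRDD runs on the driver once
        // per batch, so updating driver-side state here works as expected.
        mapped.foreachRDD { rdd =>
          batchCounter += rdd.count()
          println(s"batchCounter = $batchCounter, mapCounter = $mapCounter (still 0 on the driver)")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

My understanding is that if executor-side increments need to be visible on the driver, an accumulator is the mechanism designed for that, rather than a plain captured variable.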