Yes, I agree that you can transform the DStream directly even when no
data is injected during a batch interval.
My situation is this: Spark receives the Flume stream continuously, and
I use the updateStateByKey function to collect data for a key across
several batches. I then process the collected data after a specified
waiting time has elapsed (which I measure with a counter) since the
first batch in which no new data arrived for that key in the
updateStateByKey operation. Normally, when the waiting time is up, I
have collected all the data for the key. But if the Flume data source
is broken for a while, and the outage lasts longer than the waiting
time, I will only get partial data for the key. So I need a way to
determine whether the current Flume stream batch contains data; if it
does not, that means the Flume data source is broken, and I can skip
the updateStateByKey operation until the Flume data source is
reconnected, at which point the counter in the updateStateByKey
function can start counting again. That way I would get the intact
data.
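For example, I imagine something like the following: a per-batch
emptiness check that runs on the driver. This is a simplified, untested
sketch; the socket source and all names are stand-ins for my real Flume
stream, and I use take(1) rather than counting the whole RDD:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EmptyBatchCheck {
  // Driver-side flag; @volatile because the foreachRDD body runs on
  // the driver, possibly on a different thread than the one that
  // later reads the flag.
  @volatile var batchHasData = false

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EmptyBatchCheck")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in for the Flume-backed DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // take(1) stops after finding one element, so it is much
      // cheaper than rdd.count() when the batch is non-empty.
      batchHasData = rdd.take(1).nonEmpty
      if (!batchHasData) {
        // Empty batch: treat the source as down, so the waiting-time
        // counter should not advance for this interval.
        println("Empty batch: Flume source appears to be down")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}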

Another question: why does the count variable have no effect inside
map, but work inside foreachRDD, in my previous code?
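To make the question concrete, here is a simplified sketch of the two
placements (continuing with the hypothetical `lines` stream from
above; the names are not my actual code):

var count = 0L

// Incrementing here never shows up on the driver; this is the
// behavior I am asking about. (map is also lazy, so nothing runs
// until an output operation is invoked.)
val tagged = lines.map { line =>
  count += 1
  line
}

// Incrementing here works as expected in my code.
lines.foreachRDD { rdd =>
  count += rdd.count() // count() is an action; its result returns to the driver
}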

thanks :P 



