Hi Jason,

Thanks for the response. I believe I can look into a Redis-based solution for storing this state externally. However, would it be possible to refresh it from the store with every batch, i.e., what code can be written inside the pipeline to fetch this info from the external store? Also, it seems like a waste, since all that info should be available from within Spark already. The number of keys and the amount of data will be quite limited in this case; it just needs to be updated periodically.
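To make the second question concrete, here is a minimal sketch of the per-batch refresh I have in mind, assuming a Jedis client, a Redis hash named "batch-stats", and a DStream of (key, value) pairs; the withStats name and the "+ prior" step are placeholders for the real refinement logic:

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

// Sketch only: re-read the stats lookup from Redis once per batch.
// The body of transform() runs on the driver for every batch, so the
// map fetched here is refreshed before each batch is processed.
def withStats(events: DStream[(String, Double)]): DStream[(String, Double)] = {
  events.transform { rdd =>
    val jedis = new Jedis("localhost", 6379)        // assumed Redis endpoint
    val stats = jedis.hgetAll("batch-stats")        // assumed hash: key -> serialized stat
    jedis.close()
    val lookup = rdd.sparkContext.broadcast(stats)  // shipped to executors for this batch
    rdd.map { case (k, v) =>
      val prior = Option(lookup.value.get(k)).map(_.toDouble).getOrElse(0.0)
      (k, v + prior)                                // placeholder for the real refinement
    }
  }
}

Since this creates a fresh broadcast per batch, it is effectively the "keep re-broadcasting" idea from option 2 in my original message below; in a long-running stream the previous broadcast should be unpersisted so they don't pile up.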
The Accumulator-based solution only works for simple counting, and we have a larger stats object that includes things like averages, variance, etc. The Accumulable interface could potentially be implemented for this, but the other restriction seems to be that the value of such an accumulator can only be accessed on the driver/master. We want this info to be available during the data processing of the next batch, as a lookup for example. Do you know of a way to make that possible with Accumulable? (A rough sketch of what I have in mind is at the end of this message, after the quoted thread.)

Thanks,
Nikunj

On Fri, Aug 28, 2015 at 3:10 PM, Jason <ja...@jasonknight.us> wrote:

> Hi Nikunj,
>
> Depending on what kind of stats you want to accumulate, you may want to
> look into the Accumulator/Accumulable API, or, if you need more control,
> you can store these things in an external key-value store (HBase, Redis,
> etc.) and do careful updates there. Be careful, though, and make sure your
> updates are atomic (transactions or CAS semantics) or you could run into
> race-condition problems.
>
> Jason
>
> On Fri, Aug 28, 2015 at 11:39 AM N B <nb.nos...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have the following use case that I wanted to get some insight on how
>> to go about doing in Spark Streaming.
>>
>> Every batch is processed through the pipeline and, at the end, it has
>> to update some statistics information. This updated info should be
>> reusable in the next batch of this DStream, e.g., for looking up the
>> relevant stat, which in turn refines the stats further. It has to keep
>> doing this for every batch processed. The first batch in the DStream can
>> work with an empty stats lookup without issue. Essentially, we are trying
>> to build a feedback loop.
>>
>> What is a good pattern to apply for something like this? Some
>> approaches that I considered are:
>>
>> 1. Use updateStateByKey(). But this produces a new DStream that I
>> cannot refer back to in the pipeline, so it seems like a no-go, though I
>> would be happy to be proven wrong.
>>
>> 2. Use broadcast variables to maintain this state, in a Map for
>> example, and keep re-broadcasting it after every batch. I am not sure if
>> this has performance implications or if it's even a good idea.
>>
>> 3. IndexedRDD? It looked promising initially, but I quickly realized it
>> might have the same issue as the updateStateByKey() approach, i.e. it's
>> not available in the pipeline before it's created.
>>
>> 4. Any other ideas that are obvious and I am missing?
>>
>> Thanks,
>> Nikunj
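P.S. To make the Accumulable question concrete, here is a minimal sketch of the kind of stats accumulator I mean, assuming the Spark 1.x AccumulableParam API; the Stats and StatsParam names and the wire() helper are mine, just for illustration:

import org.apache.spark.AccumulableParam
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical running-stats value: enough state to derive mean and variance.
case class Stats(count: Long, sum: Double, sumSq: Double) {
  def mean: Double = if (count == 0) 0.0 else sum / count
  def variance: Double = if (count == 0) 0.0 else sumSq / count - mean * mean
}

// Tasks add raw Doubles; Spark merges per-partition partial Stats via addInPlace.
object StatsParam extends AccumulableParam[Stats, Double] {
  def addAccumulator(s: Stats, x: Double): Stats =
    Stats(s.count + 1, s.sum + x, s.sumSq + x * x)
  def addInPlace(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, a.sum + b.sum, a.sumSq + b.sumSq)
  def zero(init: Stats): Stats = Stats(0L, 0.0, 0.0)
}

// Driver-side feedback: workers can only add to the accumulator; its value
// is read here after each batch and would then be published (e.g. to Redis)
// so the next batch can fetch it as a lookup.
def wire(ssc: StreamingContext, values: DStream[Double]): Unit = {
  val acc = ssc.sparkContext.accumulable(Stats(0L, 0.0, 0.0))(StatsParam)
  values.foreachRDD { rdd =>
    rdd.foreach(x => acc += x)
    val current = acc.value  // accessible only on the driver
    // write `current` to the external store here for the next batch
  }
}

The restriction I mentioned still bites, though: acc.value is driver-only, so the feedback into the next batch has to go through an external store or a fresh broadcast rather than through the accumulator itself.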