Hi all,

I have the following use case and wanted some insight on how to go about
implementing it in Spark Streaming.

Every batch is processed through the pipeline and, at the end, it has to
update some statistics. The updated stats should be reusable by the next
batch of the same DStream, e.g. for looking up the relevant stat, which in
turn refines the stats further. This has to continue for every batch
processed. The first batch in the DStream can start from an empty stats
lookup without issue. Essentially, we are trying to build a feedback loop.
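
Roughly, this is the per-batch loop I am trying to express over the stream.
All the names below are placeholders I made up just to show the shape of it:

  case class Batch(events: Seq[String])                          // placeholder batch type
  case class Stats(count: Long)                                  // placeholder stats type
  def process(b: Batch, s: Stats): Long = b.events.size          // would look up s while processing
  def refine(s: Stats, result: Long): Stats = Stats(s.count + result)

  var stats = Stats(0L)                      // first batch starts from empty stats
  for (batch <- Seq.empty[Batch]) {          // stand-in for the DStream's batches
    val result = process(batch, stats)       // look up the current stats while processing
    stats = refine(stats, result)            // the refined stats feed the next batch
  }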

What is a good pattern to apply for something like this? Some approaches
that I considered are:

1. Use updateStateByKey(). But this produces a new DStream that I cannot
refer back to earlier in the pipeline, so it seems like a no-go, though I
would be happy to be proven wrong. (The first sketch after this list shows
what I mean.)

2. Use a broadcast variable to hold this state, in a Map for example, and
keep re-broadcasting it after every batch. I am not sure if this has
performance implications or whether it's even a good idea. (The second
sketch after this list shows the re-broadcast pattern I have in mind.)

3. IndexedRDD? It looked promising initially, but I quickly realized it
might have the same issue as the updateStateByKey() approach, i.e. the
updated RDD is not available in the pipeline before it is created.

4. Any other obvious ideas that I am missing?
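
Here is a rough, untested sketch of what I mean in (1). Event, Stats and
refine() are placeholders; the point is that the state DStream returned by
updateStateByKey() is not something the same batch's pipeline can look up:

  import org.apache.spark.streaming.dstream.DStream

  case class Event(key: String)
  case class Stats(count: Long)
  def refine(prev: Stats, events: Seq[Event]): Stats = Stats(prev.count + events.size)

  // ssc.checkpoint("...") is required for stateful operations
  val events: DStream[(String, Event)] = ???   // keyed input stream

  val statsByKey: DStream[(String, Stats)] =
    events.updateStateByKey { (newEvents: Seq[Event], prev: Option[Stats]) =>
      Some(refine(prev.getOrElse(Stats(0L)), newEvents))
    }
  // statsByKey carries the refined stats from batch to batch, but the main
  // processing of the same batch cannot consult it while it runs.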
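
And this is roughly the re-broadcast pattern from (2) that I am unsure
about. Again Event, Stats, enrich() and refineStats() are placeholders, and
this is a sketch rather than tested code:

  import org.apache.spark.broadcast.Broadcast
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  case class Event(key: String)
  case class Stats(count: Long)
  def enrich(e: Event, stats: Map[String, Stats]): (Event, Option[Stats]) = (e, stats.get(e.key))
  def refineStats(old: Map[String, Stats], batch: Array[(Event, Option[Stats])]): Map[String, Stats] = old

  val ssc: StreamingContext = ???
  val events: DStream[Event] = ???

  // Driver-side handle to the "current" stats; starts empty for the first batch.
  @volatile var statsBc: Broadcast[Map[String, Stats]] =
    ssc.sparkContext.broadcast(Map.empty[String, Stats])

  events.foreachRDD { rdd =>
    val bc = statsBc                                   // broadcast seen by this batch
    val enriched = rdd.map(e => enrich(e, bc.value))   // executors look up the stats
    // ... write `enriched` out / continue the rest of the pipeline here ...

    // Pull the results back to the driver, refine the stats, and re-broadcast
    // so the next batch sees the updated values.
    val updated = refineStats(bc.value, enriched.collect())
    bc.unpersist(blocking = false)
    statsBc = ssc.sparkContext.broadcast(updated)
  }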

Thanks
Nikunj
