Hi Nikunj,

Depending on what kind of stats you want to accumulate, you may want to look into the Accumulator/Accumulable API. If you need more control, you can store these things in an external key-value store (HBase, Redis, etc.) and do careful updates there. Be careful, though, to make sure your updates are atomic (transactions or CAS semantics), or you could run into race conditions.
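Not from your pipeline, but a minimal sketch of both ideas in one place, assuming the Spark 1.x Scala API, a socket source as a stand-in, and Jedis as the Redis client. All the names (stats-feedback, processedCount, the "stats" hash, host/port) are made up for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import redis.clients.jedis.Jedis

    val conf = new SparkConf().setAppName("stats-feedback")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // 1) Driver-side accumulator: its value persists across batches of the stream.
    val processedCount = ssc.sparkContext.accumulator(0L, "processedCount")

    val events = ssc.socketTextStream("localhost", 9999)        // placeholder source
    val counts = events.map(word => (word, 1L)).reduceByKey(_ + _)

    counts.foreachRDD { rdd =>
      rdd.foreach { case (_, n) => processedCount += n }
      // Accumulator values are only reliable when read on the driver, after the action finishes.
      println(s"Records processed so far: ${processedCount.value}")

      // 2) External key-value store: HINCRBY is atomic on the Redis server, so concurrent
      //    tasks cannot race each other on the same field.
      rdd.foreachPartition { iter =>
        val jedis = new Jedis("localhost", 6379)                // one connection per partition
        iter.foreach { case (key, n) => jedis.hincrBy("stats", key, n) }
        jedis.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()

The accumulator only gives you simple "add-only" stats read back on the driver; anything the executors need to look up per record is better served by the external store (or the broadcast approach from your question).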
Jason

On Fri, Aug 28, 2015 at 11:39 AM N B <nb.nos...@gmail.com> wrote:

> Hi all,
>
> I have the following use case that I wanted to get some insight on how to
> go about doing in Spark Streaming.
>
> Every batch is processed through the pipeline and at the end, it has to
> update some statistics information. This updated info should be reusable
> in the next batch of this DStream, e.g. for looking up the relevant stat,
> and it in turn refines the stats further. It has to continue doing this
> for every batch processed. The first batch in the DStream can work with an
> empty stats lookup without issue. Essentially, we are trying to do a
> feedback loop.
>
> What is a good pattern to apply for something like this? Some approaches
> that I considered are:
>
> 1. Use updateStateByKey(). But this produces a new DStream that I cannot
> refer back to in the pipeline, so it seems like a no-go, but I would be
> happy to be proven wrong.
>
> 2. Use broadcast variables to maintain this state in a Map, for example,
> and continue re-broadcasting it after every batch. I am not sure if this
> has performance implications or if it's even a good idea.
>
> 3. IndexedRDD? Looked promising initially, but I quickly realized that it
> might have the same issue as the updateStateByKey() approach, i.e. it's
> not available in the pipeline before it's created.
>
> 4. Any other ideas that are obvious and I am missing?
>
> Thanks
> Nikunj
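On option 2 from the question: re-broadcasting a small stats map each batch can work, since broadcast variables are read-only snapshots; the trick is to rebuild the broadcast on the driver after every batch. A rough sketch, assuming an existing StreamingContext `ssc` and a DStream[String] `events`; the refinement rule and merge logic are purely illustrative:

    import org.apache.spark.broadcast.Broadcast

    // Driver-side reference, rebuilt after every batch so the next batch sees the refined stats.
    var statsBroadcast: Broadcast[Map[String, Long]] =
      ssc.sparkContext.broadcast(Map.empty[String, Long])

    events.foreachRDD { rdd =>
      val bc = statsBroadcast                                  // capture this batch's view of the stats

      // Example lookup against the previous batch's stats (executors read bc.value locally).
      val refined = rdd.filter(word => bc.value.getOrElse(word, 0L) < 1000L)   // hypothetical rule
      val batchCounts = refined
        .map(word => (word, 1L))
        .reduceByKey(_ + _)
        .collectAsMap()                                        // assumed small enough for the driver

      // Merge this batch's counts into the previous stats and re-broadcast for the next batch.
      val merged = batchCounts.foldLeft(bc.value) {
        case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v))
      }
      bc.unpersist(blocking = false)                           // drop the stale copy on the executors
      statsBroadcast = ssc.sparkContext.broadcast(merged)
    }

The collectAsMap()/re-broadcast round trip happens once per batch on the driver, so it only stays cheap while the stats map itself stays small; for anything large, the external-store route above is probably safer.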