This is a good use case for DStream.updateStateByKey! This allows you
to maintain arbitrary per-key state. Check out this example:
https://github.com/tdas/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala

Also take a look at the Spark Streaming documentation for more information.

This example just keeps an integer (the running count) as the state,
but in practice it can be anything. In your case, you can keep that
per-group (i.e., per-key) threshold as the state. In the update function,
for each record in a group, you can check the timestamp against the
threshold, write the group's results to the DB, and update the threshold
if needed.
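Here's a rough sketch of that idea (untested, just to illustrate the
shape of it; GroupState, flushToDb, idleThresholdMs, and the socket
source / field positions are placeholders for your actual input and sink):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IdleGroupFlush {
  // Per-key state: when we last saw data for the group, the accumulated
  // timestamps, and whether the group has already been flushed to the DB.
  case class GroupState(lastSeen: Long, timestamps: Seq[Long],
                        flushed: Boolean = false)

  // Placeholder: flush after 60s with no new data for a group.
  val idleThresholdMs = 60 * 1000L

  // Placeholder sink; replace with your actual DB write.
  def flushToDb(key: String, state: GroupState): Unit = ()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IdleGroupFlush")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("checkpoint")  // updateStateByKey requires checkpointing

    // Lines of "timestamp userId domain ..." -> (userId:domain, timestamp)
    val lines = ssc.socketTextStream("localhost", 9999)
    val keyed = lines.map { line =>
      val fields = line.split("\\s+")
      (fields(1) + ":" + fields(2), fields(0).toLong)
    }

    // updateStateByKey is invoked every batch for every key with state,
    // even when no new records arrived, so idle groups still get checked.
    val state = keyed.updateStateByKey[GroupState] {
      (newTs: Seq[Long], current: Option[GroupState]) =>
        val now = System.currentTimeMillis()
        current match {
          case None =>
            Some(GroupState(now, newTs))
          case Some(prev) if newTs.nonEmpty =>
            // New data: accumulate (starting fresh if already flushed).
            val base = if (prev.flushed) Seq.empty[Long] else prev.timestamps
            Some(GroupState(now, base ++ newTs))
          case Some(prev) if prev.flushed =>
            None  // flushed in the previous batch; drop the state
          case Some(prev) if now - prev.lastSeen > idleThresholdMs =>
            Some(prev.copy(flushed = true))  // idle: mark for flushing
          case Some(prev) =>
            Some(prev)  // still within the threshold; keep waiting
        }
    }

    // Flush marked groups outside the update function; each group stays
    // marked for exactly one batch, so it gets written once.
    state.foreachRDD { rdd =>
      rdd.filter(_._2.flushed).foreach { case (k, s) => flushToDb(k, s) }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note this sketch decides idleness from the wall-clock arrival time; if
you want it based on the latest event timestamp in a group, compare
against the max of the accumulated timestamps instead. Returning None
from the update function is what drops a group's state, so flushed
groups don't accumulate forever.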

TD



On Thu, Apr 17, 2014 at 1:24 PM, xargsgrep <ahs...@gmail.com> wrote:

> Hi, I'm completely new to Spark Streaming (and Spark) and have been reading
> up on it and trying out various examples the past few days. I have a
> particular use case which I think it would work well for, but I wanted to
> put it out there and get some feedback on whether or not it actually would.
> The use case is:
>
> We have web tracking data continuously coming in from a pool of web
> servers.
> For simplicity, let's just say the data is text lines with a known set of
> fields, eg: "timestamp userId domain ...". What I want to do is:
> 1. group this continuous stream of data by "userId:domain", and
> 2. when the latest timestamp in each group is older than a certain
> threshold, persist the results to a DB
>
> #1 is straightforward and there are plenty of examples showing how to do
> it.
> However, I'm not sure how I would go about doing #2, or if that's something
> I can even do with Spark, because as far as I can tell it operates on
> sliding windows. I really just want to continue to accumulate these groups of
> "userId:domain" for all time (without specifying a window) and then roll
> them up and flush them once no new data has come in for a group after a
> certain amount of time. Would the updateStateByKey function allow me to do
> this somehow?
>
> Any help would be appreciated.
>
