Hi, I'm completely new to Spark Streaming (and Spark) and have been reading
up on it and trying out various examples for the past few days. I have a
particular use case that I think it would work well for, but I wanted to
put it out there and get some feedback on whether it actually would. The
use case is:

We have web tracking data continuously coming in from a pool of web servers.
For simplicity, let's just say the data is text lines with a known set of
fields, e.g. "timestamp userId domain ...". What I want to do is:
1. group this continuous stream of data by "userId:domain", and
2. when the latest timestamp in a group is older than a certain threshold,
persist that group's results to a DB

#1 is straightforward and there are plenty of examples showing how to do it.
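
For concreteness, here's roughly the shape of what I've been playing with
for #1. The socket source, app name, batch interval, and field positions are
all placeholders I picked for the example, not anything we've settled on:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder setup: in reality the stream would come from our web
    // servers (Kafka, Flume, etc.) rather than a socket, and the 10-second
    // batch interval is arbitrary.
    val conf = new SparkConf().setAppName("WebTracking")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Key each line by "userId:domain", keeping the event timestamp as the
    // value. Field positions assume the "timestamp userId domain ..." layout.
    val events = lines.map { line =>
      val fields = line.split(" ")
      (fields(1) + ":" + fields(2), fields(0).toLong)
    }
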
However, I'm not sure how I would go about doing #2, or whether it's even
something I can do with Spark Streaming, because as far as I can tell it
operates on sliding windows. I really just want to keep accumulating these
"userId:domain" groups for all time (without specifying a window) and then
roll each one up and flush it once no new data has come in for that group
for a certain amount of time. Would the updateStateByKey function allow me
to do this somehow?
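
If I understand updateStateByKey correctly (that the update function is
invoked every batch even for keys with no new data, and that returning None
drops a key from the state), then something like the sketch below, built on
the snippet above, is what I'm imagining. GroupState, the 30-minute
threshold, and saveGroupToDb are all names and values I made up:

    // Per-key state: accumulated event timestamps (a stand-in for whatever
    // we'd actually accumulate per group), the latest timestamp seen, and a
    // flag marking groups already written out so they can be dropped from
    // the state on the following batch.
    case class GroupState(events: Seq[Long], latest: Long, flushed: Boolean)

    val timeoutMs = 30 * 60 * 1000L // made-up 30-minute inactivity threshold

    // Comparing wall-clock time against event timestamps assumes the
    // timestamps are epoch millis reasonably close to real time.
    def update(newEvents: Seq[Long], state: Option[GroupState]): Option[GroupState] =
      state match {
        // Flushed on a previous batch: returning None drops the key.
        case Some(s) if s.flushed => None
        // Quiet past the threshold: mark the group for flushing downstream.
        case Some(s) if newEvents.isEmpty &&
            System.currentTimeMillis() - s.latest > timeoutMs =>
          Some(s.copy(flushed = true))
        // Still active: fold in the new events and advance `latest`.
        case Some(s) =>
          Some(GroupState(s.events ++ newEvents,
            (s.latest +: newEvents).max, flushed = false))
        // First time we see this key (newEvents is non-empty here).
        case None =>
          Some(GroupState(newEvents, newEvents.max, flushed = false))
      }

    ssc.checkpoint("/tmp/checkpoints") // updateStateByKey requires checkpointing

    val stateStream = events.updateStateByKey(update)

    // Made-up placeholder for the actual DB write.
    def saveGroupToDb(key: String, events: Seq[Long]): Unit = ???

    // Persist the groups that just timed out.
    stateStream.filter { case (_, s) => s.flushed }
      .foreachRDD { rdd =>
        rdd.foreachPartition(_.foreach { case (key, s) =>
          saveGroupToDb(key, s.events)
        })
      }

Is that a reasonable way to use it, or am I misunderstanding how the state
lifecycle works?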

Any help would be appreciated.


