Hi, I'm completely new to Spark Streaming (and Spark) and have been reading up on it and trying out various examples over the past few days. I have a particular use case which I think it would work well for, but I wanted to put it out there and get some feedback on whether it actually would. The use case is:
We have web tracking data continuously coming in from a pool of web servers. For simplicity, let's just say the data is text lines with a known set of fields, e.g. "timestamp userId domain ...". What I want to do is:

1. group this continuous stream of data by "userId:domain", and
2. when the latest timestamp in each group is older than a certain threshold, persist the group to a DB.

#1 is straightforward and there are plenty of examples showing how to do it. However, I'm not sure how I would go about #2, or whether it's something I can even do with Spark, because as far as I can tell Spark Streaming operates on sliding windows. I really just want to keep accumulating these "userId:domain" groups indefinitely (without specifying a window) and then roll up and flush each group once no new data has arrived for it after a certain amount of time. Would the updateStateByKey function let me do this somehow, along the lines of the sketch below? Any help would be appreciated.
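To frame the question, here is a minimal Scala sketch of what I have in mind. It assumes a socket text source, a simplified three-field record, a hypothetical persistToDb helper, and a made-up 5-minute idle threshold; none of this is from a real deployment. My understanding is that updateStateByKey invokes the update function for every key that has state on every batch (with an empty Seq when no new records arrived for that key), so the timeout check could live there, and returning None would evict the key:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  // Simplified record; the real data has more fields.
  case class Hit(timestamp: Long, userId: String, domain: String)

  object SessionRollup {

    // Placeholder for the real DB write.
    def persistToDb(hits: Seq[Hit]): Unit = ()

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("SessionRollup")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("/tmp/session-rollup-checkpoint") // updateStateByKey requires checkpointing

      val idleThresholdMs = 5 * 60 * 1000L // made-up: flush a group after 5 idle minutes

      val lines = ssc.socketTextStream("localhost", 9999) // stand-in for the real source
      val hits = lines.map { line =>
        val f = line.split(" ") // "timestamp userId domain ..."
        val hit = Hit(f(0).toLong, f(1), f(2))
        (hit.userId + ":" + hit.domain, hit)
      }

      // State per key: (accumulated hits, wall-clock time of last new data).
      // The update function runs for every key with state on every batch, so a
      // key that receives no new data still gets a chance to expire. Returning
      // None removes the key from the state -- that's the flush point.
      val sessions = hits.updateStateByKey[(Seq[Hit], Long)] {
        (newHits: Seq[Hit], state: Option[(Seq[Hit], Long)]) =>
          val now = System.currentTimeMillis()
          state match {
            case Some((acc, lastSeen)) if newHits.isEmpty =>
              if (now - lastSeen > idleThresholdMs) {
                persistToDb(acc) // roll up and write, then evict the group
                None
              } else {
                Some((acc, lastSeen)) // still inside the idle window; keep waiting
              }
            case Some((acc, _)) =>
              Some((acc ++ newHits, now)) // new data arrived; extend the group
            case None =>
              Some((newHits, now)) // first time we see this key
          }
      }
      sessions.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }

Two things I'm unsure about in this sketch: doing the DB write inside the update function runs on the executors, so maybe it would be cleaner to tag expired groups and persist them via foreachRDD instead; and it expires groups on wall-clock batch time rather than the event timestamps in the data, which may or may not matter for my case.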