Hi, I have a system that does streaming analysis over long time spans, for example a sliding window of 24 hours that advances every 15 minutes. I have a batch process I need to convert to this streaming model, and I am wondering how to do so efficiently.
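To make the setup concrete, here is a minimal sketch of the kind of sliding-window aggregation I mean. The rate source and the column names are just stand-ins for my real login-event input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("SlidingWindowSketch").getOrCreate()
import spark.implicits._

// Stand-in streaming source; my real input is login events with an
// event-time column, so "rate" and the renamed column are assumptions.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// A 24-hour window sliding every 15 minutes. Spark keeps the running
// aggregation state, so each trigger only folds in newly arrived rows.
val counts = events
  .withWatermark("eventTime", "1 hour")
  .groupBy(window($"eventTime", "24 hours", "15 minutes"))
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()
```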
I am currently building the streaming process, so I can either use DStreams or create a dataframe for each time period manually. I know that with a streaming groupBy, Spark keeps the aggregation state between triggers, so only the new time period has to be calculated. My problem, however, is handling window functions. Consider an example where I have a window function that counts the number of failed logins before each successful one in the last 2 hours. How would I convert that to streaming so it isn't recalculated from scratch every 15 minutes?
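For reference, here is roughly what the batch version of that window function looks like; the table, columns, and toy data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("BatchWindowSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy batch data standing in for my real login table.
val logins = Seq(
  ("alice", "2024-01-01 10:00:00", false),
  ("alice", "2024-01-01 10:05:00", false),
  ("alice", "2024-01-01 10:10:00", true),
  ("bob",   "2024-01-01 09:00:00", false),
  ("bob",   "2024-01-01 12:00:00", true)
).toDF("user", "eventTime", "success")
  .withColumn("eventTime", to_timestamp($"eventTime"))

// Per user, look back over the 2 hours before each row (excluding it)
// by ordering on the epoch-second value of the event time.
val last2h = Window
  .partitionBy($"user")
  .orderBy($"eventTime".cast("long"))
  .rangeBetween(-2L * 60 * 60, -1)

val failsBeforeSuccess = logins
  .withColumn("failsBefore", sum(when(!$"success", 1).otherwise(0)).over(last2h))
  .filter($"success") // keep only the successful logins

failsBeforeSuccess.show()
```

Thanks, Assaf.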