Hi, I may be missing something, but consider how this would have worked with a DStream window. With a sliding window of 5 minutes every 1 minute, an RDD would be generated every minute, the RDDs for the last 5 minutes would be unioned, and I would then convert the result to a dataframe. So I simulated this by skipping the RDD stage and going directly to a union of dataframes. I believed that since Spark caches shuffle results, it would cache the partial aggregations and I would get an incremental aggregate. Obviously this is not happening, and I am re-aggregating the older dataframes again and again.
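To make this concrete, here is a rough sketch of the simulation I mean (the names, the columns, and the 5-batch window are illustrative, not my exact code):

    // Illustrative only: simulating a 5-minute window sliding every 1 minute
    // by unioning the last five one-minute batch dataframes and re-aggregating.
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("window-simulation").getOrCreate()
    import spark.implicits._

    var batches = Seq.empty[DataFrame] // the most recent one-minute batches

    def onNewBatch(newBatchDF: DataFrame): DataFrame = {
      batches = (batches :+ newBatchDF).takeRight(5) // keep a 5-batch window
      val windowDF = batches.reduce(_ union _)       // union the whole window
      // Every call re-scans and re-aggregates all five batches; the shuffle
      // output of the previous window's aggregation is not reused.
      windowDF.groupBy($"key").agg(sum($"value").as("total"))
    }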
What am I missing here? Is there a way to do this?

As for Structured Streaming, I understand this is the future of Spark, but currently there are too many important unsupported elements (e.g. no support for multiple aggregations, no support for distinct operations, no support for outer joins, etc.).

Assaf.

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, December 23, 2016 2:46 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: streaming performance

From what I understand looking at the code in the Stack Overflow question, I think you are "simulating" the streaming version of your calculation incorrectly. You are repeatedly unioning batch dataframes to simulate streaming and then applying the aggregation on the unioned DF. That is not going to compute the aggregates incrementally; it will just process the whole data every time, so the oldest batch DF gets processed again and again, causing ever-increasing resource usage. That is not how streaming works, so this is not simulating the right thing.

With Structured Streaming's streaming dataframes, the computation is actually done incrementally. The best way to try it is to generate a file per "bucket", and then create a streaming dataframe on those files such that they are picked up one by one. See this notebook for the maxFilesPerTrigger option: http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Scala%20DataFrames%20API.html This would process each file one by one, maintain internal state to continuously update the aggregates, and never require reprocessing the old data (a minimal sketch of this approach follows after the quoted message below).

Hope this helps.

On Wed, Dec 21, 2016 at 7:58 AM, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

I am having trouble with streaming performance. My main problem is how to do a sliding-window calculation where the ratio between the window size and the step size is relatively large (hundreds) without recalculating everything all the time. I created a simple example of what I am aiming at, with what I have so far, which is detailed in http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance I was hoping someone could point out what I am doing wrong.

Thanks,
Assaf.
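For reference, here is a minimal sketch of the incremental approach Tathagata describes above, assuming one JSON file per one-minute "bucket" lands under a hypothetical /data/buckets directory with illustrative columns key and value:

    // Assumption: each one-minute "bucket" is written as a JSON file under
    // /data/buckets (the path and column names are hypothetical).
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("incremental-agg").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("key", StringType)
      .add("value", LongType)

    val streamingDF = spark.readStream
      .schema(schema)                    // file streams need an explicit schema
      .option("maxFilesPerTrigger", "1") // pick the bucket files up one by one
      .json("/data/buckets")

    // The aggregate is kept as internal streaming state and updated on each
    // trigger; older files are never reprocessed.
    val aggDF = streamingDF.groupBy($"key").agg(sum($"value").as("total"))

    val query = aggDF.writeStream
      .outputMode("complete") // emit the full updated aggregate each trigger
      .format("console")
      .start()

    query.awaitTermination()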