That is a valid point, Shao. However, it will start using disk as memory storage, akin to swap space. I believe it will not crash; it will just be slow, and this assumes that you do not run out of disk space.
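If memory pressure from a large window is the worry, one mitigation is to set the storage level of the windowed stream explicitly, so that partitions spill to local disk when executor memory runs out rather than failing. A minimal sketch, assuming Spark 1.x Streaming APIs and a socket source as a stand-in for the real input:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowSpillSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowSpillSketch")
    // 5-minute batch interval, matching the numbers in this thread
    val ssc = new StreamingContext(conf, Minutes(5))

    // Hypothetical source; replace with the real input DStream
    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep one hour of batches, sliding every batch (5 minutes)
    val hourly = lines.window(Minutes(60), Minutes(5))

    // MEMORY_AND_DISK_SER spills serialized blocks to local disk when memory
    // is exhausted, trading CPU (deserialization) for a smaller footprint
    hourly.persist(StorageLevel.MEMORY_AND_DISK_SER)

    hourly.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}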
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 9 May 2016 at 08:14, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> For window-related operators, Spark Streaming caches the data in memory
> within the window. In your case the window size is up to 24 hours, which
> means data has to stay in the executors' memory for more than a day; this
> may introduce several problems when memory is not enough.
>
> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> OK, terms for Spark Streaming:
>>
>> The "batch interval" is the basic interval at which the system will
>> receive the data in batches. It is set when creating a StreamingContext.
>> For example, if you set the batch interval to 300 seconds, then any input
>> DStream will generate RDDs of received data at 300-second intervals.
>>
>> A window operator is defined by two parameters:
>> - windowDuration / windowLength - the length of the window
>> - slideDuration / slidingInterval - the interval at which the window
>>   slides or moves forward
>>
>> So your batch interval is 5 minutes. That is the rate at which messages
>> come in from the source.
>>
>> Then you have these two parameters:
>>
>> // n = batch interval, fixed when the StreamingContext is created
>> val ssc = new StreamingContext(sparkConf, Seconds(n))
>> // window length x must be a multiple of the batch interval: x = m * n
>> val windowLength = Seconds(m * n)
>> // sliding interval y: the interval at which the window operation is
>> // performed; x / y must be a whole number
>> val slidingInterval = Seconds(y)
>>
>> Both the window length and the sliding interval must be multiples of the
>> batch interval, as received data is divided into batches of that duration.
>>
>> If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60
>> seconds.
>> If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5
>> * 60 seconds.
>>
>> Your sliding interval should be set to the batch interval, 5 * 60
>> seconds. In other words, that is where the aggregates and summaries for
>> your report come from.
>>
>> What is your data source here?
>>
>> HTH
>>
>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>>
>>> We have some stream data that needs to be calculated, and we are
>>> considering using Spark Streaming to do it.
>>>
>>> We need to generate three kinds of reports, based on:
>>>
>>> 1. The last 5 minutes of data
>>> 2. The last 1 hour of data
>>> 3. The last 24 hours of data
>>>
>>> The frequency of the reports is 5 minutes.
>>>
>>> After reading the docs, the most obvious way to solve this seems to be
>>> to set up a Spark stream with a 5-minute batch interval and two windows
>>> of 1 hour and 1 day.
>>>
>>> But I am worried that windows of one day and one hour are too big. I do
>>> not have much experience with Spark Streaming, so what window lengths do
>>> you use in your environment?
>>>
>>> Any official docs talking about this?
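Putting the thread's numbers together, a hedged end-to-end sketch of the three reports on a 5-minute batch interval (the source, the word-count reduce, and the printed output are placeholders, not the poster's actual job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object ThreeWindowReports {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ThreeWindowReports")
    val ssc = new StreamingContext(conf, Minutes(5)) // batch interval = 5 minutes
    // Checkpointing is required by the invertible reduceByKeyAndWindow below
    ssc.checkpoint("/tmp/three-window-reports")

    // Hypothetical source; replace with Kafka, Flume, etc.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // Report 1: the last 5 minutes is exactly one batch, so no window is needed
    counts.reduceByKey(_ + _).print()

    // Report 2: last hour = 12 batches of 5 minutes; slide every batch
    counts.reduceByKeyAndWindow(_ + _, Minutes(60), Minutes(5)).print()

    // Report 3: last 24 hours = 288 batches. The invertible form adds the new
    // batch and subtracts the one leaving the window, so a full day of data is
    // not re-reduced every 5 minutes (the raw batches are still cached, though,
    // which is the memory concern raised earlier in the thread)
    counts.reduceByKeyAndWindow(_ + _, _ - _, Minutes(24 * 60), Minutes(5)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Whether the 24-hour window is viable still depends on a day's worth of cached batches fitting across the executors, possibly spilling to disk as noted at the top of the thread; for very long horizons, persisting the 5-minute aggregates externally and rolling them up at report time is a common alternative.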