+1 great changes coming up! I like the idea that, ultimately, Flink will handle streaming and batch programs equally well independently of the chosen cluster startup mode.
What is the time frame for these changes? On Tue, May 26, 2015 at 7:34 AM, Henry Saputra <henry.sapu...@gmail.com> wrote: > Thanks Aljoscha and Stephan, this helps > > - Henry > > On Fri, May 22, 2015 at 4:37 AM, Stephan Ewen <se...@apache.org> wrote: > > Aljoscha is right. There are plans to migrate the streaming state to the > > MemoryManager as well, but streaming state is not managed at this point. > > > > What is managed in streaming jobs is the data buffered and cached in the > > network stack. But that is a different memory pool than the memory > manager. > > We keep those pools separate because the network stack is currently more > > advanced in terms of dynamically rebalancing memory, compared to the > memory > > manager. > > > > On Fri, May 22, 2015 at 12:25 PM, Aljoscha Krettek <aljos...@apache.org> > > wrote: > > > >> Hi, > >> streaming currently does not use any memory manager. All state is kept > >> in Java Objects on the Java Heap, for example an ArrayList<> for the > >> window buffer. > >> > >> On Thu, May 21, 2015 at 11:56 PM, Henry Saputra < > henry.sapu...@gmail.com> > >> wrote: > >> > Hi Stephan, Gyula, Paris, > >> > > >> > How does streaming currently different in term of memory management? > >> > Currently we only have one MemoryManager which is used by both modes I > >> > believe. > >> > > >> > - Henry > >> > > >> > On Thu, May 21, 2015 at 12:34 PM, Stephan Ewen <se...@apache.org> > wrote: > >> >> I discussed a bit via Skype with Gyula and Paris. > >> >> > >> >> > >> >> We thought about the following way to do it: > >> >> > >> >> - We add a dedicated streaming mode for now. The streaming mode > >> supersedes > >> >> the batch mode, so it can run both type of programs. > >> >> > >> >> - The streaming mode sets the memory manager to "lazy allocation". > >> >> -> So long as it runs pure streaming jobs, the full heap will be > >> >> available to window buffers and UDFs. > >> >> -> Batch programs can still run, so mixed workloads are not > >> prevented. > >> >> Batch programs are a bit less robust there, because the memory > manager > >> does > >> >> not pre-allocate memory. UDFs can eat into Flink's memory portion. > >> >> > >> >> - The streaming mode starts the necessary configured > >> components/services > >> >> for state backups > >> >> > >> >> > >> >> > >> >> Over the next versions, we want to bring these things together: > >> >> - use the managed memory for window buffers > >> >> - on-demand starting of the state backend > >> >> > >> >> Then, we deprecate the streaming mode, let both modes start the > cluster > >> in > >> >> the same way. > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> On Thu, May 21, 2015 at 4:01 PM, Aljoscha Krettek < > aljos...@apache.org> > >> >> wrote: > >> >> > >> >>> Would it not be possible to start the snapshot service once the user > >> >>> starts the first streaming job? About 2) with checkpointing coming > up, > >> >>> would it not make sense to shift to managed memory rather sooner > than > >> >>> later. Then this point would become moot. > >> >>> > >> >>> On Thu, May 21, 2015 at 3:47 PM, Matthias J. Sax > >> >>> <mj...@informatik.hu-berlin.de> wrote: > >> >>> > What would be the consequences on "mixed" programs? (If there is > any > >> >>> > plan to support those?) > >> >>> > > >> >>> > Would it be necessary to have a third mode? Or would those > programs > >> >>> > simple run in streaming mode? > >> >>> > > >> >>> > -Matthias > >> >>> > > >> >>> > On 05/21/2015 03:12 PM, Stephan Ewen wrote: > >> >>> >> Hi all! > >> >>> >> > >> >>> >> We discussed a while back about introducing a dedicated streaming > >> mode > >> >>> for > >> >>> >> Flink. I would like to take a go at this and implement the > changes, > >> but > >> >>> >> discuss them before. > >> >>> >> > >> >>> >> > >> >>> >> Here is a brief summary why we wanted to introduce the dedicated > >> >>> streaming > >> >>> >> mode: > >> >>> >> Even though both batch and streaming are executed by the same > >> execution > >> >>> >> engine, > >> >>> >> a streaming setup of Flink varies a bit from a batch setup: > >> >>> >> > >> >>> >> 1) The streaming cluster starts an additional service to store > the > >> >>> >> distributed state snapshots. > >> >>> >> > >> >>> >> 2) Streaming mode uses memory a bit different, so we should > >> configure > >> >>> the > >> >>> >> memory manager differently. This difference may eventually go > away. > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> Concretely, to implement this, I was thinking about introducing > the > >> >>> >> following externally visible changes > >> >>> >> > >> >>> >> - Additional scripts "start-streaming-cluster.sh" and > >> >>> >> "start-streaming-local.sh" > >> >>> >> > >> >>> >> - An execution mode parameter for the TaskManager ("batch / > >> streaming") > >> >>> >> > >> >>> >> - An execution mode parameter for the JobManager TaskManager > >> ("batch / > >> >>> >> streaming") > >> >>> >> > >> >>> >> - All local executors and mini clusters need a flag that > specifies > >> >>> whether > >> >>> >> they will start > >> >>> >> a streaming cluster, or a pure batch cluster. > >> >>> >> > >> >>> >> > >> >>> >> Anything else that comes to your minds? > >> >>> >> > >> >>> >> > >> >>> >> Greetings, > >> >>> >> Stephan > >> >>> >> > >> >>> > > >> >>> > >> >