Hi Henry!

I think the idea was to have a dedicated streaming mode as long as the default cluster mode does not support batch and streaming equally well. Once the dedicated streaming mode has reached that level, it will become the default cluster mode. I share your doubts about whether it is a good idea to advertise the streaming mode. It might lead people to think that a Flink cluster can only operate in one of the two modes.
Best,
Max

On Tue, May 26, 2015 at 8:53 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> Ah yes, technically the streaming mode could run batch jobs as well in
> Flink.
> I am thinking that it could cause confusion with users since most
> systems that does batch and stream (well, pretty much Spark ^_^) does
> not differentiate the deployment topologies for the cluster to support
> different modes of applications.
>
> - Henry
>
> On Tue, May 26, 2015 at 11:44 AM, Stephan Ewen <se...@apache.org> wrote:
> > The streaming mode runs batch jobs as well :-)
> >
> > There should be slightly reduced predictability in the memory management
> > in the streaming mode, but otherwise there should not be a problem.
> >
> > So if you want to run mixed workloads, you start the streaming mode.
> >
> > (Note: Currently, the batch mode runs streaming jobs as well, but gives
> > them very little memory. I am thinking of prohibiting that (separate
> > discussion), to prevent people from not noticing that and running a
> > highly sub-optimal Flink setup.)
> >
> > On Tue, May 26, 2015 at 8:26 PM, Henry Saputra <henry.sapu...@gmail.com>
> > wrote:
> >
> >> One immediate concern I have is the deployment topology. With
> >> streaming has its own cluster deployment, this means that in
> >> standalone mode, if ops would like to deploy Flink it has to know what
> >> mode it needs to deploy Flink as, either batch or Streaming. So, if
> >> the use case was to support both batch and streaming, would that mean
> >> the deployment need to separate 2 clusters to support different
> >> applications to run on Flink?
> >>
> >> I think this would be ok if Flink is deployed in YARN or other
> >> resource management platforms like Mesos or Apache Myriad. Maybe
> >> someone, like Robert, could confirm this is the case.
> >>
> >> - Henry
> >>
> >> On Tue, May 26, 2015 at 1:51 AM, Maximilian Michels <m...@apache.org>
> >> wrote:
> >> > +1 great changes coming up! I like the idea that, ultimately, Flink
> >> > will handle streaming and batch programs equally well independently
> >> > of the chosen cluster startup mode.
> >> >
> >> > What is the time frame for these changes?
> >> >
> >> > On Tue, May 26, 2015 at 7:34 AM, Henry Saputra
> >> > <henry.sapu...@gmail.com> wrote:
> >> >
> >> >> Thanks Aljoscha and Stephan, this helps
> >> >>
> >> >> - Henry
> >> >>
> >> >> On Fri, May 22, 2015 at 4:37 AM, Stephan Ewen <se...@apache.org> wrote:
> >> >> > Aljoscha is right. There are plans to migrate the streaming state
> >> >> > to the MemoryManager as well, but streaming state is not managed
> >> >> > at this point.
> >> >> >
> >> >> > What is managed in streaming jobs is the data buffered and cached
> >> >> > in the network stack. But that is a different memory pool than the
> >> >> > memory manager. We keep those pools separate because the network
> >> >> > stack is currently more advanced in terms of dynamically
> >> >> > rebalancing memory, compared to the memory manager.
> >> >> >
> >> >> > On Fri, May 22, 2015 at 12:25 PM, Aljoscha Krettek
> >> >> > <aljos...@apache.org> wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >> streaming currently does not use any memory manager. All state is
> >> >> >> kept in Java Objects on the Java Heap, for example an ArrayList<>
> >> >> >> for the window buffer.
> >> >> >>
> >> >> >> On Thu, May 21, 2015 at 11:56 PM, Henry Saputra
> >> >> >> <henry.sapu...@gmail.com> wrote:
> >> >> >> > Hi Stephan, Gyula, Paris,
> >> >> >> >
> >> >> >> > How does streaming currently different in term of memory
> >> >> >> > management? Currently we only have one MemoryManager which is
> >> >> >> > used by both modes I believe.
> >> >> >> >
> >> >> >> > - Henry
> >> >> >> >
> >> >> >> > On Thu, May 21, 2015 at 12:34 PM, Stephan Ewen
> >> >> >> > <se...@apache.org> wrote:
> >> >> >> >> I discussed a bit via Skype with Gyula and Paris.
> >> >> >> >>
> >> >> >> >> We thought about the following way to do it:
> >> >> >> >>
> >> >> >> >>  - We add a dedicated streaming mode for now. The streaming mode
> >> >> >> >>    supersedes the batch mode, so it can run both type of programs.
> >> >> >> >>
> >> >> >> >>  - The streaming mode sets the memory manager to "lazy allocation".
> >> >> >> >>     -> So long as it runs pure streaming jobs, the full heap will
> >> >> >> >>        be available to window buffers and UDFs.
> >> >> >> >>     -> Batch programs can still run, so mixed workloads are not
> >> >> >> >>        prevented. Batch programs are a bit less robust there,
> >> >> >> >>        because the memory manager does not pre-allocate memory.
> >> >> >> >>        UDFs can eat into Flink's memory portion.
> >> >> >> >>
> >> >> >> >>  - The streaming mode starts the necessary configured
> >> >> >> >>    components/services for state backups
> >> >> >> >>
> >> >> >> >> Over the next versions, we want to bring these things together:
> >> >> >> >>  - use the managed memory for window buffers
> >> >> >> >>  - on-demand starting of the state backend
> >> >> >> >>
> >> >> >> >> Then, we deprecate the streaming mode, let both modes start the
> >> >> >> >> cluster in the same way.
> >> >> >> >>
> >> >> >> >> On Thu, May 21, 2015 at 4:01 PM, Aljoscha Krettek
> >> >> >> >> <aljos...@apache.org> wrote:
> >> >> >> >>
> >> >> >> >>> Would it not be possible to start the snapshot service once the
> >> >> >> >>> user starts the first streaming job? About 2) with checkpointing
> >> >> >> >>> coming up, would it not make sense to shift to managed memory
> >> >> >> >>> rather sooner than later. Then this point would become moot.
> >> >> >> >>>
> >> >> >> >>> On Thu, May 21, 2015 at 3:47 PM, Matthias J. Sax
> >> >> >> >>> <mj...@informatik.hu-berlin.de> wrote:
> >> >> >> >>> > What would be the consequences on "mixed" programs? (If there
> >> >> >> >>> > is any plan to support those?)
> >> >> >> >>> >
> >> >> >> >>> > Would it be necessary to have a third mode? Or would those
> >> >> >> >>> > programs simple run in streaming mode?
> >> >> >> >>> >
> >> >> >> >>> > -Matthias
> >> >> >> >>> >
> >> >> >> >>> > On 05/21/2015 03:12 PM, Stephan Ewen wrote:
> >> >> >> >>> >> Hi all!
> >> >> >> >>> >>
> >> >> >> >>> >> We discussed a while back about introducing a dedicated
> >> >> >> >>> >> streaming mode for Flink. I would like to take a go at this
> >> >> >> >>> >> and implement the changes, but discuss them before.
> >> >> >> >>> >>
> >> >> >> >>> >> Here is a brief summary why we wanted to introduce the
> >> >> >> >>> >> dedicated streaming mode:
> >> >> >> >>> >> Even though both batch and streaming are executed by the same
> >> >> >> >>> >> execution engine,
> >> >> >> >>> >> a streaming setup of Flink varies a bit from a batch setup:
> >> >> >> >>> >>
> >> >> >> >>> >> 1) The streaming cluster starts an additional service to store
> >> >> >> >>> >> the distributed state snapshots.
> >> >> >> >>> >>
> >> >> >> >>> >> 2) Streaming mode uses memory a bit different, so we should
> >> >> >> >>> >> configure the memory manager differently. This difference may
> >> >> >> >>> >> eventually go away.
> >> >> >> >>> >>
> >> >> >> >>> >> Concretely, to implement this, I was thinking about
> >> >> >> >>> >> introducing the following externally visible changes
> >> >> >> >>> >>
> >> >> >> >>> >>  - Additional scripts "start-streaming-cluster.sh" and
> >> >> >> >>> >>    "start-streaming-local.sh"
> >> >> >> >>> >>
> >> >> >> >>> >>  - An execution mode parameter for the TaskManager
> >> >> >> >>> >>    ("batch / streaming")
> >> >> >> >>> >>
> >> >> >> >>> >>  - An execution mode parameter for the JobManager TaskManager
> >> >> >> >>> >>    ("batch / streaming")
> >> >> >> >>> >>
> >> >> >> >>> >>  - All local executors and mini clusters need a flag that
> >> >> >> >>> >>    specifies whether they will start
> >> >> >> >>> >>    a streaming cluster, or a pure batch cluster.
> >> >> >> >>> >>
> >> >> >> >>> >> Anything else that comes to your minds?
> >> >> >> >>> >>
> >> >> >> >>> >> Greetings,
> >> >> >> >>> >> Stephan
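
The difference between a pre-allocating memory manager (batch mode) and the "lazy allocation" setting proposed for the streaming mode can be pictured with a small toy model. This is only a sketch under stated assumptions and is not Flink code: the class `SketchMemoryManager`, its methods, and the segment counts and sizes are all hypothetical names and values chosen for illustration.

```python
class SketchMemoryManager:
    """Toy model of a managed-memory pool (hypothetical, not Flink's MemoryManager)."""

    def __init__(self, num_segments, segment_size, lazy):
        self.num_segments = num_segments
        self.segment_size = segment_size
        self.lazy = lazy
        self.in_use = 0
        # Pre-allocation reserves every segment up front; lazy allocation
        # starts with an empty pool and creates segments on demand.
        self.pool = [] if lazy else [
            bytearray(segment_size) for _ in range(num_segments)
        ]

    def reserved_bytes(self):
        """Bytes currently held by the manager (pooled plus handed out)."""
        return (len(self.pool) + self.in_use) * self.segment_size

    def allocate(self):
        """Hand out one segment, failing once the configured budget is used up."""
        if self.in_use >= self.num_segments:
            raise MemoryError("managed memory exhausted")
        self.in_use += 1
        return self.pool.pop() if self.pool else bytearray(self.segment_size)

    def release(self, segment):
        """Return a segment; only the pre-allocating variant keeps it pooled."""
        self.in_use -= 1
        if not self.lazy:
            self.pool.append(segment)
        # In lazy mode the segment is dropped and can be garbage collected.


eager = SketchMemoryManager(num_segments=4, segment_size=1024, lazy=False)
lazy = SketchMemoryManager(num_segments=4, segment_size=1024, lazy=True)
print(eager.reserved_bytes())  # 4096: memory is claimed before any job runs
print(lazy.reserved_bytes())   # 0: the heap stays free for window buffers and UDFs
```

In this model the lazy variant leaves the heap available until a batch operator actually asks for segments, which mirrors the trade-off described in the thread: more room for streaming state, but no guarantee that the memory is still free when a batch job needs it.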