Ah yes, technically the streaming mode can run batch jobs as well in Flink. My concern is that it could confuse users, since most systems that do both batch and streaming (well, pretty much Spark ^_^) do not require different cluster deployment topologies to support different kinds of applications.
- Henry

On Tue, May 26, 2015 at 11:44 AM, Stephan Ewen <se...@apache.org> wrote:
> The streaming mode runs batch jobs as well :-)
>
> There should be slightly reduced predictability in the memory management in
> the streaming mode, but otherwise there should not be a problem.
>
> So if you want to run mixed workloads, you start the streaming mode.
>
> (Note: Currently, the batch mode runs streaming jobs as well, but gives
> them very little memory. I am thinking of prohibiting that (separate
> discussion), to prevent people from not noticing that and running a highly
> sub-optimal Flink setup.)
>
> On Tue, May 26, 2015 at 8:26 PM, Henry Saputra <henry.sapu...@gmail.com>
> wrote:
>
>> One immediate concern I have is the deployment topology. With
>> streaming has its own cluster deployment, this means that in
>> standalone mode, if ops would like to deploy Flink it has to know what
>> mode it needs to deploy Flink as, either batch or Streaming. So, if
>> the use case was to support both batch and streaming, would that mean
>> the deployment need to separate 2 clusters to support different
>> applications to run on Flink?
>>
>> I think this would be ok if Flink is deployed in YARN or other
>> resource management platforms like Mesos or Apache Myriad. Maybe
>> someone, like Robert, could confirm this is the case.
>>
>> - Henry
>>
>> On Tue, May 26, 2015 at 1:51 AM, Maximilian Michels <m...@apache.org>
>> wrote:
>> > +1 great changes coming up! I like the idea that, ultimately, Flink will
>> > handle streaming and batch programs equally well independently of the
>> > chosen cluster startup mode.
>> >
>> > What is the time frame for these changes?
>> >
>> > On Tue, May 26, 2015 at 7:34 AM, Henry Saputra <henry.sapu...@gmail.com>
>> > wrote:
>> >
>> >> Thanks Aljoscha and Stephan, this helps
>> >>
>> >> - Henry
>> >>
>> >> On Fri, May 22, 2015 at 4:37 AM, Stephan Ewen <se...@apache.org> wrote:
>> >> > Aljoscha is right. There are plans to migrate the streaming state to the
>> >> > MemoryManager as well, but streaming state is not managed at this point.
>> >> >
>> >> > What is managed in streaming jobs is the data buffered and cached in the
>> >> > network stack. But that is a different memory pool than the memory manager.
>> >> > We keep those pools separate because the network stack is currently more
>> >> > advanced in terms of dynamically rebalancing memory, compared to the memory
>> >> > manager.
>> >> >
>> >> > On Fri, May 22, 2015 at 12:25 PM, Aljoscha Krettek <aljos...@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Hi,
>> >> >> streaming currently does not use any memory manager. All state is kept
>> >> >> in Java Objects on the Java Heap, for example an ArrayList<> for the
>> >> >> window buffer.
>> >> >>
>> >> >> On Thu, May 21, 2015 at 11:56 PM, Henry Saputra <henry.sapu...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Stephan, Gyula, Paris,
>> >> >> >
>> >> >> > How does streaming currently different in term of memory management?
>> >> >> > Currently we only have one MemoryManager which is used by both modes I
>> >> >> > believe.
>> >> >> >
>> >> >> > - Henry
>> >> >> >
>> >> >> > On Thu, May 21, 2015 at 12:34 PM, Stephan Ewen <se...@apache.org> wrote:
>> >> >> >> I discussed a bit via Skype with Gyula and Paris.
>> >> >> >>
>> >> >> >> We thought about the following way to do it:
>> >> >> >>
>> >> >> >> - We add a dedicated streaming mode for now. The streaming mode supersedes
>> >> >> >> the batch mode, so it can run both type of programs.
>> >> >> >>
>> >> >> >> - The streaming mode sets the memory manager to "lazy allocation".
>> >> >> >>   -> So long as it runs pure streaming jobs, the full heap will be
>> >> >> >>   available to window buffers and UDFs.
>> >> >> >>   -> Batch programs can still run, so mixed workloads are not prevented.
>> >> >> >>   Batch programs are a bit less robust there, because the memory manager does
>> >> >> >>   not pre-allocate memory. UDFs can eat into Flink's memory portion.
>> >> >> >>
>> >> >> >> - The streaming mode starts the necessary configured components/services
>> >> >> >> for state backups
>> >> >> >>
>> >> >> >> Over the next versions, we want to bring these things together:
>> >> >> >> - use the managed memory for window buffers
>> >> >> >> - on-demand starting of the state backend
>> >> >> >>
>> >> >> >> Then, we deprecate the streaming mode, let both modes start the cluster in
>> >> >> >> the same way.
>> >> >> >>
>> >> >> >> On Thu, May 21, 2015 at 4:01 PM, Aljoscha Krettek <aljos...@apache.org>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >>> Would it not be possible to start the snapshot service once the user
>> >> >> >>> starts the first streaming job? About 2) with checkpointing coming up,
>> >> >> >>> would it not make sense to shift to managed memory rather sooner than
>> >> >> >>> later. Then this point would become moot.
>> >> >> >>>
>> >> >> >>> On Thu, May 21, 2015 at 3:47 PM, Matthias J. Sax
>> >> >> >>> <mj...@informatik.hu-berlin.de> wrote:
>> >> >> >>> > What would be the consequences on "mixed" programs? (If there is any
>> >> >> >>> > plan to support those?)
>> >> >> >>> >
>> >> >> >>> > Would it be necessary to have a third mode? Or would those programs
>> >> >> >>> > simple run in streaming mode?
>> >> >> >>> >
>> >> >> >>> > -Matthias
>> >> >> >>> >
>> >> >> >>> > On 05/21/2015 03:12 PM, Stephan Ewen wrote:
>> >> >> >>> >> Hi all!
>> >> >> >>> >>
>> >> >> >>> >> We discussed a while back about introducing a dedicated streaming mode for
>> >> >> >>> >> Flink. I would like to take a go at this and implement the changes, but
>> >> >> >>> >> discuss them before.
>> >> >> >>> >>
>> >> >> >>> >> Here is a brief summary why we wanted to introduce the dedicated streaming
>> >> >> >>> >> mode:
>> >> >> >>> >> Even though both batch and streaming are executed by the same execution
>> >> >> >>> >> engine,
>> >> >> >>> >> a streaming setup of Flink varies a bit from a batch setup:
>> >> >> >>> >>
>> >> >> >>> >> 1) The streaming cluster starts an additional service to store the
>> >> >> >>> >> distributed state snapshots.
>> >> >> >>> >>
>> >> >> >>> >> 2) Streaming mode uses memory a bit different, so we should configure the
>> >> >> >>> >> memory manager differently. This difference may eventually go away.
>> >> >> >>> >>
>> >> >> >>> >> Concretely, to implement this, I was thinking about introducing the
>> >> >> >>> >> following externally visible changes
>> >> >> >>> >>
>> >> >> >>> >> - Additional scripts "start-streaming-cluster.sh" and
>> >> >> >>> >> "start-streaming-local.sh"
>> >> >> >>> >>
>> >> >> >>> >> - An execution mode parameter for the TaskManager ("batch / streaming")
>> >> >> >>> >>
>> >> >> >>> >> - An execution mode parameter for the JobManager TaskManager ("batch /
>> >> >> >>> >> streaming")
>> >> >> >>> >>
>> >> >> >>> >> - All local executors and mini clusters need a flag that specifies whether
>> >> >> >>> >> they will start
>> >> >> >>> >> a streaming cluster, or a pure batch cluster.
>> >> >> >>> >>
>> >> >> >>> >> Anything else that comes to your minds?
>> >> >> >>> >>
>> >> >> >>> >> Greetings,
>> >> >> >>> >> Stephan
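
For anyone following the thread, the "lazy allocation" idea from Stephan's proposal could be sketched roughly as below. This is only an illustration and not Flink's actual MemoryManager: the names ExecutionModeSketch, ExecutionMode, SketchMemoryManager and forMode are made up for this example, and real managed memory carries far more bookkeeping. The sketch assumes that batch mode reserves its whole pool eagerly at startup, while streaming mode creates a segment only when one is first requested, so the heap stays available to window buffers and UDFs until a batch job actually asks for managed memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are part of Flink's API.
final class ExecutionModeSketch {

    // The two cluster startup modes discussed in the thread.
    enum ExecutionMode { BATCH, STREAMING }

    // Toy memory manager handing out fixed-size byte[] "segments".
    static final class SketchMemoryManager {
        private final int totalSegments;
        private final int segmentSize;
        private final List<byte[]> pool = new ArrayList<>();
        private int handedOut = 0;

        SketchMemoryManager(int totalSegments, int segmentSize, boolean preAllocate) {
            this.totalSegments = totalSegments;
            this.segmentSize = segmentSize;
            if (preAllocate) {
                // Eager: reserve everything up front, so UDFs cannot eat into it.
                for (int i = 0; i < totalSegments; i++) {
                    pool.add(new byte[segmentSize]);
                }
            }
        }

        // Returns a segment, creating it on demand when the pool is empty.
        byte[] allocate() {
            if (handedOut >= totalSegments) {
                throw new IllegalStateException("managed memory exhausted");
            }
            handedOut++;
            if (!pool.isEmpty()) {
                return pool.remove(pool.size() - 1);
            }
            return new byte[segmentSize]; // lazy allocation path
        }

        // Returns a segment to the pool for reuse.
        void release(byte[] segment) {
            if (handedOut > 0) {
                handedOut--;
            }
            pool.add(segment);
        }

        // Segments currently held by the manager but not handed out.
        int pooledSegments() {
            return pool.size();
        }
    }

    // Assumption for this sketch: batch mode pre-allocates, streaming mode is lazy.
    static SketchMemoryManager forMode(ExecutionMode mode, int totalSegments, int segmentSize) {
        return new SketchMemoryManager(totalSegments, segmentSize, mode == ExecutionMode.BATCH);
    }

    public static void main(String[] args) {
        SketchMemoryManager batch = forMode(ExecutionMode.BATCH, 4, 1024);
        SketchMemoryManager streaming = forMode(ExecutionMode.STREAMING, 4, 1024);
        System.out.println("batch pooled at startup: " + batch.pooledSegments());
        System.out.println("streaming pooled at startup: " + streaming.pooledSegments());
    }
}
```

In this sketch a mixed workload on a streaming-mode manager still gets its segments, just later and with less predictability, which is the trade-off described in the thread.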