Concerning question (2): DataSets in Flink are in most cases not materialized at all; they represent in-flight data as it is streamed from one operation to the next (remember, Flink is streaming at its core). So even in a MapReduce-style program, the DataSet produced by the MapFunction never exists as a whole, but is continuously produced and streamed to the ReduceFunction.
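To illustrate the pipelining, a minimal MapReduce-style word count in the Scala DataSet API (the input path is hypothetical); the flatMap output is streamed record by record into the grouping/sum operator rather than materialized:

```scala
import org.apache.flink.api.scala._

object PipelinedWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // This intermediate DataSet is never materialized as a whole:
    // each record flows from the flatMap/map operators straight into
    // the grouping operator, whose sort may spill to disk if needed.
    val words: DataSet[(String, Int)] =
      env.readTextFile("hdfs:///path/to/input")   // hypothetical path
        .flatMap(_.toLowerCase.split("\\W+"))
        .map(w => (w, 1))

    val counts = words.groupBy(0).sum(1)

    counts.print()  // print() is a sink and triggers execution
  }
}
```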
The operator that executes the ReduceFunction materializes the data as part of its sorting operation. All materializing batch operations (sort / hash / cache / ...) can go out of core very reliably.

Greetings,
Stephan

On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:

> Hi Aljoscha and Chiwan,
>
> First, thanks for the inputs.
>
> A couple of follow-ups:
>
> 1. Based on Chiwan's explanation and the links, my understanding is that a
> potential performance difference between Spark and Flink (during iterative
> computation, such as building a model with a machine learning algorithm)
> may arise across iterations because of the overhead of starting a new set
> of tasks/operators. Other overheads would be the same, as both store the
> intermediate results in memory. Is this understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to contain more
> data (volume-wise) than the total memory available across all the slave
> nodes? Will it serialize the data to the disks of the respective slave
> nodes by default?
>
> Regards,
> Sourav
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> Hi Filip,
>>
>> Spark also executes jobs lazily, but it is slightly different from Flink.
>> Flink can lazily execute a whole job in cases where Spark cannot. One
>> example is an iterative job.
>>
>> In Spark, each stage of the iteration is submitted, scheduled and
>> executed as a separate job, because an action is called at the end of
>> each iteration. In Flink, even if the job contains an iteration, the
>> user submits only one job; the Flink cluster schedules and runs that
>> job once.
>>
>> Because of this difference, in Spark the user must decide more things,
>> such as which RDDs to cache or uncache.
>>
>> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer
>> on SO [2] should help in understanding these differences.
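To make the iteration point concrete, a sketch of Flink's native bulk iteration in the Scala DataSet API (the update step is a placeholder); the whole loop is part of one dataflow graph, so a single job covers all iterations, with no per-iteration scheduling or caching decisions:

```scala
import org.apache.flink.api.scala._

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val initial: DataSet[Double] = env.fromElements(1.0)

    // One submitted job runs all 100 iterations inside the cluster,
    // in contrast to launching one Spark job per iteration.
    val result = initial.iterate(100) { current =>
      current.map(x => x / 2 + 1)  // placeholder update step
    }

    result.print()
  }
}
```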
>> :)
>>
>> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
>> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
>>
>> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczy...@gmail.com> wrote:
>> >
>> > Hi Aljoscha,
>> >
>> > Sorry for going a little off-topic, but I wanted to clarify whether my
>> > understanding is right. You said that "Contrary to Spark, a Flink job
>> > is executed lazily"; however, the available sources, for example the
>> > "RDD Operations" chapter of
>> > http://spark.apache.org/docs/latest/programming-guide.html, say: "The
>> > transformations are only computed when an action requires a result to
>> > be returned to the driver program." To my understanding, Spark
>> > implements the same lazy-execution principle as Flink: the job is only
>> > executed when a data sink/action/execute is called, and before that
>> > only an execution plan is built. Is that correct, or are there other
>> > significant differences between the Spark and Flink lazy-execution
>> > approaches that I failed to grasp?
>> >
>> > Best regards,
>> > Filip Łęczycki
>> >
>> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:
>> > Hi Sourav,
>> > you are right, in Flink the equivalent of an RDD is a DataSet (or a
>> > DataStream if you are working with the streaming API).
>> >
>> > Contrary to Spark, a Flink job is executed lazily when
>> > ExecutionEnvironment.execute() is called. Only then does Flink build
>> > an executable program from the graph of transformations that was built
>> > by calling the transformation methods on DataSet. That's why I called
>> > it lazy. The operations will also be automatically parallelized.
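The lazy execution described above can be sketched as follows (Scala DataSet API; the paths are hypothetical):

```scala
import org.apache.flink.api.scala._

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // These calls only build the transformation graph; nothing is
    // read or computed yet.
    val lines    = env.readTextFile("file:///tmp/input.txt")  // hypothetical path
    val filtered = lines.filter(_.nonEmpty)
    filtered.writeAsText("file:///tmp/output")                // hypothetical path

    // Only now is the plan optimized, parallelized and executed.
    env.execute("lazy execution sketch")
  }
}
```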
>> > The parallelism of operations can either be configured in the cluster
>> > configuration (conf/flink-conf.yaml), on a per-job basis
>> > (ExecutionEnvironment.setParallelism(int)), or per operation, by
>> > calling setParallelism(int) on a DataSet.
>> >
>> > (Above you can always replace DataSet by DataStream; the same
>> > explanations hold.)
>> >
>> > So, to get back to your question: yes, the operation of reading the
>> > file (or the files in a directory) will be parallelized across several
>> > worker nodes based on the previously mentioned settings.
>> >
>> > Let us know if you need more information.
>> >
>> > Cheers,
>> > Aljoscha
>> >
>> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am new to Flink and trying to understand some of its basics.
>> >
>> > What is the equivalent of Spark's RDD in Flink? In my understanding
>> > the closest thing is the DataSet API, but I wanted to reconfirm.
>> >
>> > Also, using the DataSet API, if I ingest a large volume of data
>> > (val lines : DataSet[String] = env.readTextFile(<some file path and name>)),
>> > which may not fit in a single slave node, will that data get
>> > automatically distributed in the memory of the other slave nodes?
>> >
>> > Regards,
>> > Sourav
>>
>> Regards,
>> Chiwan Park
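To make the parallelism options above concrete, a sketch combining the per-job and per-operator settings (Scala DataSet API; the HDFS paths and parallelism values are hypothetical):

```scala
import org.apache.flink.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(8)  // per-job default, overrides conf/flink-conf.yaml

    // The source runs with 8 parallel subtasks, each reading a
    // different split of the input, so a large file is spread across
    // the worker nodes rather than loaded on one machine.
    val lines: DataSet[String] =
      env.readTextFile("hdfs:///path/to/big/input")  // hypothetical path

    val upper = lines
      .map(_.toUpperCase)
      .setParallelism(4)  // per-operator override

    upper.writeAsText("hdfs:///path/to/out")         // hypothetical path
    env.execute("parallelism sketch")
  }
}
```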