About question 1, scheduling the job only once for an iterative computation is one of the factors causing the performance difference. Dongwon's slides [1] would be helpful for understanding the other performance factors.
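
To make the single-scheduling point concrete, here is a minimal sketch using the Scala DataSet API (the data and the step function are made up): the whole loop below is compiled into one job graph, so the cluster schedules it exactly once.

    import org.apache.flink.api.scala._

    object SingleJobIteration {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Made-up starting data; any DataSet works here.
        val initial: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0)

        // All 100 iterations live inside ONE job graph: no job is
        // submitted per iteration.
        val result = initial.iterate(100) { current =>
          current.map(_ / 2.0)  // step function, applied every iteration
        }

        // print() acts as a sink; only now is the (single) job executed.
        result.print()
      }
    }
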
[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink

> On Dec 31, 2015, at 9:37 AM, Stephan Ewen <se...@apache.org> wrote:
>
> Concerning question (2):
>
> DataSets in Flink are in most cases not materialized at all; they
> represent in-flight data as it is being streamed from one operation to
> the next (remember, Flink is streaming at its core). So even in a
> MapReduce-style program, the DataSet produced by the MapFunction never
> exists as a whole, but is continuously produced and streamed to the
> ReduceFunction.
>
> The operator that executes the ReduceFunction materializes the data as
> part of its sorting operation. All materializing batch operations
> (sort / hash / cache / ...) can go out of core very reliably.
>
> Greetings,
> Stephan
>
>
> On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder
> <sourav.mazumde...@gmail.com> wrote:
> Hi Aljoscha and Chiwan,
>
> Firstly, thanks for the inputs.
>
> A couple of follow-ups -
>
> 1. Based on Chiwan's explanation and the links, my understanding is
> that a performance difference between Spark and Flink (during an
> iterative computation like building a model with a machine learning
> algorithm) may arise across two iterations because of the overhead of
> starting a new set of tasks/operators. Other overheads would be the
> same, as both store the intermediate results in memory. Is this
> understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to contain
> data which is, volume-wise, more than the total memory available in all
> the slave nodes? Will it spill the data to the disks of the respective
> slave nodes by default?
>
> Regards,
> Sourav
>
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanp...@apache.org> wrote:
> Hi Filip,
>
> Spark also executes jobs lazily, but it is slightly different from
> Flink: Flink can lazily execute a whole job that Spark cannot. One
> example is an iterative job.
>
> In Spark, each stage of the iteration is submitted, scheduled as a job,
> and executed, because an action is called at the end of each iteration.
> In Flink, although the job contains an iteration, the user submits only
> one job. The Flink cluster schedules and runs the job once.
>
> Because of this difference, in Spark the user must decide more things,
> such as "Which RDDs are cached or uncached?".
>
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer
> on SO [2] would be helpful to understand these differences. :)
>
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
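
To contrast, a rough Spark sketch of the point above (the data, loop body, and iteration count are invented): every action inside the loop triggers its own job submission, and the user has to decide explicitly which RDDs to cache and when to unpersist them.

    import org.apache.spark.{SparkConf, SparkContext}

    object PerIterationJobs {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Made-up data and update step.
        var current = sc.parallelize(1 to 1000).map(_.toDouble)
        current.cache()       // user's decision: keep this RDD in memory

        for (_ <- 1 to 100) {
          val next = current.map(_ / 2.0)
          next.cache()
          next.count()        // an action: this iteration runs as its own job
          current.unpersist() // user's decision: drop the previous RDD
          current = next
        }

        sc.stop()
      }
    }
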
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczy...@gmail.com> wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for going a little off-topic, but I wanted to clarify whether my
> > understanding is right. You said that "Contrary to Spark, a Flink job
> > is executed lazily"; however, as I read in the available sources, for
> > example http://spark.apache.org/docs/latest/programming-guide.html,
> > chapter "RDD operations": "The transformations are only computed when
> > an action requires a result to be returned to the driver program." To
> > my understanding, Spark implements the same lazy execution principle
> > as Flink, that is, the job is only executed when a data
> > sink/action/execute is called, and before that only an execution plan
> > is built. Is that correct, or are there other significant differences
> > between the Spark and Flink lazy execution approaches that I failed to
> > grasp?
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:
> > Hi Sourav,
> > you are right, in Flink the equivalent of an RDD would be a DataSet
> > (or a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when
> > ExecutionEnvironment.execute() is called. Only then does Flink build
> > an executable program from the graph of transformations that was built
> > by calling the transformation methods on DataSet. That's why I called
> > it lazy. The operations will also be automatically parallelized. The
> > parallelism of operations can be configured in the cluster
> > configuration (conf/flink-conf.yaml), on a per-job basis
> > (ExecutionEnvironment.setParallelism(int)), or per operation, by
> > calling setParallelism(int) on a DataSet.
> >
> > (Above, you can always replace DataSet by DataStream; the same
> > explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the
> > file (or files in a directory) will be parallelized across several
> > worker nodes based on the previously mentioned settings.
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
> > Hi,
> >
> > I am new to Flink and trying to understand some of its basics.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding,
> > the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data
> > (val lines: DataSet[String] = env.readTextFile(<some file path and name>)),
> > which may not fit in a single slave node, will that data get
> > automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
>
> Regards,
> Chiwan Park

Regards,
Chiwan Park
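
To round off Aljoscha's answer about parallelism, a minimal sketch of the configuration levels he describes (the path and parallelism values are placeholders):

    import org.apache.flink.api.scala._

    object ParallelismSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(8)  // per-job setting; overrides conf/flink-conf.yaml

        // The source itself runs in parallel: each subtask reads a split of
        // the input, so a large file is distributed across the workers.
        val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .groupBy(0)
          .sum(1)
          .setParallelism(4)   // per-operator setting, only for this operator

        counts.print()  // triggers lazy execution of the whole plan
      }
    }
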