About question 1, scheduling the job only once for an iterative computation is one of the factors causing the performance difference. Dongwon's slides [1] would be helpful for understanding the other performance factors.
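
To make the single-scheduling point concrete, here is a minimal sketch using the Scala DataSet API (the data and the step function are made up): the whole loop below is compiled into one job graph, so the cluster schedules it exactly once.

    import org.apache.flink.api.scala._

    object SingleJobIteration {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Made-up starting data; any DataSet works here.
        val initial: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0)

        // All 100 iterations live inside ONE job graph: no job is
        // submitted per iteration.
        val result = initial.iterate(100) { current =>
          current.map(_ / 2.0)  // step function, applied every iteration
        }

        // print() acts as a sink; only now is the (single) job executed.
        result.print()
      }
    }
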
[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink

> On Dec 31, 2015, at 9:37 AM, Stephan Ewen <se...@apache.org> wrote:
>
> Concerning question (2):
>
> DataSets in Flink are in most cases not materialized at all; they
> represent in-flight data as it is being streamed from one operation to
> the next (remember, Flink is streaming at its core). So even in a
> MapReduce-style program, the DataSet produced by the MapFunction never
> exists as a whole, but is continuously produced and streamed to the
> ReduceFunction.
>
> The operator that executes the ReduceFunction materializes the data as
> part of its sorting operation. All materializing batch operations
> (sort / hash / cache / ...) can go out of core very reliably.
>
> Greetings,
> Stephan
>
>
> On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder
> <sourav.mazumde...@gmail.com> wrote:
> Hi Aljoscha and Chiwan,
>
> Firstly, thanks for the inputs.
>
> A couple of follow-ups -
>
> 1. Based on Chiwan's explanation and the links, my understanding is
> that a performance difference between Spark and Flink (during an
> iterative computation like building a model with a machine learning
> algorithm) may arise across two iterations because of the overhead of
> starting a new set of tasks/operators. Other overheads would be the
> same, as both store the intermediate results in memory. Is this
> understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to contain
> data which is, volume-wise, more than the total memory available in all
> the slave nodes? Will it spill the data to the disks of the respective
> slave nodes by default?
>
> Regards,
> Sourav
>
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanp...@apache.org> wrote:
> Hi Filip,
>
> Spark also executes jobs lazily, but it is slightly different from
> Flink: Flink can lazily execute a whole job that Spark cannot. One
> example is an iterative job.
>
> In Spark, each stage of the iteration is submitted, scheduled as a job,
> and executed, because an action is called at the end of each iteration.
> In Flink, although the job contains an iteration, the user submits only
> one job. The Flink cluster schedules and runs the job once.
>
> Because of this difference, in Spark the user must decide more things,
> such as "Which RDDs are cached or uncached?".
>
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer
> on SO [2] would be helpful to understand these differences. :)
>
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
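
To contrast, a rough Spark sketch of the point above (the data, loop body, and iteration count are invented): every action inside the loop triggers its own job submission, and the user has to decide explicitly which RDDs to cache and when to unpersist them.

    import org.apache.spark.{SparkConf, SparkContext}

    object PerIterationJobs {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Made-up data and update step.
        var current = sc.parallelize(1 to 1000).map(_.toDouble)
        current.cache()       // user's decision: keep this RDD in memory

        for (_ <- 1 to 100) {
          val next = current.map(_ / 2.0)
          next.cache()
          next.count()        // an action: this iteration runs as its own job
          current.unpersist() // user's decision: drop the previous RDD
          current = next
        }

        sc.stop()
      }
    }
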
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczy...@gmail.com> wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for going a little off-topic, but I wanted to clarify whether my
> > understanding is right. You said that "Contrary to Spark, a Flink job
> > is executed lazily"; however, as I read in the available sources, for
> > example http://spark.apache.org/docs/latest/programming-guide.html,
> > chapter "RDD operations": "The transformations are only computed when
> > an action requires a result to be returned to the driver program." To
> > my understanding, Spark implements the same lazy execution principle
> > as Flink, that is, the job is only executed when a data
> > sink/action/execute is called, and before that only an execution plan
> > is built. Is that correct, or are there other significant differences
> > between the Spark and Flink lazy execution approaches that I failed to
> > grasp?
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:
> > Hi Sourav,
> > you are right, in Flink the equivalent of an RDD would be a DataSet
> > (or a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when
> > ExecutionEnvironment.execute() is called. Only then does Flink build
> > an executable program from the graph of transformations that was built
> > by calling the transformation methods on DataSet. That's why I called
> > it lazy. The operations will also be automatically parallelized. The
> > parallelism of operations can be configured in the cluster
> > configuration (conf/flink-conf.yaml), on a per-job basis
> > (ExecutionEnvironment.setParallelism(int)), or per operation, by
> > calling setParallelism(int) on a DataSet.
> >
> > (Above, you can always replace DataSet by DataStream; the same
> > explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the
> > file (or files in a directory) will be parallelized across several
> > worker nodes based on the previously mentioned settings.
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
> > Hi,
> >
> > I am new to Flink and trying to understand some of its basics.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding,
> > the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data
> > (val lines: DataSet[String] = env.readTextFile(<some file path and name>)),
> > which may not fit in a single slave node, will that data get
> > automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
>
> Regards,
> Chiwan Park

Regards,
Chiwan Park
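
To round off Aljoscha's answer about parallelism, a minimal sketch of the configuration levels he describes (the path and parallelism values are placeholders):

    import org.apache.flink.api.scala._

    object ParallelismSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(8)  // per-job setting; overrides conf/flink-conf.yaml

        // The source itself runs in parallel: each subtask reads a split of
        // the input, so a large file is distributed across the workers.
        val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .groupBy(0)
          .sum(1)
          .setParallelism(4)   // per-operator setting, only for this operator

        counts.print()  // triggers lazy execution of the whole plan
      }
    }
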