Hi Aljoscha and Chiwan,

First of all, thanks for the inputs.

A couple of follow-ups:

1. Based on Chiwan's explanation and the links, my understanding is that a
performance difference between Spark and Flink may show up during iterative
computation (e.g., building a model with a machine learning algorithm)
because Spark incurs the overhead of starting a new set of tasks/operators
for each iteration. The other overheads should be the same, as both store
the intermediate results in memory. Is this understanding correct?

2. In the case of Flink, what happens if a DataSet needs to hold more data
than the total memory available across all the slave nodes? Will it spill
the data to the disks of the respective slave nodes by default?

Regards,
Sourav


On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanp...@apache.org> wrote:

> Hi Filip,
>
> Spark also executes jobs lazily, but it is slightly different from Flink.
> Flink can lazily execute a whole job in cases where Spark cannot. One
> example is an iterative job.
>
> In Spark, each iteration is submitted, scheduled, and executed as its own
> job, because an action is called at the end of each iteration. In Flink,
> even when the job contains an iteration, the user submits only one job,
> and the Flink cluster schedules and runs that job once.
>
> Because of this difference, in Spark the user must decide more things,
> such as which RDDs to cache or uncache.
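>
> For example, here is a minimal sketch of the contrast using the Scala
> APIs (the step function, the data, and the iteration counts are made up
> for illustration, not taken from a real program):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.flink.api.scala._
>
>   // Spark: an action at the end of each pass submits a new job each
>   // time, and the user must decide which RDDs to cache.
>   val sc = new SparkContext(
>     new SparkConf().setAppName("iter-sketch").setMaster("local[*]"))
>   var current = sc.parallelize(1 to 1000).map(_.toDouble).cache()
>   for (_ <- 1 to 100) {
>     current = current.map(_ + 1.0).cache()
>     current.count() // action: a separate job is scheduled and run
>   }
>
>   // Flink: the whole loop is part of a single job graph, submitted
>   // and scheduled once by the cluster.
>   val env = ExecutionEnvironment.getExecutionEnvironment
>   val result = env.fromElements(0.0).iterate(100) { ds =>
>     ds.map(_ + 1.0)
>   }
>   result.print() // sink: triggers execution of the one job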
>
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian’s answer
> on SO [2] would be helpful for understanding these differences. :)
>
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
>
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczy...@gmail.com> wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for going a little off-topic, but I wanted to clarify whether my
> > understanding is right. You said that "Contrary to Spark, a Flink job
> > is executed lazily", however, as I read in the available sources, for
> > example http://spark.apache.org/docs/latest/programming-guide.html,
> > chapter "RDD Operations": "The transformations are only computed when
> > an action requires a result to be returned to the driver program." To
> > my understanding, Spark implements the same lazy execution principle
> > as Flink, that is, the job is only executed when a data
> > sink/action/execute is called, and before that only an execution plan
> > is built. Is that correct, or are there other significant differences
> > between the Spark and Flink lazy execution approaches that I failed to
> > grasp?
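> >
> > For illustration, the behavior the Spark docs describe looks roughly
> > like this in the Scala API (a minimal sketch, assuming an existing
> > SparkContext sc; the data and functions are made up):
> >
> >   val nums = sc.parallelize(1 to 1000) // transformation graph only
> >   val doubled = nums.map(_ * 2)        // still nothing is computed
> >   val total = doubled.reduce(_ + _)    // action: the job actually runs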
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:
> > Hi Sourav,
> > you are right, in Flink the equivalent to an RDD would be a DataSet (or
> a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when
> > ExecutionEnvironment.execute() is called. Only then does Flink build
> > an executable program from the graph of transformations that was built
> > by calling the transformation methods on DataSet. That’s why I called
> > it lazy. The operations will also be automatically parallelized. The
> > parallelism of operations can either be configured in the cluster
> > configuration (conf/flink-conf.yaml), on a per-job basis
> > (ExecutionEnvironment.setParallelism(int)), or per operation, by
> > calling setParallelism(int) on a DataSet.
> >
> > (Above you can always replace DataSet by DataStream; the same
> > explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the
> > file (or the files in a directory) will be parallelized across several
> > worker nodes based on the previously mentioned settings.
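> >
> > As a minimal sketch of the above (the paths and the parallelism values
> > here are invented for illustration):
> >
> >   import org.apache.flink.api.scala._
> >
> >   val env = ExecutionEnvironment.getExecutionEnvironment
> >   env.setParallelism(4) // per-job default, overrides flink-conf.yaml
> >
> >   // Reading the file(s) is itself parallelized across the workers.
> >   val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
> >
> >   val counts = lines
> >     .flatMap(_.toLowerCase.split("\\W+"))
> >     .filter(_.nonEmpty)
> >     .map((_, 1)).setParallelism(8) // per-operation override
> >     .groupBy(0)
> >     .sum(1)
> >
> >   counts.writeAsText("hdfs:///path/to/output")
> >   env.execute("parallelism sketch") // nothing runs before this call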
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
> > Hi,
> >
> > I am new to Flink and am trying to understand some of the basics.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding
> > the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data
> > (val lines: DataSet[String] = env.readTextFile(<some file path and
> > name>)) which may not fit in a single slave node, will that data get
> > automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
> >
>
> Regards,
> Chiwan Park
