Hi Filip,

Spark also executes jobs lazily, but it is slightly different from Flink: Flink 
can lazily execute a whole job, which Spark cannot. One example is an 
iterative job.

In Spark, each step of the iteration is submitted, scheduled as a job, and 
executed, because an action is called at the end of each step. In Flink, 
even if the job contains an iteration, the user submits only one job, and the 
Flink cluster schedules and runs that job once.
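
To make this concrete, here is a minimal sketch against the Flink Scala 
DataSet API (the step function and input values are hypothetical). The whole 
loop is embedded in one job graph, which the cluster schedules once:

import org.apache.flink.api.scala._

object FlinkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val initial: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0)

    // iterate(10) embeds the loop in the job graph itself;
    // all ten passes run inside the single scheduled job.
    val result = initial.iterate(10) { current =>
      current.map(_ * 0.9) // hypothetical step function
    }

    result.print() // the only point where execution is triggered
  }
}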

Because of this difference, in Spark the user must decide more things, such 
as which RDDs to cache or uncache.
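
For contrast, a rough sketch of the same loop in Spark (a plain driver-side 
loop; values and step function are again hypothetical). Each count() triggers 
a separate job, and the caching decisions are left to the user:

import org.apache.spark.{SparkConf, SparkContext}

object SparkIterationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iteration-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    var current = sc.parallelize(Seq(1.0, 2.0, 3.0))
    for (_ <- 1 to 10) {
      val next = current.map(_ * 0.9).cache() // user decides what to cache...
      next.count()        // action: this pass is scheduled and run as its own job
      current.unpersist() // ...and what to evict
      current = next
    }

    println(current.collect().mkString(", "))
    sc.stop()
  }
}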

Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian’s answer on 
Stack Overflow [2] would be helpful to understand this difference. :)

[1]: http://www.slideshare.net/GyulaFra/flink-apachecon
[2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning

> On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczy...@gmail.com> wrote:
> 
> Hi Aljoscha,
> 
> Sorry for a little off-topic, but I wanted to clarify whether my 
> understanding is right. You said that "Contrary to Spark, a Flink job is 
> executed lazily", however as I read in available sources, for example 
> http://spark.apache.org/docs/latest/programming-guide.html, chapter "RDD 
> operations": "The transformations are only computed when an action 
> requires a result to be returned to the driver program." To my understanding, 
> Spark implements the same lazy execution principle as Flink, that is, the job 
> is only executed when a data sink/action/execute is called, and before that 
> only an execution plan is built. Is that correct, or are there other 
> significant differences between Spark's and Flink's lazy execution 
> approaches that I failed to grasp?
> 
> Best regards,
> Filip Łęczycki
> 
> 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:
> Hi Sourav,
> you are right, in Flink the equivalent to an RDD would be a DataSet (or a 
> DataStream if you are working with the streaming API).
> 
> Contrary to Spark, a Flink job is executed lazily when 
> ExecutionEnvironment.execute() is called. Only then does Flink build an 
> executable program from the graph of transformations that was built by 
> calling the transformation methods on DataSet. That’s why I called it lazy. 
> The operations will also be automatically parallelized. The parallelism of 
> operations can either be configured in the cluster configuration 
> (conf/flink-conf.yaml), on a per job basis 
> (ExecutionEnvironment.setParallelism(int)) or per operation, by calling 
> setParallelism(int) on a DataSet.
> 
> (Above you can always replace DataSet by DataStream, the same explanations 
> hold.)
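
A minimal sketch of the three levels above, using the Flink Scala DataSet API 
(the path and parallelism values are placeholders):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4) // per-job default, overrides conf/flink-conf.yaml

val words = env
  .readTextFile("hdfs:///path/to/input") // placeholder path
  .flatMap(_.split("\\s+"))
  .setParallelism(8) // per-operation override for the flatMap

words.print()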
> 
> So, to get back to your question, yes, the operation of reading the file (or 
> files in a directory) will be parallelized to several worker nodes based on 
> the previously mentioned settings.
> 
> Let us know if you need more information.
> 
> Cheers,
> Aljoscha
> 
> On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com> 
> wrote:
> Hi,
> 
> I am new to Flink. Trying to understand some of the basics of Flink.
> 
> What is the equivalent of Spark's RDD in Flink? In my understanding the 
> closest thing is the DataSet API, but I wanted to reconfirm.
> 
> Also, using the DataSet API, if I ingest a large volume of data (val lines : 
> DataSet[String] = env.readTextFile(<some file path and name>)) which may not 
> fit in a single slave node, will that data get automatically distributed 
> across the memory of the other slave nodes?
> 
> Regards,
> Sourav
> 

Regards,
Chiwan Park


