Hi Aljoscha,

Sorry for being a little off-topic, but I wanted to clarify whether my
understanding is right. You said that "Contrary to Spark, a Flink job is
executed lazily". However, the available sources, for example the chapter
"RDD Operations" of
http://spark.apache.org/docs/latest/programming-guide.html, say: "The
transformations are only computed when an action requires a result to be
returned to the driver program." To my understanding, Spark implements the
same lazy execution principle as Flink: the job is only executed when a
data sink/action/execute is called, and before that only an execution plan
is built. Is that correct, or are there other significant differences
between the lazy execution approaches of Spark and Flink that I failed to
grasp?
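
For example, with the Scala APIs of both systems (a minimal sketch of my
understanding; sc stands for an existing SparkContext, env for a Flink
ExecutionEnvironment, and the paths are just placeholders), nothing runs
until the action/sink step:

  // Spark: transformations only build a lineage of RDDs
  val lengths = sc.textFile("input").map(_.length)
  lengths.collect() // the action triggers execution

  // Flink: transformations only build an execution plan
  val flinkLengths = env.readTextFile("input").map(_.length)
  flinkLengths.writeAsText("output")
  env.execute() // execution is triggered here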

Best regards,
Filip Łęczycki

2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:

> Hi Sourav,
> you are right, in Flink the equivalent to an RDD would be a DataSet (or a
> DataStream if you are working with the streaming API).
>
> Contrary to Spark, a Flink job is executed lazily when
> ExecutionEnvironment.execute() is called. Only then does Flink build an
> executable program from the graph of transformations that was built by
> calling the transformation methods on DataSet. That’s why I called it lazy.
> The operations will also be automatically parallelized. The parallelism of
> operations can be configured in the cluster configuration
> (conf/flink-conf.yaml), on a per-job basis
> (ExecutionEnvironment.setParallelism(int)), or per operation, by calling
> setParallelism(int) on a DataSet.
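>
> To make this concrete, here is a minimal sketch using the Scala DataSet
> API (the paths and job name are just placeholders):
>
>   import org.apache.flink.api.scala._
>
>   val env = ExecutionEnvironment.getExecutionEnvironment
>   env.setParallelism(8) // per-job default parallelism
>
>   // nothing runs here, we are only building the plan
>   val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
>   val upper = lines.map(_.toUpperCase).setParallelism(4) // per-operation
>
>   upper.writeAsText("hdfs:///path/to/output")
>   env.execute("example job") // only now is the program built and run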
>
> (Above you can always replace DataSet by DataStream, the same explanations
> hold.)
>
> So, to get back to your question, yes, the operation of reading the file
> (or files in a directory) will be parallelized to several worker nodes
> based on the previously mentioned settings.
>
> Let us know if you need more information.
>
> Cheers,
> Aljoscha
>
> On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am new to Flink. Trying to understand some of the basics of Flink.
>>
>> What is the equivalent of Spark's RDD in Flink? In my understanding the
>> closest thing is the DataSet API, but I wanted to reconfirm.
>>
>> Also, using the DataSet API, if I ingest a large volume of data (val lines :
>> DataSet[String] = env.readTextFile(<some file path and name>)), which may
>> not fit in a single slave node, will that data get automatically distributed
>> across the memory of the other slave nodes?
>>
>> Regards,
>> Sourav
>>
>
