Hi Sourav,
you are right, in Flink the equivalent to an RDD would be a DataSet (or a
DataStream if you are working with the streaming API).

Contrary to Spark, where execution is triggered by actions such as count()
or collect(), a Flink job only runs when ExecutionEnvironment.execute() is
called. Only then does Flink build an executable program from the graph of
transformations that was assembled by calling the transformation methods on
DataSet. That's why I called it lazy.
The operations will also be automatically parallelized. The parallelism of
operations can either be configured in the cluster configuration
(conf/flink-conf.yaml), on a per job basis
(ExecutionEnvironment.setParallelism(int)) or per operation, by calling
setParallelism(int) on a DataSet.
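
As a sketch of those three levels (the input and output paths here are just
placeholders, and this assumes the Scala DataSet API):

```scala
import org.apache.flink.api.scala._

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Per-job default parallelism (overrides the cluster default
    // from conf/flink-conf.yaml)
    env.setParallelism(4)

    val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")

    // Per-operation parallelism: this map runs with 8 parallel instances
    val upper = lines.map(_.toUpperCase).setParallelism(8)

    upper.writeAsText("hdfs:///path/to/output")

    // Nothing runs until execute() is called
    env.execute("parallelism example")
  }
}
```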

(Above you can always replace DataSet with DataStream; the same
explanations hold.)

So, to get back to your question, yes, the operation of reading the file
(or files in a directory) will be parallelized to several worker nodes
based on the previously mentioned settings.
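
For example (again a sketch with a placeholder path): the file source is
divided into input splits, and each parallel source instance reads its own
subset of the splits, so no single node has to hold the whole input.

```scala
import org.apache.flink.api.scala._

object DistributedRead {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Each parallel source instance reads a subset of the input splits
    val lines: DataSet[String] = env.readTextFile("hdfs:///data/big-dir")

    // Downstream operations work directly on the partitioned data
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    counts.writeAsText("hdfs:///data/word-counts")
    env.execute("distributed read example")
  }
}
```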

Let us know if you need more information.

Cheers,
Aljoscha

On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com>
wrote:

> Hi,
>
> I am new to Flink. Trying to understand some of the basics of Flink.
>
> What is the equivalent of Spark's RDD in Flink ? In my understanding the
> closest thing is the DataSet API. But wanted to reconfirm.
>
> Also using DataSet API if I ingest a large volume of data (val lines :
> DataSet[String] = env.readTextFile(<some file path and name>)), which may
> not fit in single slave node, will that data get automatically distributed
> in the memory of other slave nodes ?
>
> Regards,
> Sourav
>
