Hi Aljoscha,

Sorry for going a little off-topic, but I wanted to clarify whether my understanding is right. You said that "Contrary to Spark, a Flink job is executed lazily"; however, as I read in the available sources, for example the "RDD Operations" chapter of http://spark.apache.org/docs/latest/programming-guide.html: "The transformations are only computed when an action requires a result to be returned to the driver program." To my understanding, Spark implements the same lazy execution principle as Flink: the job is only executed when a data sink/action/execute() is called, and before that only an execution plan is built. Is that correct, or are there other significant differences between the Spark and Flink lazy execution approaches that I failed to grasp?
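To make my point concrete, here is a minimal sketch of how I picture the laziness on both sides (two independent snippets assuming the Scala APIs; the paths, the local master, and the job names are just placeholders):

// Spark: transformations only record lineage; nothing runs until an action.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-spark").setMaster("local[*]"))
val rddLines = sc.textFile("/tmp/input.txt")   // lazy: only the lineage is recorded
val rddLengths = rddLines.map(_.length)        // lazy: still just the plan
val total = rddLengths.reduce(_ + _)           // action: the job actually runs here

// Flink DataSet API: transformations build a plan; execute() triggers the job.
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val dsLines: DataSet[String] = env.readTextFile("/tmp/input.txt") // plan only
val dsLengths = dsLines.map(_.length)                             // plan only
dsLengths.writeAsText("/tmp/output")                              // sink added to the plan
env.execute("lazy-flink")                                         // the job actually runs here

If I read the docs correctly, the only visible difference is what triggers the run: an action in Spark versus an explicit execute() call in Flink.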
Best regards,
Filip Łęczycki

2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:

> Hi Sourav,
> you are right, in Flink the equivalent to an RDD would be a DataSet (or a
> DataStream if you are working with the streaming API).
>
> Contrary to Spark, a Flink job is executed lazily when
> ExecutionEnvironment.execute() is called. Only then does Flink build an
> executable program from the graph of transformations that was built by
> calling the transformation methods on DataSet. That's why I called it lazy.
> The operations will also be automatically parallelized. The parallelism of
> operations can either be configured in the cluster configuration
> (conf/flink-conf.yaml), on a per-job basis
> (ExecutionEnvironment.setParallelism(int)), or per operation, by calling
> setParallelism(int) on a DataSet.
>
> (Above you can always replace DataSet by DataStream; the same explanations
> hold.)
>
> So, to get back to your question: yes, the operation of reading the file
> (or files in a directory) will be parallelized across several worker nodes
> based on the previously mentioned settings.
>
> Let us know if you need more information.
>
> Cheers,
> Aljoscha
>
> On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumde...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am new to Flink and trying to understand some of its basics.
>>
>> What is the equivalent of Spark's RDD in Flink? In my understanding the
>> closest thing is the DataSet API, but I wanted to reconfirm.
>>
>> Also, using the DataSet API, if I ingest a large volume of data (val lines :
>> DataSet[String] = env.readTextFile(<some file path and name>)), which may
>> not fit in a single slave node, will that data get automatically distributed
>> in the memory of the other slave nodes?
>>
>> Regards,
>> Sourav
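For completeness, here is a minimal sketch of the three levels of parallelism configuration described above, from broadest to narrowest (the config key name, paths, and job name are my assumptions, not taken from the thread):

// Level 1: cluster-wide default, set in conf/flink-conf.yaml, e.g.
//   parallelism.default: 4     (key name assumed for this Flink version)

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Level 2: per-job default, overrides the cluster-wide setting for this job
env.setParallelism(8)

val lines: DataSet[String] = env.readTextFile("hdfs:///tmp/input")

// Level 3: per-operation, overrides the job default for this map only
val upper = lines
  .map(_.toUpperCase)
  .setParallelism(4)

upper.writeAsText("hdfs:///tmp/output")
env.execute("parallelism-example")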