Hi,

But can't this behaviour cause a lot of network activity? Is there any roadmap or plan to change this behaviour?

On Apr 26, 2016 7:06 PM, "Fabian Hueske" <fhue...@gmail.com> wrote:
> Hi,
>
> Flink starts four tasks and then lazily assigns input splits to these tasks
> with locality preference. So each task may consume more than one split.
> This is different from Hadoop MapReduce or Spark, which schedule a new task
> for each input split.
> In your case, the four tasks would be scheduled on four of the 40 machines
> and most of the splits would be read remotely.
>
> Best, Fabian
>
>
> 2016-04-26 16:59 GMT+02:00 CPC <acha...@gmail.com>:
>
> > Hi,
> >
> > I looked at some scheduler documentation but could not find an answer to
> > my question. My question is: suppose I have a big file on a 40-node
> > Hadoop cluster, and since it is a big file, every node has at least one
> > chunk of the file. If I write a Flink job to filter the file and the job
> > has a parallelism of 4 (less than 40), how does data locality work? Do
> > some tasks read some chunks from remote nodes? Or does the scheduler
> > schedule tasks in a way that keeps the parallelism at 4 but places tasks
> > on every node?
> >
> > Regards
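
For reference, here is a minimal sketch of the kind of job being discussed, assuming the Flink 1.x DataSet API; the HDFS paths and the filter predicate are hypothetical. With setParallelism(4), Flink starts only four filter tasks and lazily hands each HDFS split to whichever task is free, preferring local splits where possible, so one task can end up consuming many splits:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class FilterJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Only four parallel tasks are started, even if the file has 40+ HDFS
        // blocks; each task may consume several splits, many of them remote.
        env.setParallelism(4);

        DataSet<String> lines = env.readTextFile("hdfs:///path/to/big-file"); // hypothetical path

        lines.filter(line -> line.contains("ERROR"))     // hypothetical predicate
             .writeAsText("hdfs:///path/to/output");     // hypothetical path

        env.execute("Filter with parallelism 4");
    }
}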