Hi,

yes, that can cause network traffic.
AFAIK, there are no plans to change this behavior.
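For illustration, the lazy split assignment with locality preference described in the quoted reply below can be sketched roughly like this. This is a hypothetical simulation, not Flink's actual scheduler code; all names are made up:

```python
# Sketch (NOT Flink's scheduler): simulate lazy input split assignment
# with locality preference. 40 splits, each stored on a distinct host,
# are consumed by 4 tasks pinned to 4 of those hosts.

def assign_splits(task_hosts, split_hosts):
    """Each task repeatedly requests the next split; a split stored on
    the task's own host is preferred, otherwise a remote one is taken."""
    remaining = list(split_hosts)
    assignments = {h: [] for h in task_hosts}
    turn = 0
    while remaining:
        task = task_hosts[turn % len(task_hosts)]
        # locality preference: pick a local split if one is still left
        local = [s for s in remaining if s == task]
        split = local[0] if local else remaining[0]
        remaining.remove(split)
        assignments[task].append(split)
        turn += 1
    return assignments

task_hosts = [f"host{i}" for i in range(4)]    # 4 parallel tasks
split_hosts = [f"host{i}" for i in range(40)]  # one split per node
assignments = assign_splits(task_hosts, split_hosts)

remote = sum(s != t for t, splits in assignments.items() for s in splits)
print(remote)  # 36 of the 40 splits are read remotely
```

With parallelism 4 on a 40-node cluster, only the 4 splits local to the task hosts are read locally; the remaining 36 are read over the network, which is the traffic discussed in this thread.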

Best, Fabian

2016-04-26 18:17 GMT+02:00 CPC <acha...@gmail.com>:

> Hi
>
> But can't this behaviour cause a lot of network activity? Is there any
> roadmap or plan to change this behaviour?
> On Apr 26, 2016 7:06 PM, "Fabian Hueske" <fhue...@gmail.com> wrote:
>
> > Hi,
> >
> > Flink starts four tasks and then lazily assigns input splits to these
> > tasks with locality preference, so each task may consume more than one
> > split. This is different from Hadoop MapReduce or Spark, which schedule
> > a new task for each input split.
> > In your case, the four tasks would be scheduled on four of the 40
> > machines and most of the splits would be read remotely.
> >
> > Best, Fabian
> >
> >
> > 2016-04-26 16:59 GMT+02:00 CPC <acha...@gmail.com>:
> >
> > > Hi,
> > >
> > > I looked at some scheduler documentation but could not find an answer to
> > > my question. Suppose that I have a big file on a 40-node Hadoop cluster
> > > and, since it is a big file, every node has at least one chunk of it. If
> > > I write a Flink job to filter the file and the job has a parallelism of
> > > 4 (less than 40), how does data locality work? Do some tasks read some
> > > chunks from remote nodes? Or does the scheduler keep the maximum
> > > parallelism at 4 but schedule tasks on every node?
> > >
> > > Regards
> > >
> >
>