Thanks a lot Zhu Zhu for such an elaborated explanation. On Mon, 21 Oct 2019 at 08:33, Zhu Zhu <reed...@gmail.com> wrote:
> Sources of batch jobs process InputSplit. Each InputSplit can be a file or > a file block according to the FileSystem(for HDFS it is file block). > Sources need to retrieve InputSplits to process from InputSplitAssigner at > JM. > In this way, the assigning of InputSplit to source tasks are possible to > take the InputSplit location and task location into consideration to > support input locality. > > To enable this input locality support, it is required to use a InputFormat > which leverages LocatableInputSplitAssigner and LocatableInputSplit, e.g. > FileInputFormat, HadoopInputFormat, etc. > The file reading source interfaces provided in ExecutionEnvironment, like > #readTextFile and #readFile, use FileInputFormat, so the input locality is > supported by default. > > Thanks, > Zhu Zhu > > Pritam Sadhukhan <sadhukhan.pri...@gmail.com> 于2019年10月21日周一 上午10:17写道: > >> Hi Zhu Zhu, >> >> Thanks for your detailed answer. >> Can you please help me to understand how flink task process the data >> locally on data nodes first? >> I want to understand how flink determines the processing to be done at >> the data nodes? >> >> Regards, >> Pritam. >> >> On Sat, 19 Oct 2019 at 08:16, Zhu Zhu <reed...@gmail.com> wrote: >> >>> Hi Pratam, >>> >>> Flink does not deploy tasks to certain nodes according to source data >>> locations. >>> Instead, it will let a task process local input splits (data on the same >>> node) first. >>> So if your parallelism is large enough to distribute on all the data >>> nodes, most data can be processed locally. >>> >>> Thanks, >>> Zhu Zhu >>> >>> Pritam Sadhukhan <sadhukhan.pri...@gmail.com> 于2019年10月18日周五 上午10:59写道: >>> >>>> Hi, >>>> >>>> I am trying to process data stored on HDFS using flink batch jobs. >>>> Our data is splitted into 16 data nodes. >>>> >>>> I am curious to know how data will be pulled from the data nodes with >>>> the same number of parallelism set as the data split on HDFS i.e. 16. >>>> >>>> Is the flink task being executed locally on the data node server or it >>>> will happen in the flink nodes where data will be pulled remotely? >>>> >>>> Any help will be appreciated. >>>> >>>> Regards, >>>> Pritam. >>>> >>>