Re: Data processing with HDFS local or remote

Pritam Sadhukhan Mon, 21 Oct 2019 08:41:55 -0700

Thanks a lot, Zhu Zhu, for such an elaborate explanation.

On Mon, 21 Oct 2019 at 08:33, Zhu Zhu <reed...@gmail.com> wrote:


> Sources of batch jobs process InputSplits. Each InputSplit can be a file or
> a file block, depending on the FileSystem (for HDFS it is a file block).
> Sources need to retrieve the InputSplits to process from the
> InputSplitAssigner at the JM (JobManager).
> This way, the assignment of InputSplits to source tasks can take both the
> InputSplit location and the task location into consideration to support
> input locality.
>
> To enable this input locality support, you need to use an InputFormat
> that leverages LocatableInputSplitAssigner and LocatableInputSplit, e.g.
> FileInputFormat, HadoopInputFormat, etc.
> The file-reading source methods provided by ExecutionEnvironment, such as
> #readTextFile and #readFile, use FileInputFormat, so input locality is
> supported by default.
>
> Thanks,
> Zhu Zhu
>
> Pritam Sadhukhan <sadhukhan.pri...@gmail.com> wrote on Mon, 21 Oct 2019 at 10:17 AM:
>
>> Hi Zhu Zhu,
>>
>> Thanks for your detailed answer.
>> Can you please help me understand how a Flink task processes the data
>> locally on the data nodes first?
>> I want to understand how Flink determines which processing is to be done
>> on the data nodes.
>>
>> Regards,
>> Pritam.
>>
>> On Sat, 19 Oct 2019 at 08:16, Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Hi Pritam,
>>>
>>> Flink does not deploy tasks to particular nodes according to source data
>>> locations.
>>> Instead, it lets a task process local input splits (data on the same
>>> node) first.
>>> So if your parallelism is large enough for tasks to be distributed across
>>> all the data nodes, most data can be processed locally.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Pritam Sadhukhan <sadhukhan.pri...@gmail.com> wrote on Fri, 18 Oct 2019 at 10:59 AM:
>>>
>>>> Hi,
>>>>
>>>> I am trying to process data stored on HDFS using Flink batch jobs.
>>>> Our data is split across 16 data nodes.
>>>>
>>>> I am curious to know how data will be pulled from the data nodes if the
>>>> parallelism is set to the same number as the data splits on HDFS, i.e. 16.
>>>>
>>>> Is the Flink task executed locally on the data node server, or does it
>>>> run on the Flink nodes, with the data pulled remotely?
>>>>
>>>> Any help will be appreciated.
>>>>
>>>> Regards,
>>>> Pritam.
>>>>
>>>
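
For readers of this thread, the "local input splits first" behaviour described in the replies can be illustrated with a small, self-contained sketch. It is not Flink's LocatableInputSplitAssigner: the class LocalitySketch, the Split type, the nextSplit method and the host names are all hypothetical, and the real assigner may differ in how it picks among candidates. The sketch only shows the idea that a task asking for work is handed a split stored on its own host when one exists, and otherwise falls back to a remote split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of locality-aware split assignment.
// Not Flink's LocatableInputSplitAssigner; all names here are made up.
class LocalitySketch {

    // One input split (for HDFS, one file block) and the hosts holding a replica.
    static final class Split {
        final int id;
        final List<String> hosts;

        Split(int id, List<String> hosts) {
            this.id = id;
            this.hosts = hosts;
        }
    }

    private final List<Split> unassigned;

    LocalitySketch(List<Split> splits) {
        this.unassigned = new ArrayList<>(splits);
    }

    // Called when a source task running on taskHost requests its next split.
    synchronized Optional<Split> nextSplit(String taskHost) {
        // First choice: a split with a replica on the requesting task's host.
        for (Iterator<Split> it = unassigned.iterator(); it.hasNext();) {
            Split split = it.next();
            if (split.hosts.contains(taskHost)) {
                it.remove();
                return Optional.of(split);
            }
        }
        // No local split left: hand out any remaining one (a remote read).
        if (unassigned.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(unassigned.remove(0));
    }

    public static void main(String[] args) {
        LocalitySketch assigner = new LocalitySketch(Arrays.asList(
                new Split(0, Arrays.asList("datanode-1")),
                new Split(1, Arrays.asList("datanode-2")),
                new Split(2, Arrays.asList("datanode-1"))));

        // A task on datanode-2 keeps asking until no splits remain.
        Optional<Split> next;
        while ((next = assigner.nextSplit("datanode-2")).isPresent()) {
            Split split = next.get();
            String kind = split.hosts.contains("datanode-2") ? "local" : "remote";
            System.out.println("split " + split.id + " -> " + kind);
        }
    }
}
```

In this toy model a single task ends up reading the other node's splits remotely once its local ones run out, which matches the point made in the thread: locality depends on having tasks running on the nodes that hold the data, so a parallelism large enough to cover all data nodes lets most splits be served locally.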
