Such a decision would require distribution statistics, preferably statistics on the actual data that would (or would not) be rebalanced. These statistics would only become available while a job is executing, and a component that modifies a running program is very difficult to implement.
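For reference, the manual rebalance discussed further down this thread is a single call on the DataSet. A minimal sketch, assuming the Java DataSet API and a hypothetical HDFS input path; the upper-casing map merely stands in for the expensive per-record work:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class RebalanceSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input path; splits are requested lazily by the source subtasks.
        DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

        // rebalance() redistributes the records round-robin across all parallel
        // subtasks, so the expensive work is spread evenly even if one source
        // task happened to read most of the splits.
        DataSet<String> result = lines
                .rebalance()
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String line) {
                        // Stand-in for IO- or CPU-heavy per-record processing.
                        return line.toUpperCase();
                    }
                });

        result.print();
    }
}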
Best,
Fabian

On Mon, Apr 29, 2019 at 3:30 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Thanks Fabian, that's much clearer. Many times you don't know whether to
> rebalance a dataset or not because it depends on the specific use case and
> dataset distribution.
> An automatic way of choosing whether a DataSet could benefit from a
> rebalance or not would be VERY nice (at least for batch), but I fear this
> would be very hard to implement. Am I wrong?
>
> On Mon, Apr 29, 2019 at 3:10 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> These types of race conditions are not failure cases, so no exception is
>> thrown.
>> It only means that a single source task reads all (or most of) the
>> splits and no splits are left for the other tasks.
>> This can be a problem if a record represents a large amount of IO or an
>> intensive computation, as the records might not be properly distributed.
>> In that case you'd need to manually rebalance the partitions of a DataSet.
>>
>> Fabian
>>
>> On Mon, Apr 29, 2019 at 2:42 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Hi Fabian, I wasn't aware that "race-conditions may happen if your
>>> splits are very small as the first data source task might rapidly request
>>> and process all splits before the other source tasks do their first
>>> request". What happens exactly when a race condition arises? Is this
>>> exception internally handled by Flink or not?
>>>
>>> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The method that I described in the SO answer is still implemented in
>>>> Flink.
>>>> Flink tries to assign splits to tasks that run on local TMs.
>>>> However, files are not split per line (this would be horribly
>>>> inefficient) but in larger chunks, depending on the number of subtasks
>>>> (and, in the case of HDFS, the file block size).
>>>>
>>>> Best, Fabian
>>>>
>>>> On Sun, Apr 28, 2019 at 6:48 PM, Soheil Pourbafrani <soheil.i...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I want to know exactly how Flink reads data in both cases: a file in
>>>>> the local filesystem and a file on a distributed file system.
>>>>>
>>>>> When reading data from the local file system, I guess every line of the
>>>>> file will be read by a slot (according to the job parallelism) for
>>>>> applying the map logic.
>>>>>
>>>>> For reading from HDFS, I read this answer
>>>>> <https://stackoverflow.com/a/39153402/8110607> by Fabian Hueske
>>>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to
>>>>> know whether that is still the Flink strategy for reading from a
>>>>> distributed file system.
>>>>>
>>>>> Thanks
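To connect this back to the original question at the bottom of the thread: the same readTextFile call is used for local files and for HDFS, and the path scheme selects the filesystem; the split assignment described above applies in both cases. A minimal sketch, assuming the Java DataSet API and hypothetical paths:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFileSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Local filesystem: the file is split into byte ranges (not lines) and
        // each parallel source subtask requests splits from the JobManager.
        DataSet<String> localLines = env.readTextFile("file:///tmp/input.txt");

        // HDFS: splits follow the file block size, and Flink prefers to assign
        // a split to a source task running on a TaskManager local to the block.
        DataSet<String> hdfsLines = env.readTextFile("hdfs:///data/input.txt");

        localLines.union(hdfsLines).print();
    }
}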