Such a decision would require distribution statistics, preferably statistics on the actual data that would (or would not) be rebalanced. These statistics would only become available while a job is executing, and a component that modifies a running program is very difficult to implement.
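For reference, the manual rebalance discussed further down this thread is a single call on the DataSet. A minimal sketch, assuming the Java DataSet API and a hypothetical HDFS input path; the upper-casing map merely stands in for the expensive per-record work:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class RebalanceSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input path; splits are requested lazily by the source subtasks.
        DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

        // rebalance() redistributes the records round-robin across all parallel
        // subtasks, so the expensive work is spread evenly even if one source
        // task happened to read most of the splits.
        DataSet<String> result = lines
                .rebalance()
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String line) {
                        // Stand-in for IO- or CPU-heavy per-record processing.
                        return line.toUpperCase();
                    }
                });

        result.print();
    }
}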
Best,
Fabian

On Mon, Apr 29, 2019 at 3:30 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Thanks Fabian, that's much clearer. Many times you don't know whether to
> rebalance a dataset or not because it depends on the specific use case and
> dataset distribution.
> An automatic way of choosing whether a DataSet could benefit from a
> rebalance or not would be VERY nice (at least for batch), but I fear this
> would be very hard to implement. Am I wrong?
>
> On Mon, Apr 29, 2019 at 3:10 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> These types of race conditions are not failure cases, so no exception is
>> thrown.
>> It only means that a single source task reads all (or most of) the
>> splits and no splits are left for the other tasks.
>> This can be a problem if a record represents a large amount of IO or an
>> intensive computation, as the records might not be properly distributed.
>> In that case you'd need to manually rebalance the partitions of a DataSet.
>>
>> Fabian
>>
>> On Mon, Apr 29, 2019 at 2:42 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Hi Fabian, I wasn't aware that "race-conditions may happen if your
>>> splits are very small as the first data source task might rapidly request
>>> and process all splits before the other source tasks do their first
>>> request". What happens exactly when a race condition arises? Is this
>>> exception internally handled by Flink or not?
>>>
>>> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The method that I described in the SO answer is still implemented in
>>>> Flink.
>>>> Flink tries to assign splits to tasks that run on local TMs.
>>>> However, files are not split per line (this would be horribly
>>>> inefficient) but in larger chunks, depending on the number of subtasks
>>>> (and, in the case of HDFS, the file block size).
>>>>
>>>> Best, Fabian
>>>>
>>>> On Sun, Apr 28, 2019 at 6:48 PM, Soheil Pourbafrani <soheil.i...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I want to know exactly how Flink reads data in both cases: a file in
>>>>> the local filesystem and a file on a distributed file system.
>>>>>
>>>>> When reading data from the local file system, I guess every line of the
>>>>> file will be read by a slot (according to the job parallelism) for
>>>>> applying the map logic.
>>>>>
>>>>> For reading from HDFS, I read this answer
>>>>> <https://stackoverflow.com/a/39153402/8110607> by Fabian Hueske
>>>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to
>>>>> know whether that is still the Flink strategy for reading from a
>>>>> distributed file system.
>>>>>
>>>>> Thanks
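To connect this back to the original question at the bottom of the thread: the same readTextFile call is used for local files and for HDFS, and the path scheme selects the filesystem; the split assignment described above applies in both cases. A minimal sketch, assuming the Java DataSet API and hypothetical paths:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFileSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Local filesystem: the file is split into byte ranges (not lines) and
        // each parallel source subtask requests splits from the JobManager.
        DataSet<String> localLines = env.readTextFile("file:///tmp/input.txt");

        // HDFS: splits follow the file block size, and Flink prefers to assign
        // a split to a source task running on a TaskManager local to the block.
        DataSet<String> hdfsLines = env.readTextFile("hdfs:///data/input.txt");

        localLines.union(hdfsLines).print();
    }
}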