Such a decision would require some distribution statistics, preferably
stats on the actual data that needs to be rebalanced or not.
This data would only be available while a job is executed and a component
that changes a running program is very difficult to implement.
Best, Fabian
Am Mo., 29. Ap
Thanks Fabian, that's more clear..many times you don't know when to
rebalance or not a dataset because it depends on the specific use case and
dataset distribution.
An automatic way of choosing whether a Dataset could benefit from a
rebalance or not could be VERY nice (at least for batch) but I fea
Hi Flavio,
These typos of race conditions are not failure cases, so no exception is
thrown.
It only means that a single source tasks reads all (or most of the) splits
and no splits are left for the other tasks.
This can be a problem if a record represents a large amount of IO or an
intensive compu
Hi Fabian, I wasn't aware that "race-conditions may happen if your splits
are very small as the first data source task might rapidly request and
process all splits before the other source tasks do their first request".
What happens exactly when a race-condition arise? Is this exception
internally
Hi,
The method that I described in the SO answer is still implemented in Flink.
Flink tries to assign splits to tasks that run on local TMs.
However, files are not split per line (this would be horribly inefficient)
but in larger chunks depending on the number of subtasks (and in case of
HDFS the