[ https://issues.apache.org/jira/browse/FLINK-36576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lei Yang updated FLINK-36576: ----------------------------- Description: Currently, the DefaultVertexParallelismAndInputInfosDecider is able to implement a balanced distribution algorithm based on the amount of data and the number of subpartitions, however it also has some limitations: # Currently, Decider selects the data distribution algorithm via the AllToAll or Pointwise attribute of the input, which limits the ability of the operator to dynamically modify the data distribution algorithm. # Doesn't support data volume-based balanced distribution for Pointwise inputs. # For AllToAll type inputs, it does not support splitting the data corresponding to the specific key, i.e., it cannot solve the data skewing caused by single-key hotspot. For that we plan to introduce the following improvements: # Introducing InterInputsKeyCorrelation and IntraInputKeyCorrelation to the input characterisation which allows the operator to flexibly choose the data balanced distribution algorithm. # Introducing a data volume-based data balanced distribution algorithm for Pointwise inputs # Introducing the ability to split data corresponding to the specific key to optimise AllToAll's data volume-based data balancing distribution algorithm. was: Currently, the DefaultVertexParallelismAndInputInfosDecider is able to implement a balanced distribution algorithm based on the amount of data and the number of subpartitions, however it also has some limitations: # Currently, Decider selects the data distribution algorithm via the AllToAll or Pointwise attribute of the input, which limits the ability of the operator to dynamically modify the data distribution algorithm. # Doesn't support data volume-based balanced distribution for Pointwise inputs. # For AllToAll type inputs, it does not support splitting the data corresponding to the specific key, i.e., it cannot solve the data skewing caused by single-key hotspot. For that we plan to introduce the following improvements: # Introducing InterInputsKeyCorrelation and IntraInputKeyCorrelation to the input characterisation which allows the operator to flexibly choose the data balanced distribution algorithm. # Introducing a data volume-based data balanced distribution algorithm for Pointwise inputs # Introducing the ability to split data corresponding to the specific key to optimise AllToAll's data volume-based data balancing distribution algorithm. > Improving amount-based data balancing distribution algorithm for > DefaultVertexParallelismAndInputInfosDecider > ------------------------------------------------------------------------------------------------------------- > > Key: FLINK-36576 > URL: https://issues.apache.org/jira/browse/FLINK-36576 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination > Reporter: Lei Yang > Priority: Major > > Currently, the DefaultVertexParallelismAndInputInfosDecider is able to > implement a balanced distribution algorithm based on the amount of data and > the number of subpartitions, however it also has some limitations: > # Currently, Decider selects the data distribution algorithm via the > AllToAll or Pointwise attribute of the input, which limits the ability of the > operator to dynamically modify the data distribution algorithm. > # Doesn't support data volume-based balanced distribution for Pointwise > inputs. > # For AllToAll type inputs, it does not support splitting the data > corresponding to the specific key, i.e., it cannot solve the data skewing > caused by single-key hotspot. > For that we plan to introduce the following improvements: > # Introducing InterInputsKeyCorrelation and IntraInputKeyCorrelation to the > input characterisation which allows the operator to flexibly choose the data > balanced distribution algorithm. > # Introducing a data volume-based data balanced distribution algorithm for > Pointwise inputs > # Introducing the ability to split data corresponding to the specific key to > optimise AllToAll's data volume-based data balancing distribution algorithm. -- This message was sent by Atlassian Jira (v8.20.10#820010)