[ https://issues.apache.org/jira/browse/FLINK-12002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806500#comment-16806500 ]
BoWang commented on FLINK-12002: -------------------------------- Hi all, design doc is ready, any comments are appreciated: https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing > Adaptive Parallelism of Job Vertex Execution > -------------------------------------------- > > Key: FLINK-12002 > URL: https://issues.apache.org/jira/browse/FLINK-12002 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination > Reporter: ryantaocer > Assignee: BoWang > Priority: Major > > In Flink the parallelism of job is a pre-specified parameter, which is > usually an empirical value and thus might not be optimal for both performance > and resource depending on the amount of data processed in each task. > Furthermore, a fixed parallelism cannot scale to varying data size common in > production cluster where we may not often change configurations. > We propose to determine the job parallelism adaptive to the actual total > input data size and an ideal data size processed by each task. The ideal size > is pre-specified according to the properties of the operator such as the > preparation overhead compared with data processing time. > Our basic idea of "split and merge" is to make the data dispatched evenly > acorss Reducers by spliting and/or merging data buckets produced by Map. The > data density skew problem is not covered. This kind of parallelism adjustment > doesn't have data correctness issue since it doesnt' break the condition that > data with the same key is processed by a single task. We determine the > proper parallelism of Reduce during scheduling before its actual running and > after its input been ready though not necessary total input data. In such > context, apdative parallelism is a better name. This scheduling improvement > we think can benefit both batch and stream as long as we can obtain some > clues about the input data. > Design doc: > https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)