Hi Bo Wang, thanks for proposing this design document. I think it is an interesting idea to improve Flink's execution efficiency.
At the moment, the community is actively working on making Flink's scheduler pluggable. Once this is possible, we could try this feature out by implementing a scheduler which supports adaptive parallelism without affecting the existing code. I think this would be a nice approach to further evaluate and benchmark the implications of such a strategy. What do you think? Cheers, Till On Mon, Apr 8, 2019 at 10:28 AM Bo WANG <wbeaglewatc...@gmail.com> wrote: > Hi all, > In distribution computing system, execution parallelism is vital for both > resource efficiency and execution performance. In Flink, execution > parallelism is a pre-specified parameter, which is usually an empirical > value and thus might not be optimal on the various amount of data processed > by each task. > > Furthermore, a fixed parallelism cannot scale to varying data size, which > is common in production cluster, since we may not frequently change the > cluster configuration. > > Thus, we propose adaptively determine the execution parallelism of each > vertex at runtime based on the actual input data size and an ideal data > size processed by each task. The ideal data size is a pre-specified > parameter according to the property of the operator. > > The design doc is ready: > > https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing > , > any comments are highly appreciated. >