Hi all, I ever launched the discussion of "Proposal of external shuffle service" before and received very helpful feedbacks, especially with @Andrey Zagrebin's in-depth communication offline. Based on @Till Rohrmann's suggestion, I launch this separate thread again to summarize the current progress and welcome further reviews.
You might encounter such problems before: 1. When the task finishes, the TM is released by ResourceManager soon. But the internal partition has not been transfered completed or not consumed by downstream side yet, which would cause unnecessary failover. 2. When the partition is transfered completed via network, it would be removed from TM immediately. If there are any exceptions during transport or downstream's consumption, the upstream task has to be restarted again to re-produce the data. 3. You may not satisfy with the performance of one-partition-one-file mode for batch jobs in some scenarios, or you want to realize some external shuffle service deployed on YARN/K8S,etc. Or the transport layer is not limited by curernt netty, such as via DFS or RDMA etc. But you might find it is difficult to make changes within current architecture. All the above concerns would be covered by proposed pluggable shuffle manager architecture. If you are interested or wish more details, please refer to the google doc [1] or FLIP [2]. There are also some sub-tasks under going within the umbrella jira [3] . [1] https://docs.google.com/document/d/1l7yIVNH3HATP4BnjEOZFkO2CaHf1sVn_DSxS2llmkd8/edit?usp=sharing [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Manager [3] https://issues.apache.org/jira/browse/FLINK-10653 Best, Zhijiang