Re: [DISCUSS] Proposal of external shuffle service

Jin Sun Wed, 31 Oct 2018 15:43:04 -0700

Thanks Zhijiang for the proposal. I like the idea of external shuffle service, 
have left some comments on the document.


> On Oct 31, 2018, at 2:26 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> 
> Thanks for the update Zhijiang! The community is currently quite busy with
> the next Flink release. I hope that we can finish the release in two weeks.
> After that people will become more responsive again.
> 
> Cheers,
> Till
> 
> On Wed, Oct 31, 2018 at 7:49 AM zhijiang <wangzhijiang...@aliyun.com> wrote:
> 
>> I already created the umbrella jira [1] for this improvement, and attched
>> the design doc [2] in this jira.
>> 
>> Welcome for further discussion about the details.
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-10653
>> [2]
>> https://docs.google.com/document/d/1Jb0Mf46ace-6cLRQxJzo6VNQQVxn3hwf9Zqmv5pcb34/edit?usp=sharing
>> 
>> 
>> <https://docs.google.com/document/d/1Jb0Mf46ace-6cLRQxJzo6VNQQVxn3hwf9Zqmv5pcb34/edit?usp=sharing>
>> Best,
>> Zhijiang
>> 
>> ------------------------------------------------------------------
>> 发件人：Zhijiang(wangzhijiang999) <wangzhijiang...@aliyun.com.INVALID>
>> 发送时间：2018年9月11日(星期二) 15:21
>> 收件人：dev <dev@flink.apache.org>
>> 抄 送：dev <dev@flink.apache.org>
>> 主 题：回复：[DISCUSS] Proposal of external shuffle service
>> 
>> Many thanks Till!
>> 
>> 
>> I would create a JIRA for this feature and design a document attched with it.
>> I will let you know after ready! :)
>> 
>> Best,
>> Zhijiang
>> 
>> 
>> ------------------------------------------------------------------
>> 发件人：Till Rohrmann <trohrm...@apache.org>
>> 发送时间：2018年9月7日(星期五) 22:01
>> 收件人：Zhijiang(wangzhijiang999) <wangzhijiang...@aliyun.com>
>> 抄 送：dev <dev@flink.apache.org>
>> 主 题：Re: [DISCUSS] Proposal of external shuffle service
>> 
>> The rough plan sounds good Zhijiang. I think we should continue with what
>> you've proposed: Open a JIRA issue and creating a design document which
>> outlines the required changes a little bit more in detail. Once this is
>> done, we should link the design document in the JIRA issue and post it here
>> for further discussion.
>> 
>> Cheers,
>> Till
>> 
>> On Wed, Aug 29, 2018 at 6:04 PM Zhijiang(wangzhijiang999) <
>> wangzhijiang...@aliyun.com> wrote:
>> 
>>> Glad to receive your positive feedbacks Till!
>>> 
>>> Actually our motivation is to support batch job well as you mentioned.
>>> 
>>> For output level, flink already has the Subpartition abstraction(writer),
>>> and currently there are PipelinedSubpartition(memory output) and
>>> SpillableSubpartition(one-sp-one-file output) implementations. We can
>>> extend this abstraction to realize other persistent outputs (e.g.
>>> sort-merge-file).
>>> 
>>> For transport level(shuffle service), the current SubpartitionView
>>> abstraction(reader) seems as the brige linked with the output level, then
>> 
>>> the view can understand and read the different output formats. The current
>>> NetworkEnvironment seems take the role of internal shuffle service in
>>> TaskManager and the transport server is realized by netty inside. This
>> 
>>> component can also be started in other external containers like NodeManager
>>> of yarn to take the role of external shuffle service. Further we can
>> 
>>> abstract to extend the shuffle service for transporting outputs by http or
>> 
>>> rdma instead of current netty.  This abstraction should provide the way for
>>> output registration in order to read the results correctly, similar with
>>> current SubpartitionView.
>>> 
>>> The above is still a rough idea. Next I plan to create a feature jira to
>>> cover the related changes if possible. It would be better if getting help
>>> from related committers to review the detail designs together.
>>> 
>>> Best,
>>> Zhijiang
>>> 
>>> ------------------------------------------------------------------
>>> 发件人：Till Rohrmann <trohrm...@apache.org>
>>> 发送时间：2018年8月29日(星期三) 17:36
>>> 收件人：dev <dev@flink.apache.org>; Zhijiang(wangzhijiang999) <
>>> wangzhijiang...@aliyun.com>
>>> 主 题：Re: [DISCUSS] Proposal of external shuffle service
>>> 
>>> Thanks for starting this design discussion Zhijiang!
>>> 
>>> I really like the idea to introduce a ShuffleService abstraction which
>> 
>>> allows to have different implementations depending on the actual use case.
>> 
>>> Especially for batch jobs I can clearly see the benefits of persisting the
>>> results somewhere else.
>>> 
>>> Do you already know which interfaces we need to extend and where to
>>> introduce new abstractions?
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Mon, Aug 27, 2018 at 1:57 PM Zhijiang(wangzhijiang999)
>>> <wangzhijiang...@aliyun.com.invalid> wrote:
>>> Hi all!
>>> 
>> 
>>> The shuffle service is responsible for transporting upstream produced data
>>> to the downstream side. In flink, the NettyServer is used for network
>> 
>>> transport service and this component is started in the TaskManager process.
>>> That means the TaskManager can support internal shuffle service which
>>> exists some concerns:
>>> 1. If a task finishes, the ResultPartition of this task still retains
>>> registered in TaskManager, because the output buffers have to be
>>> transported by internal shuffle service in TaskManager. That means the
>>> TaskManager can not be released by ResourceManager until ResultPartition
>>> released. It may waste container resources and can not support well for
>>> dynamic resource scenarios.
>>> 2. If we want to expand another shuffle service implementation, the
>>> current mechanism is not easy to handle, because the output level (result
>>> partition) and transport level (shuffle service) are not divided clearly
>>> and loss of abstraction to be extended.
>>> 
>>> For above considerations, we propose the external shuffle service which
>>> can be deployed on any other external contaienrs, e.g. NodeManager
>> 
>>> container in yarn. Then the TaskManager can be released ASAP ifneeded when
>>> all the internal tasks finished. The persistent output files of these
>>> finished tasks can be served to transport by external shuffle service in
>>> the same machine.
>>> 
>>> Further we can abstract both of the output level and transport level to
>> 
>>> support different implementations. e.g. We realized merging the data of all
>> 
>>> the subpartitions into limited persistent local files for disk improvements
>>> in some scenarios instead of one-subpartiton-one-file.
>>> 
>>> I know it may be a big work for doing this, and I just point out some
>>> ideas, and wish getting any feedbacks from you!
>>> 
>>> Best,
>>> Zhijiang
>>> 
>>> 
>>> 
>> 
>> 
>>

Re: [DISCUSS] Proposal of external shuffle service

Reply via email to