Hi all,

Thanks everyone for your suggestions and feedback. I think it is a good idea
to determine the default size of the separated pool through testing. I am
fine with adding the ".size" suffix to the config name, which makes it
clearer to the user. But I am a little worried about adding a "framework"
prefix, because the TM shuffle service is currently only a shuffle plugin
and not part of the framework. So maybe we could add a clear explanation in
the documentation instead?
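For illustration only (the final name is still open in this discussion),
the suffixed variant would read:

    taskmanager.memory.network.batch-read.size: 16m

whereas a "framework"-prefixed variant would be something like
taskmanager.memory.framework.network.batch-read.size (a hypothetical
spelling), which could wrongly suggest to users that the pool belongs to
the framework itself.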
Best,
Guowei

On Tue, Mar 9, 2021 at 3:58 PM 曹英杰(北牧) <kevin....@alibaba-inc.com> wrote:

> Thanks for the suggestions. I will do some tests and share the results
> after the implementation is ready. Then we can give a proper default value.
>
> Best,
> Yingjie
>
> ------------------------------------------------------------------
> From: Till Rohrmann <trohrm...@apache.org>
> Date: 2021-03-05 23:03:10
> To: Stephan Ewen <se...@apache.org>
> Cc: dev <d...@flink.apache.org>; user <user@flink.apache.org>;
> Xintong Song <tonysong...@gmail.com>; 曹英杰(北牧) <kevin....@alibaba-inc.com>;
> Guowei Ma <guowei....@gmail.com>
> Subject: Re: [DISCUSSION] Introduce a separated memory pool for the TM
> merge shuffle
>
> Thanks for this proposal Guowei. +1 for it.
>
> Concerning the default size, maybe we can run some experiments and see how
> the system behaves with different pool sizes.
>
> Cheers,
> Till
>
> On Fri, Mar 5, 2021 at 2:45 PM Stephan Ewen <se...@apache.org> wrote:
>
>> Thanks Guowei, for the proposal.
>>
>> As discussed offline already, I think this sounds good.
>>
>> One thought is that 16m sounds very small for a default read buffer pool.
>> How risky do you think it is to increase this to 32m or 64m?
>>
>> Best,
>> Stephan
>>
>> On Fri, Mar 5, 2021 at 4:33 AM Guowei Ma <guowei....@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> In Flink 1.12 we introduced the TM merge shuffle, but the out-of-the-box
>>> experience of using it is not very good. The main reason is that the
>>> default configuration frequently causes users to encounter OOM errors
>>> [1]. So we hope to introduce a managed memory pool for the TM merge
>>> shuffle to avoid the problem.
>>>
>>> Goals
>>>
>>> 1. Don't affect the streaming and pipelined-shuffle-only batch setups.
>>> 2. Don't mix memory with different life cycles in the same pool, e.g.,
>>> write buffers needed by running tasks and read buffers needed even
>>> after tasks have finished.
>>> 3. Users can use the TM merge shuffle with the default memory
>>> configuration. (Further tuning may be needed for performance
>>> optimization, but it should not fail with the default configuration.)
>>>
>>> Proposal
>>>
>>> 1. Introduce a configuration option
>>> `taskmanager.memory.network.batch-read` to specify the size of this
>>> memory pool. The default value is 16m.
>>> 2. Allocate the pool lazily, i.e., the memory pool is allocated when
>>> the TM merge shuffle is used for the first time.
>>> 3. This pool size will not be added to the TM's total memory size, but
>>> will be considered part of `taskmanager.memory.framework.off-heap.size`.
>>> If the TM merge shuffle is enabled, we need to check that the pool size
>>> is not larger than the framework off-heap size.
>>>
>>> With this default configuration, the allocation of the memory pool is
>>> almost impossible to fail. Currently the default framework off-heap
>>> memory is 128m, which is mainly used by Netty. But since we introduced
>>> zero copy, its usage has been reduced; see [2] for the detailed data.
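>>>
>>> For illustration only (the values below are examples, not
>>> recommendations), a user enlarging the pool under this proposal would
>>> set something like the following in flink-conf.yaml:
>>>
>>>     taskmanager.memory.framework.off-heap.size: 128m
>>>     taskmanager.memory.network.batch-read: 32m
>>>
>>> where the batch-read size must not exceed the framework off-heap size,
>>> since the pool is carved out of it.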
>>>
>>> Known Limitation
>>>
>>> Usability for increasing the memory pool size
>>>
>>> In addition to increasing `taskmanager.memory.network.batch-read`, the
>>> user may also need to adjust `taskmanager.memory.framework.off-heap.size`
>>> at the same time. This also means that if the user forgets this, the
>>> check is likely to fail when allocating the memory pool.
>>>
>>> So in the following two situations, we will still prompt the user to
>>> increase the size of `framework.off-heap.size`:
>>>
>>> 1. `taskmanager.memory.network.batch-read` is bigger than
>>> `taskmanager.memory.framework.off-heap.size`.
>>> 2. Allocating the pool encounters an OOM.
>>>
>>> An alternative is that when the user adjusts the size of the memory
>>> pool, the system automatically adjusts the framework off-heap size
>>> accordingly. But we are not entirely sure about this, given its
>>> implicitness and the risk of complicating the memory configuration.
>>>
>>> Potential memory waste
>>>
>>> In the first step, the memory pool will not be released once allocated.
>>> This means that even if there is no subsequent batch job, the pooled
>>> memory cannot be used by other consumers.
>>>
>>> We are not releasing the pool in the first step due to the concern that
>>> frequently allocating/deallocating the entire pool may increase GC
>>> pressure. Investigating how to dynamically release the pool when it is
>>> no longer needed is considered a future follow-up.
>>>
>>> Looking forward to your feedback.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-20740
>>> [2] https://github.com/apache/flink/pull/7368
>>>
>>> Best,
>>> Guowei
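PS: For anyone catching up on this thread, here is a rough Java sketch of
the lazy allocation and the two checks described in the proposal above.
All class and method names are hypothetical, not actual Flink internals;
it only illustrates the intended behavior.

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Queue;

    // Hypothetical illustration of the proposal; not real Flink code.
    final class BatchReadBufferPool {

        private static final int BUFFER_SIZE = 32 * 1024; // example buffer size

        private final long poolBytes;              // taskmanager.memory.network.batch-read
        private final long frameworkOffHeapBytes;  // taskmanager.memory.framework.off-heap.size

        private Queue<ByteBuffer> buffers; // allocated lazily; never released in the first step

        BatchReadBufferPool(long poolBytes, long frameworkOffHeapBytes) {
            this.poolBytes = poolBytes;
            this.frameworkOffHeapBytes = frameworkOffHeapBytes;
        }

        // Called the first time the TM merge shuffle requests read buffers.
        synchronized Queue<ByteBuffer> getOrAllocate() {
            if (buffers != null) {
                return buffers; // already allocated by an earlier batch job
            }
            // Check 1: the pool must fit into the framework off-heap memory.
            if (poolBytes > frameworkOffHeapBytes) {
                throw new IllegalStateException(
                    "taskmanager.memory.network.batch-read exceeds "
                        + "taskmanager.memory.framework.off-heap.size; "
                        + "please increase the framework off-heap size.");
            }
            Queue<ByteBuffer> allocated = new ArrayDeque<>();
            try {
                for (long total = 0; total + BUFFER_SIZE <= poolBytes; total += BUFFER_SIZE) {
                    allocated.add(ByteBuffer.allocateDirect(BUFFER_SIZE));
                }
            } catch (OutOfMemoryError e) {
                // Check 2: direct memory ran out while filling the pool.
                throw new OutOfMemoryError(
                    "Failed to allocate the batch-read buffer pool; consider "
                        + "increasing taskmanager.memory.framework.off-heap.size.");
            }
            buffers = allocated;
            return buffers;
        }
    }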