[ 
https://issues.apache.org/jira/browse/FLINK-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930088#comment-15930088
 ] 

Stephan Ewen commented on FLINK-4545:
-------------------------------------

There are a few reasons why I think the GlobalBufferPool would be helpful:

  - It is hard to know up front exactly how many network buffers will be 
needed. A TaskManager may get different tasks after a recovery (with more 
channels). In batch jobs, tasks are deployed lazily and a TaskManager may get 
more tasks over time.
  - Having more network memory is for batch jobs generally quite beneficial. 
But how many more? Usually as many as the system can spare.
  - Also on the streaming side, we plan to make use of spare buffers to help 
with better recovery in the future. The change now would make sure that this is 
possible without changing the memory configuration again.
  - Finally, on demand-buffer-allocation can help, but has also a downside: A 
system that reserves/allocates memory up from is more predictable later.

> Flink automatically manages TM network buffer
> ---------------------------------------------
>
>                 Key: FLINK-4545
>                 URL: https://issues.apache.org/jira/browse/FLINK-4545
>             Project: Flink
>          Issue Type: Wish
>          Components: Network
>            Reporter: Zhenzhong Xu
>
> Currently, the number of network buffer per task manager is preconfigured and 
> the memory is pre-allocated through taskmanager.network.numberOfBuffers 
> config. In a Job DAG with shuffle phase, this number can go up very high 
> depends on the TM cluster size. The formula for calculating the buffer count 
> is documented here 
> (https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#configuring-the-network-buffers).
>   
> #slots-per-TM^2 * #TMs * 4
> In a standalone deployment, we may need to control the task manager cluster 
> size dynamically and then leverage the up-coming Flink feature to support 
> scaling job parallelism/rescaling at runtime. 
> If the buffer count config is static at runtime and cannot be changed without 
> restarting task manager process, this may add latency and complexity for 
> scaling process. I am wondering if there is already any discussion around 
> whether the network buffer should be automatically managed by Flink or at 
> least expose some API to allow it to be reconfigured. Let me know if there is 
> any existing JIRA that I should follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to