[ https://issues.apache.org/jira/browse/FLINK-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632608#comment-15632608 ]
Jamie Grier commented on FLINK-4545:
------------------------------------

Big +1! In general I would love to see this improved. In my experience this is the "one thing" that people run into with Flink: everything else "just works", but this is the one parameter they have to set/tune, and it's very confusing to newcomers. The equation to get this right is complex, and the "correct" setting changes based on how they deploy the job, what parallelism they use, how many TMs, etc.

It also often happens that things are working, then a user changes their job a bit (adding a keyBy, for instance), and then it stops working and they have a hard time understanding why.

Is there a way we can set this parameter automatically in a majority of use cases? If folks are running single jobs directly on YARN, for instance, it seems we should have all the information necessary to set this parameter auto-magically, or at least fail fast and tell the user what the parameter should be set to.

> Flink automatically manages TM network buffer
> ---------------------------------------------
>
>                 Key: FLINK-4545
>                 URL: https://issues.apache.org/jira/browse/FLINK-4545
>             Project: Flink
>          Issue Type: Wish
>            Reporter: Zhenzhong Xu
>
> Currently, the number of network buffers per task manager is preconfigured and
> the memory is pre-allocated through the taskmanager.network.numberOfBuffers
> config. In a job DAG with a shuffle phase, this number can grow very large
> depending on the TM cluster size. The formula for calculating the buffer count
> is documented here
> (https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#configuring-the-network-buffers).
>
> #slots-per-TM^2 * #TMs * 4
>
> In a standalone deployment, we may need to control the task manager cluster
> size dynamically and then leverage the upcoming Flink feature to support
> scaling job parallelism/rescaling at runtime.
> If the buffer count config is static at runtime and cannot be changed without
> restarting the task manager process, this may add latency and complexity to
> the scaling process. I am wondering if there is already any discussion around
> whether the network buffers should be automatically managed by Flink, or
> whether Flink should at least expose some API to allow them to be
> reconfigured. Let me know if there is any existing JIRA that I should follow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
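
For illustration, the rule of thumb quoted in the issue (#slots-per-TM^2 * #TMs * 4) and the "fail fast and tell the user" idea from the comment can be sketched as below. This is a minimal sketch, not Flink code: the function names and the example cluster sizes are hypothetical, and only the formula and the taskmanager.network.numberOfBuffers key come from the message above.

```python
def required_network_buffers(slots_per_tm: int, num_tms: int) -> int:
    """Rule of thumb quoted in the issue: #slots-per-TM^2 * #TMs * 4."""
    if slots_per_tm <= 0 or num_tms <= 0:
        raise ValueError("slots_per_tm and num_tms must be positive")
    return slots_per_tm ** 2 * num_tms * 4


def check_configured_buffers(configured: int, slots_per_tm: int, num_tms: int) -> int:
    """Hypothetical fail-fast check: reject a setting below the rule of thumb.

    Returns the computed requirement when the configured value is sufficient.
    """
    needed = required_network_buffers(slots_per_tm, num_tms)
    if configured < needed:
        raise ValueError(
            f"taskmanager.network.numberOfBuffers is {configured}, "
            f"but the rule of thumb suggests at least {needed} "
            f"for {slots_per_tm} slots per TM and {num_tms} TMs"
        )
    return needed


if __name__ == "__main__":
    # Example sizes are made up: 8 slots per TM on 20 TMs.
    print(required_network_buffers(8, 20))  # 8^2 * 20 * 4 = 5120
```

Because the slot count enters quadratically, doubling the slots per TM quadruples the suggested buffer count, which is one reason a setting that worked for a small deployment can stop working after a change in parallelism or cluster size.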