[ https://issues.apache.org/jira/browse/FLINK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352346#comment-17352346 ]
Jin Xing edited comment on FLINK-15031 at 5/27/21, 9:23 AM:
------------------------------------------------------------

Calculating and announcing the network memory required for shuffle is a
really useful feature: it improves the user experience and helps avoid
network memory shortages during shuffle. With the progress of FLIP-156,
users will be allowed to set a non-UNKNOWN ResourceProfile for a 'Slot
Sharing Group'. It's time to move this ticket forward and take the network
memory budget into consideration in 'Fine-Grained Resource Management'.

The idea of PR-10462 is straightforward: ResourceRequirementsRetriever
retrieves the network memory requirement from ShuffleMaster, and the slot
allocator requests resources taking network memory into account. The major
concern is how much network memory ShuffleMaster should announce.

In current Flink, there is one NetworkBufferPool, and multiple
LocalBufferPools can be created from it. When creating a LocalBufferPool,
the caller specifies its 'minimum buffers' and 'maximum buffers'.
NetworkBufferPool checks whether its capacity can satisfy the sum of the
minimum requirements of all LocalBufferPools. An 'Insufficient number of
network buffers' exception is thrown if the check fails.

At first glance, the network memory requirement that ShuffleMaster
announces for an ExecutionVertex could be the sum of the minimum
requirements. But that is not always enough. NetworkBufferPool allows
unused buffers to float between LocalBufferPools. With this mechanism,
allocating buffers from NetworkBufferPool can fail if the corresponding
portion has already floated to other consumers and cannot be returned for
some time [1]. In coarse-grained resource management, users typically
configure the capacity of NetworkBufferPool with a relatively safe value,
so the issue of [1] can be easily bypassed.
But it could be much more serious if we simply initialize NetworkBufferPool
with the minimum requirement in fine-grained resource management. From this
point of view, we might need to adjust the strategy for announcing network
memory and take buffer floating into consideration.

A concern with this approach is whether it could over-allocate and reduce
memory efficiency. But if a user wants to fully avoid the issue of [1], the
capacity of NetworkBufferPool has to be big enough to cover buffer floating
anyway. What's more, it is really hard for a typical user to configure the
size of NetworkBufferPool with a 'just enough' value that is safe but
wastes nothing. Fine-grained network memory allocation provides certainty
and accuracy and spares users the tuning of network memory.

What do you think?

Also cc [~trohrmann] [~zhuzh] [~xintongsong] [~karmagyz]

[1] https://issues.apache.org/jira/browse/FLINK-12852

> Calculate required shuffle memory before allocating slots if resources are
> specified
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-15031
>                 URL: https://issues.apache.org/jira/browse/FLINK-15031
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In cases where resources are specified, we expect each operator to declare
> its required resources before using them. In this way, no resource-related
> error should happen as long as resources are not used beyond what was
> declared. This ensures that a deployed task will not fail due to
> insufficient resources in the TM, which could otherwise result in
> unnecessary failures and could even cause a job to hang forever, failing
> repeatedly on deploying tasks to a TM with insufficient resources.
>
> Shuffle memory is the last missing piece for this goal at the moment. Tasks
> need a minimum number of network buffers to work. Currently a task can be
> deployed to a TM with insufficient network buffers and then fail on launch.
>
> To avoid that, we should calculate the required network memory for a
> task/SlotSharingGroup before allocating a slot for it.
>
> The required shuffle memory can be derived from the number of required
> network buffers. The number of buffers required by a task (ExecutionVertex)
> is
> {code:java}
> exclusive buffers for input channels (i.e. numInputChannel * buffersPerChannel)
>   + required buffers for result partition buffer pool (currently numberOfSubpartitions + 1)
> {code}
> Note that this is for the {{NettyShuffleService}} case. For custom shuffle
> services, there is currently no way to get the required shuffle memory of a
> task.
>
> To keep things simple under dynamic slot sharing, the required shuffle
> memory for a task should be the maximum required shuffle memory of all
> {{ExecutionVertex}} instances of the same {{ExecutionJobVertex}}. The
> required shuffle memory for a slot sharing group should then be the sum of
> the shuffle memory of each {{ExecutionJobVertex}} instance within it.
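
To make the arithmetic in the description concrete, here is a minimal sketch of the calculation for the NettyShuffleService case. It is not Flink code: the class ShuffleMemorySketch, its method names, the job vertex names and the 32 KiB buffer size are all assumptions made for illustration, and it assumes one result partition per ExecutionVertex. It also covers only the minimum requirement and does not account for the buffer floating discussed in the comment.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink. */
final class ShuffleMemorySketch {

    /** Minimum buffers for one ExecutionVertex, following the formula in the description. */
    static int requiredBuffers(int numInputChannels, int buffersPerChannel, int numberOfSubpartitions) {
        int exclusiveInputBuffers = numInputChannels * buffersPerChannel;
        int resultPartitionBuffers = numberOfSubpartitions + 1; // assumes a single result partition
        return exclusiveInputBuffers + resultPartitionBuffers;
    }

    /** Bytes for one ExecutionJobVertex: the max over its ExecutionVertex instances. */
    static long requiredBytesForJobVertex(List<Integer> buffersPerExecutionVertex, int bufferSizeBytes) {
        int maxBuffers = 0;
        for (int buffers : buffersPerExecutionVertex) {
            maxBuffers = Math.max(maxBuffers, buffers);
        }
        return (long) maxBuffers * bufferSizeBytes;
    }

    /** Bytes for a slot sharing group: the sum over the job vertices it contains. */
    static long requiredBytesForSlotSharingGroup(
            Map<String, List<Integer>> buffersByJobVertex, int bufferSizeBytes) {
        long total = 0;
        for (List<Integer> buffers : buffersByJobVertex.values()) {
            total += requiredBytesForJobVertex(buffers, bufferSizeBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        int bufferSizeBytes = 32 * 1024; // assumed buffer size for this example

        // A source with no inputs and 4 subpartitions, and two subtasks of a
        // downstream vertex with different numbers of input channels.
        int source = requiredBuffers(0, 2, 4);
        int mapSubtaskA = requiredBuffers(4, 2, 3);
        int mapSubtaskB = requiredBuffers(2, 2, 3);

        long groupBytes = requiredBytesForSlotSharingGroup(
                Map.of("source", List.of(source), "map", List.of(mapSubtaskA, mapSubtaskB)),
                bufferSizeBytes);
        System.out.println("Required shuffle memory for the group (bytes): " + groupBytes);
    }
}
```

Taking the max per ExecutionJobVertex means every subtask of that vertex fits into any slot of the group, at the cost of some over-reservation for the subtasks that need fewer buffers.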