[jira] [Comment Edited] (FLINK-15031) Calculate required shuffle memory before allocating slots if resources are specified

Jin Xing (Jira) Thu, 27 May 2021 02:24:39 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352346#comment-17352346
 ]


Jin Xing edited comment on FLINK-15031 at 5/27/21, 9:23 AM:
------------------------------------------------------------

Calculating and announcing the volume of required network memory for shuffle is 
a really useful functionality which helps a lot to improve user experience and 
avoid suffering from network memory shortage when shuffle. With the progress of 
FLIP-156, user will be allowed to set a non-UNKNOWN ResourceProfile for a 'Slot 
Sharing Group'. It's time to move this ticket forward and take budget of 
network memory into the consideration of 'Fine-Grained Resource Management'.

The idea of PR-10462 is straight – ResourceRequirementsRetriever retrieves 
network memory requirement from ShuffleMaster and slot allocator requests 
resource taking network memory into consideration. The major concern is that 
how much network memory should ShuffleMaster announce.

In current Flink, there's one NetworkBufferPool and multiple LocalBufferPools 
could be created from it. When creating a LocalBufferPool, caller specifies 
'minimum buffers' and 'maximum buffers' of the LocalBufferPool. 
NetworkBufferPool checks if its capacity could satisfy the sum of all minimum 
requirements of LocalBufferPools. Exception of 'Insufficient number of network 
buffers' thrown if the checking fails.

At first glance the network memory requirement that ShuffleMaster announces for 
a ExecutionVertex could be the sum of minimum requirements. But it's not always 
true.
 NetworkBufferPool allows the unused buffers to be floated between 
LocalBufferPools. With this mechanism, allocating buffers from 
NetworkBufferPool could fail if corresponding portion are already floated to 
other consumers but cannot be returned for a period [1].

In coarse grained resource management, users always configure the capacity of 
NetworkBufferPool with a relatively safe value, thus the issue of [1] can be 
easily bypassed. But it could be much more serious if we just init 
NetworkBufferPool with the minimum requirement in fine grained resource 
management. From this point of view, we might need to adjust the announcing 
strategy of network memory and take buffer floating into consideration.

A concern of this approach is that whether it could over allocate and decrease 
memory efficiency. But if user wants to fully avoid the issue of [1], the 
capacity of NetworkBufferPool should also be big enough to cover buffer 
floating. What's more it's really hard for a common user to configure the size 
of NetworkBufferPool with a 'just enough' value, which is safe but there's no 
waste. Fine grained network memory allocation gives the certainty and accuracy 
and help user to avoid suffering from the tunning of network memory.

What do you think ?

Also cc [~trohrmann] [~zhuzh] [~xintongsong] [~karmagyz]

 

 

[1] https://issues.apache.org/jira/browse/FLINK-12852


was (Author: jinxing6...@126.com):
Calculating and announcing the volume of required network memory for shuffle is 
a really useful functionality which helps a lot to improve user experience and 
avoid suffering from network memory shortage when shuffle. With the progress of 
FLIP-156, user will be allowed to set a non-UNKNOWN ResourceProfile for a 'Slot 
Sharing Group'. It's time to move this ticket forward and take budget of 
network memory into the consideration of 'Fine-Grained Resource Management'.

The idea of PR-10462 is straight – ResourceRequirementsRetriever retrieves 
network memory requirement from ShuffleMaster and slot allocator requests 
resource taking network memory into consideration. The major concern is that 
how much network memory should ShuffleMaster announce.

In current Flink, there's one NetworkBufferPool and multiple LocalBufferPools 
could be created from it. When creating a LocalBufferPool, caller specifies 
'minimum buffers' and 'maximum buffers' of the LocalBufferPool. 
NetworkBufferPool checks if its capacity could satisfy the sum of all minimum 
requirements of LocalBufferPools. Exception of 'Insufficient number of network 
buffers' thrown if the checking fails.

At first glance the network memory requirement that ShuffleMaster announces for 
a ExecutionVertex could be the sum of minimum requirements. But it's not always 
true.
 NetworkBufferPool allows the unused buffers to be floated between 
LocalBufferPools. With this mechanism, allocating buffers from 
NetworkBufferPool could fail if corresponding portion are already floated to 
other consumers but cannot be returned for a period [1].

In coarse grained resource management, users always configure the capacity of 
NetworkBufferPool with a relatively safe value, thus the issue of [1] can be 
easily bypassed. But it could be much more serious if we just init 
NetworkBufferPool with the minimum requirement in fine grained resource 
management. From this point of view, we might need to adjust the announcing 
strategy of network memory and take buffer floating into consideration.

A concern of this approach is that whether it could over allocate and decrease 
memory efficiency. But if user wants to fully avoid the issue of [1], the 
capacity of NetworkBufferPool should also be big enough to cover buffer 
floating. What's more it's really hard for a common user to configure the size 
of NetworkBufferPool with a 'just enough' value, which is safe but there's no 
waste. Fine grained network memory allocation gives the certainty and accuracy 
and help user to avoid suffering from the tunning of network memory.

Also cc [~xintongsong] [~karmagyz]

 

 

[1] https://issues.apache.org/jira/browse/FLINK-12852

> Calculate required shuffle memory before allocating slots if resources are 
> specified
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-15031
>                 URL: https://issues.apache.org/jira/browse/FLINK-15031
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In cases where resources are specified, we expect each operator to declare 
> required resources before using them. In this way, no resource related error 
> should happen if resources are not used beyond what was declared. This 
> ensures a deployed task would not fail due to insufficient resources in TM, 
> which may result in unnecessary failures and may even cause a job hanging 
> forever, failing repeatedly on deploying tasks to a TM with insufficient 
> resources.
> Shuffle memory is the last missing piece for this goal at the moment. Minimum 
> network buffers are required by tasks to work. Currently a task is possible 
> to be deployed to a TM with insufficient network buffers, and fails on 
> launching.
> To avoid that, we should calculate required network memory for a 
> task/SlotSharingGroup before allocating a slot for it.
> The required shuffle memory can be derived from the number of required 
> network buffers. The number of buffers required by a task (ExecutionVertex) is
> {code:java}
> exclusive buffers for input channels(i.e. numInputChannel * 
> buffersPerChannel) + required buffers for result partition buffer 
> pool(currently is numberOfSubpartitions + 1)
> {code}
> Note that this is for the {{NettyShuffleService}} case. For custom shuffle 
> services, currently there is no way to get the required shuffle memory of a 
> task.
> To make it simple under dynamic slot sharing, the required shuffle memory for 
> a task should be the max required shuffle memory of all {{ExecutionVertex}} 
> of the same {{ExecutionJobVertex}}. And the required shuffle memory for a 
> slot sharing group should be the sum of shuffle memory for each 
> {{ExecutionJobVertex}} instance within.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-15031) Calculate required shuffle memory before allocating slots if resources are specified

Reply via email to