[jira] [Updated] (FLINK-33668) Decoupling Shuffle network memory and job topology

Jiang Xin (Jira) Mon, 27 Nov 2023 19:11:04 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-33668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jiang Xin updated FLINK-33668:
------------------------------
    Description: 
With FLINK-30469 and FLINK-31643, we have decoupled the shuffle network memory 
and the parallelism of tasks by limiting the number of buffers for each 
InputGate and ResultPartition. However, when too many shuffle tasks run 
simultaneously on the same TaskManager, "Insufficient number of network 
buffers" errors can still occur. This usually happens when Slot Sharing Groups 
are enabled or a TaskManager contains multiple slots.

We want to make sure that the TaskManager does not encounter "Insufficient 
number of network buffers" even if there are dozens of InputGates and 
ResultPartitions running on the same TaskManager simultaneously.

  was:
With [FLINK-30469|https://issues.apache.org/jira/browse/FLINK-30469]  and 
[FLINK-31643|https://issues.apache.org/jira/browse/FLINK-31643], we have 
decoupled the shuffle network memory and the parallelism of tasks by limiting 
the number of buffers for each InputGate and ResultPartition. However, when too 
many shuffle tasks are running simultaneously on the same TaskManager, 
"Insufficient number of network buffers" errors would still occur. This usually 
happens when Slot Sharing Group is enabled or a TaskManager contains multiple 
slots.

So we need to make sure that the TaskManager does not encounter "Insufficient 
number of network buffers" even if there are dozens of InputGates and 
ResultPartitions running on the same TaskManager simultaneously.


> Decoupling Shuffle network memory and job topology
> --------------------------------------------------
>
>                 Key: FLINK-33668
>                 URL: https://issues.apache.org/jira/browse/FLINK-33668
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Jiang Xin
>            Priority: Major
>             Fix For: 1.19.0
>
>
> With FLINK-30469 and FLINK-31643, we have decoupled the shuffle network 
> memory and the parallelism of tasks by limiting the number of buffers for 
> each InputGate and ResultPartition. However, when too many shuffle tasks 
> run simultaneously on the same TaskManager, "Insufficient number of 
> network buffers" errors can still occur. This usually happens when Slot 
> Sharing Groups are enabled or a TaskManager contains multiple slots.
> We want to make sure that the TaskManager does not encounter "Insufficient 
> number of network buffers" even if there are dozens of InputGates and 
> ResultPartitions running on the same TaskManager simultaneously.
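
The failure mode described above comes down to simple arithmetic: a per-gate 
cap bounds each InputGate and ResultPartition, but the total demand still grows 
with how many of them share one TaskManager. A minimal sketch of that 
arithmetic follows. The class, method names, pool size, and per-endpoint cap 
are all hypothetical values chosen for illustration; they are not Flink APIs 
or Flink defaults.

```java
// Illustrative arithmetic only. Every name and number here is hypothetical
// and chosen for the example; none of it is a Flink API or a Flink default.
final class BufferDemandSketch {

    /** Total buffers needed if every gate and partition claims its capped share. */
    static long totalRequired(int inputGates, int resultPartitions, int buffersPerEndpoint) {
        return (long) (inputGates + resultPartitions) * buffersPerEndpoint;
    }

    /** True when the shared pool cannot cover the combined demand. */
    static boolean isInsufficient(long poolSize, int inputGates, int resultPartitions,
                                  int buffersPerEndpoint) {
        return totalRequired(inputGates, resultPartitions, buffersPerEndpoint) > poolSize;
    }

    public static void main(String[] args) {
        long poolSize = 4096;   // hypothetical TaskManager-wide buffer count
        int perEndpoint = 100;  // hypothetical cap per InputGate / ResultPartition

        // A few endpoints fit: (4 + 4) * 100 = 800 buffers, below the pool size.
        System.out.println(isInsufficient(poolSize, 4, 4, perEndpoint));

        // Dozens of endpoints do not: (32 + 32) * 100 = 6400 buffers, above the pool size.
        System.out.println(isInsufficient(poolSize, 32, 32, perEndpoint));
    }
}
```

Under these assumed numbers the per-endpoint cap stays the same in both cases, 
and only the number of co-located InputGates and ResultPartitions changes the 
outcome, which is the dependency on job topology that this issue aims to remove.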



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
