[ https://issues.apache.org/jira/browse/FLINK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Woo updated FLINK-34636:
--------------------------------
    Description: 
Based on the logs and metrics, a subtask deployed on the same TM consistently failed with an exclusive-buffers request timeout. During each restart the TM's {*}Network{*} metric stayed at the same value (heap memory usage did change). I suspect that the network buffer memory is not properly released during the restart, so the newly deployed task cannot obtain network buffers. The problem persists across repeated restarts, and the application cannot recover automatically.

(I'm not sure whether there are other causes for this issue.)
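For context, a rough back-of-the-envelope estimate of how quickly the pool can be exhausted. This is a sketch only: the gate/channel/slot counts are hypothetical, and the per-channel/per-gate values are the usual Flink defaults, not numbers confirmed from this job:
{code:java}
// Rough estimate only, not Flink's actual accounting.
// Assumed defaults: taskmanager.network.memory.buffers-per-channel = 2,
//                   taskmanager.network.memory.floating-buffers-per-gate = 8.
public class NetworkBufferEstimate {
    public static void main(String[] args) {
        int buffersPerChannel = 2;   // exclusive buffers per remote input channel (default)
        int floatingPerGate   = 8;   // floating buffers per input gate (default)
        int gatesPerTask      = 2;   // hypothetical: input gates of one task
        int channelsPerGate   = 500; // hypothetical: upstream parallelism seen by each gate
        int tasksOnTm         = 8;   // hypothetical: tasks running on this TaskManager

        long perTask = (long) gatesPerTask * (channelsPerGate * buffersPerChannel + floatingPerGate);
        long perTm   = perTask * tasksOnTm;
        System.out.printf("~%d segments per task, ~%d per TM (pool here: 32768 segments)%n",
                perTask, perTm);
        // If the previous attempt's LocalBufferPools still hold their segments when the
        // new attempt calls reserveSegments(), the global pool looks exhausted even
        // though no task is actually running, and the request times out after 30s.
    }
}
{code}
In other words, if the old pools are not destroyed before the new tasks reserve their exclusive buffers, the numbers above never get returned to the global pool.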

Attached below are screenshots of the exception stack and relevant metrics:
{code:java}
2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
	at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
	at java.lang.Thread.run(Thread.java:748)
{code}
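For reference, the mitigations suggested by the exception message, sketched with Flink's Configuration API (the same keys can be set in flink-conf.yaml; the values below are placeholders, not recommendations):
{code:java}
import org.apache.flink.configuration.Configuration;

// Sketch only: the exception message's suggested workarounds as config settings.
public class NetworkBufferConfigExample {
    public static Configuration mitigationConfig() {
        Configuration conf = new Configuration();
        // Give the network stack a larger share of TaskManager memory.
        conf.setString("taskmanager.memory.network.fraction", "0.2");
        conf.setString("taskmanager.memory.network.min", "256mb");
        conf.setString("taskmanager.memory.network.max", "2gb");
        // Or wait longer before the exclusive-buffer request fails (default 30000 ms).
        conf.setString("taskmanager.network.memory.exclusive-buffers-request-timeout-ms", "60000");
        return conf;
    }
}
{code}
In this case these settings would at best delay the failure: if buffers really are leaked across restarts, any pool size will eventually be exhausted.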
!image-20240308101407396.png|width=577,height=114!

Network metric: only this TM stays at 100% the whole time, without any variation.

!image-20240308100308649.png|width=570,height=222!
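To double-check the screenshot, the same numbers can be read from the TaskManager metrics REST endpoint. A minimal sketch, assuming the usual Status.Shuffle.Netty.* metric names and a reachable JobManager REST address (host/port and TaskManager id below are hypothetical):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Query the shuffle memory-segment metrics of one TaskManager to see whether
// the pool is pinned at 100% usage across restarts.
public class ShuffleMemoryCheck {
    public static void main(String[] args) throws Exception {
        String base = "http://jobmanager:8081";   // hypothetical JobManager REST address
        String tmId = "container_xxx";            // hypothetical TaskManager id
        String metrics = "Status.Shuffle.Netty.UsedMemorySegments,"
                       + "Status.Shuffle.Netty.TotalMemorySegments";
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(base + "/taskmanagers/" + tmId + "/metrics?get=" + metrics))
                .GET().build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
{code}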

Tasks deployed to this TM cannot reach RUNNING, and their status transitions are slow.

!image-20240308101008765.png|width=567,height=77!

Although the root exception thrown by the application is PartitionNotFoundException, the actual underlying root cause found in the logs is IOException: Timeout triggered when requesting exclusive buffers.

!image-20240308101934756.png|width=567,height=257!

> Requesting exclusive buffers timeout causes repeated restarts and cannot be 
> automatically recovered
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34636
>                 URL: https://issues.apache.org/jira/browse/FLINK-34636
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>            Reporter: Vincent Woo
>            Priority: Major
>         Attachments: image-20240308100308649.png, 
> image-20240308101008765.png, image-20240308101407396.png, 
> image-20240308101934756.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
