[ 
https://issues.apache.org/jira/browse/FLINK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891463#comment-17891463
 ] 

Vincent Woo commented on FLINK-34636:
-------------------------------------

This issue occurs in version 1.13.2 and may be related to the known issue 
"Network buffer leak when ResultPartition is released (failover)", so I will 
first verify whether the fix for that leak also avoids this problem.
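
For reference while verifying, here is a rough sketch of the flink-conf.yaml 
settings named in the exception message quoted below. The values are assumed 
to be the Flink 1.13 defaults and are not taken from the affected job's 
actual configuration. A 1gb network pool is consistent with the 32768 
buffers of 32768 bytes each reported in the log (32768 x 32768 bytes = 1 GiB). 
If buffers really are leaked on failover, raising these values would 
presumably only delay the timeout rather than prevent it.

{code:yaml}
# Assumed defaults for Flink 1.13; check them against the running cluster.
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 1gb
# Matches the 30000ms timeout reported in the exception message.
taskmanager.network.memory.exclusive-buffers-request-timeout-ms: 30000
{code}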

> Requesting exclusive buffers timeout causes repeated restarts and cannot be 
> automatically recovered
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34636
>                 URL: https://issues.apache.org/jira/browse/FLINK-34636
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.13.2
>            Reporter: Vincent Woo
>            Priority: Major
>         Attachments: image-20240308100308649.png, 
> image-20240308101008765.png, image-20240308101407396.png, 
> image-20240308101934756.png
>
>
> Logs and metrics show that subtasks deployed on one particular TM 
> consistently fail with a "requesting exclusive buffers timeout" exception. 
> During the restart process, the *Network* metric of that TM remained 
> unchanged, while heap memory usage did change. I suspect that the network 
> buffer memory was not properly released during the restart, so the newly 
> deployed task could not obtain network buffers. The problem persisted 
> across repeated restarts, and the application failed to recover 
> automatically.
> (I'm not sure whether there are other causes for this issue.)
> The exception stack and relevant metric screenshots are attached below:
> {code:java}
> 2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
> at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
> at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
> at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
> at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> !image-20240308101407396.png|width=866,height=171!
> Network metric: only this TM stays at 100%, without any variation.
> !image-20240308100308649.png|width=868,height=338!
> Tasks deployed to this TM never reach the RUNNING state, and their status 
> changes slowly.
> !image-20240308101008765.png|width=869,height=118!
> Although the root exception reported by the application is 
> PartitionNotFoundException, the underlying cause found in the logs is 
> IOException: Timeout triggered when requesting exclusive buffers.
> !image-20240308101934756.png|width=869,height=394!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
