[ https://issues.apache.org/jira/browse/FLINK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891463#comment-17891463 ]
Vincent Woo edited comment on FLINK-34636 at 10/21/24 9:32 AM:
---------------------------------------------------------------

This issue is occurring in version 1.13.2, and it looks like it may be related to this network buffer leak: "Network buffer leak when ResultPartition is released (failover)". So I will first verify that the fix for that issue also avoids this one.

> Requesting exclusive buffers timeout causes repeated restarts and cannot be
> automatically recovered
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34636
>                 URL: https://issues.apache.org/jira/browse/FLINK-34636
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.13.2
>            Reporter: Vincent Woo
>            Priority: Major
>         Attachments: image-20240308100308649.png, image-20240308101008765.png, image-20240308101407396.png, image-20240308101934756.png
>
>
> Based on observation of the logs and metrics, a subtask deployed on one particular TM consistently reported an exception that requesting exclusive buffers had timed out. During the restart process, that TM's *Network* metric remained unchanged (heap memory usage did change). I suspect that the network buffer memory was not properly released during the restart, which caused the newly deployed task to fail to obtain network buffers. The problem persisted across repeated restarts, and the application could not recover automatically (a toy sketch of this failure mode follows at the end of this description).
> (I'm not sure if there are other reasons for this issue.)
> Attached below are screenshots of the exception stack and the relevant metrics:
> {code:java}
> 2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> !image-20240308101407396.png|width=866,height=171!
> Network metric: only this TM is always at 100%, without any variation.
> !image-20240308100308649.png|width=868,height=338!
> The tasks deployed to this TM cannot reach RUNNING, and their status transitions are slow.
> !image-20240308101008765.png|width=869,height=118!
> Although the root exception thrown by the application is PartitionNotFoundException, the actual underlying root-cause exception found in the logs is IOException: Timeout triggered when requesting exclusive buffers.
> !image-20240308101934756.png|width=869,height=394!
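> As a sanity check on the reasoning above, here is a minimal, purely illustrative Java sketch (an analogy, not Flink code): the fixed network buffer pool is modeled as semaphore permits, and a failover path that never returns its permits makes every later timed request fail, no matter how often the task is redeployed. The class name is hypothetical; only the pool size (32768 buffers) and the timed-request idea come from the log above.
> {code:java}
> import java.util.concurrent.Semaphore;
> import java.util.concurrent.TimeUnit;
>
> /** Toy analogy of the suspected leak -- hypothetical code, not Flink's. */
> public class BufferLeakSketch {
>     public static void main(String[] args) throws InterruptedException {
>         // Fixed pool of "network buffers", like the 32768 segments in the log.
>         Semaphore pool = new Semaphore(32768);
>
>         // A task acquires its exclusive buffers, then fails over WITHOUT
>         // returning them -- the suspected missing release on failover.
>         pool.acquire(32768);
>         // Missing on the failover path: pool.release(32768);
>
>         // The redeployed task requests buffers with a timeout, as the log
>         // shows (30000ms by default; shortened here so the demo ends quickly).
>         boolean ok = pool.tryAcquire(1024, 100, TimeUnit.MILLISECONDS);
>         if (!ok) {
>             // Every restart hits this branch: the pool can never refill.
>             System.out.println("Timeout triggered when requesting exclusive buffers");
>         }
>     }
> }
> {code}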
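> Until the leak itself is confirmed and fixed, the exception message points at two possible workarounds: enlarge the network memory pool or lengthen the request timeout. A sketch of the corresponding flink-conf.yaml entries (the keys are the ones named in the log; the values are illustrative, not tuned recommendations):
> {code}
> # flink-conf.yaml -- illustrative values only
> taskmanager.memory.network.fraction: 0.2
> taskmanager.memory.network.min: 256mb
> taskmanager.memory.network.max: 2gb
> # the request timeout, 30000ms by default per the exception message
> taskmanager.network.memory.exclusive-buffers-request-timeout-ms: 60000
> {code}
> Note that if buffers leak on every failover, both knobs only delay exhaustion: the pool still drains over repeated restarts and the failure loop returns.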