[ https://issues.apache.org/jira/browse/FLINK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891463#comment-17891463 ]
Vincent Woo edited comment on FLINK-34636 at 10/21/24 9:32 AM:
---------------------------------------------------------------

This issue is occurring in version 1.13.2, and it looks like it may be related to this network buffer leak: "Network buffer leak when ResultPartition is released (failover)". So I will first verify that the fix for that issue also avoids this one.

> Requesting exclusive buffers timeout causes repeated restarts and cannot be
> automatically recovered
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34636
>                 URL: https://issues.apache.org/jira/browse/FLINK-34636
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.13.2
>            Reporter: Vincent Woo
>            Priority: Major
>         Attachments: image-20240308100308649.png, image-20240308101008765.png, image-20240308101407396.png, image-20240308101934756.png
>
>
> Based on observation of the logs and metrics, a subtask deployed on one particular TM consistently reported an exception that requesting exclusive buffers had timed out. During the restart process, that TM's *Network* metric remained unchanged (heap memory usage did change). I suspect that the network buffer memory was not properly released during the restart, which caused the newly deployed task to fail to obtain network buffers. The problem persisted across repeated restarts, and the application could not recover automatically (a toy sketch of this failure mode follows at the end of this description).
> (I'm not sure if there are other reasons for this issue.)
> Attached below are screenshots of the exception stack and the relevant metrics:
> {code:java}
> 2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> !image-20240308101407396.png|width=866,height=171!
> Network metric: only this TM is always at 100%, without any variation.
> !image-20240308100308649.png|width=868,height=338!
> The tasks deployed to this TM cannot reach RUNNING, and their status transitions are slow.
> !image-20240308101008765.png|width=869,height=118!
> Although the root exception thrown by the application is PartitionNotFoundException, the actual underlying root-cause exception found in the logs is IOException: Timeout triggered when requesting exclusive buffers.
> !image-20240308101934756.png|width=869,height=394!
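> As a sanity check on the reasoning above, here is a minimal, purely illustrative Java sketch (an analogy, not Flink code): the fixed network buffer pool is modeled as semaphore permits, and a failover path that never returns its permits makes every later timed request fail, no matter how often the task is redeployed. The class name is hypothetical; only the pool size (32768 buffers) and the timed-request idea come from the log above.
> {code:java}
> import java.util.concurrent.Semaphore;
> import java.util.concurrent.TimeUnit;
>
> /** Toy analogy of the suspected leak -- hypothetical code, not Flink's. */
> public class BufferLeakSketch {
>     public static void main(String[] args) throws InterruptedException {
>         // Fixed pool of "network buffers", like the 32768 segments in the log.
>         Semaphore pool = new Semaphore(32768);
>
>         // A task acquires its exclusive buffers, then fails over WITHOUT
>         // returning them -- the suspected missing release on failover.
>         pool.acquire(32768);
>         // Missing on the failover path: pool.release(32768);
>
>         // The redeployed task requests buffers with a timeout, as the log
>         // shows (30000ms by default; shortened here so the demo ends quickly).
>         boolean ok = pool.tryAcquire(1024, 100, TimeUnit.MILLISECONDS);
>         if (!ok) {
>             // Every restart hits this branch: the pool can never refill.
>             System.out.println("Timeout triggered when requesting exclusive buffers");
>         }
>     }
> }
> {code}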
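> Until the leak itself is confirmed and fixed, the exception message points at two possible workarounds: enlarge the network memory pool or lengthen the request timeout. A sketch of the corresponding flink-conf.yaml entries (the keys are the ones named in the log; the values are illustrative, not tuned recommendations):
> {code}
> # flink-conf.yaml -- illustrative values only
> taskmanager.memory.network.fraction: 0.2
> taskmanager.memory.network.min: 256mb
> taskmanager.memory.network.max: 2gb
> # the request timeout, 30000ms by default per the exception message
> taskmanager.network.memory.exclusive-buffers-request-timeout-ms: 60000
> {code}
> Note that if buffers leak on every failover, both knobs only delay exhaustion: the pool still drains over repeated restarts and the failure loop returns.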