[ https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634128#comment-17634128 ]
AlexHu commented on FLINK-29298:
--------------------------------

I tested the new GitHub commit, but the same problem still exists. If I change the first for() loop to run 1024 times, the bug is triggered every time.

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>         Attachments: image-2022-09-14-10-52-15-259.png,
> image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> When buffer contention is fierce, a task hang can sometimes be observed.
> The thread dump shows that the task thread is blocked in
> requestMemorySegmentBlocking forever. After investigating the heap dump,
> I found that the NetworkBufferPool actually has many buffers, but the
> LocalBufferPool is still unavailable and has not obtained any buffer.
> Looking at the code, I am sure that this is a thread race bug: when the
> task thread polls the last buffer from the LocalBufferPool and triggers the
> onGlobalPoolAvailable callback itself, the callback skips this notification
> (because the LocalBufferPool is still available at that moment). As a result,
> the LocalBufferPool eventually becomes unavailable and never registers a
> callback with the NetworkBufferPool again.
> The conditions for triggering the problem are relatively strict, but I have
> found a stable way to reproduce it. I will try to fix and verify this problem.
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
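
The race described in the issue can be illustrated with a small single-threaded sketch. This is a toy model, not Flink's code: ToyLocalPool, pollLastSegment, markUnavailableFirst and isStuck are hypothetical names, and the "flag first" ordering is only a variation inside this toy, not the patch proposed for FLINK-29298. It assumes the callback is attached with CompletableFuture.thenRun, which runs the action in the calling thread when the future is already complete, so the callback fires while the pool still looks available.

```java
import java.util.concurrent.CompletableFuture;

public class LostNotificationSketch {

    /** Hypothetical toy pool; names are illustrative and are not Flink's classes. */
    static class ToyLocalPool {
        private final boolean markUnavailableFirst;
        boolean available = true;           // availability advertised to the task
        boolean callbackRegistered = false; // is a global-pool callback pending?
        int segments = 1;                   // one local segment left

        ToyLocalPool(boolean markUnavailableFirst) {
            this.markUnavailableFirst = markUnavailableFirst;
        }

        /** Task thread polls the last local segment, then asks the global pool for a callback. */
        void pollLastSegment(CompletableFuture<?> globalAvailable) {
            segments--;
            if (markUnavailableFirst) {
                available = false;
            }
            callbackRegistered = true;
            // On an already-completed future, thenRun executes the callback in the
            // calling thread before the next statement runs.
            globalAvailable.thenRun(this::onGlobalPoolAvailable);
            if (segments == 0) {
                available = false;
            }
        }

        /** Callback from the global pool: skipped when the pool still looks available. */
        void onGlobalPoolAvailable() {
            callbackRegistered = false;
            if (available) {
                return; // notification dropped, nothing is re-registered
            }
            segments++;       // take one segment from the (toy) global pool
            available = true;
        }

        /** Unavailable with no pending callback: nothing will ever wake the pool up. */
        boolean isStuck() {
            return !available && !callbackRegistered;
        }
    }

    static boolean runScenario(boolean markUnavailableFirst) {
        ToyLocalPool pool = new ToyLocalPool(markUnavailableFirst);
        // The global pool is already available, so the callback fires immediately.
        pool.pollLastSegment(CompletableFuture.completedFuture(null));
        return pool.isStuck();
    }

    public static void main(String[] args) {
        System.out.println("flag updated after registration, stuck: " + runScenario(false));
        System.out.println("flag updated before registration, stuck: " + runScenario(true));
    }
}
```

In the first scenario the pool ends up unavailable with no callback pending, which matches the hang described in the issue. In the second, the callback sees the pool as unavailable and fetches a segment. The real LocalBufferPool involves locking and several threads, so this sketch only shows the ordering problem.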