[ 
https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634128#comment-17634128
 ] 

AlexHu commented on FLINK-29298:
--------------------------------

I tested the new GitHub commit, but the same problem still exists. If I 
change the first for() loop to run 1024 times, the bug is triggered every time.

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>         Attachments: image-2022-09-14-10-52-15-259.png, 
> image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> In scenarios where buffer contention is fierce, the task can sometimes be 
> observed to hang. The thread dump shows that the task thread is blocked in 
> requestMemorySegmentBlocking forever. After investigating the heap dump, I 
> found that the NetworkBufferPool actually has many buffers, but the 
> LocalBufferPool is still unavailable and has not obtained any buffer.
> From the code, I am sure that this is a thread race bug: when the task 
> thread polls the last buffer out of the LocalBufferPool and triggers the 
> onGlobalPoolAvailable callback itself, it skips this notification (as the 
> LocalBufferPool is currently available). As a result, the BufferPool 
> eventually becomes unavailable and never registers a callback with the 
> NetworkBufferPool.
> The conditions for triggering the problem are relatively strict, but I have 
> found a stable way to reproduce it. I will try to fix and verify this problem.
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
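
The race described in the issue above can be illustrated with a small, 
single-threaded model that replays the problematic interleaving 
deterministically. This is only a sketch under stated assumptions, not Flink's 
code: GlobalPool, LocalPool, raceHook, hangs() and isStuck() are hypothetical 
names, the inline callback imitates a callback attached to an 
already-completed future, and the "always re-check" variant is just one 
possible way to avoid the lost notification, not necessarily the fix applied 
in Flink.

```java
// Simplified, hypothetical model of the lost-notification race (not Flink code).
final class LostNotificationSketch {

    /** Stand-in for the global pool: a buffer count plus at most one listener. */
    static final class GlobalPool {
        private int buffers;
        private Runnable listener;

        boolean tryTake() {
            if (buffers == 0) {
                return false;
            }
            buffers--;
            return true;
        }

        /** Returns a buffer and fires the registered listener, if any. */
        void recycle() {
            buffers++;
            Runnable l = listener;
            listener = null;
            if (l != null) {
                l.run();
            }
        }

        /** Runs the callback inline if buffers already exist, else stores it. */
        void onAvailable(Runnable callback) {
            if (buffers > 0) {
                callback.run();
            } else {
                listener = callback;
            }
        }

        int size() {
            return buffers;
        }
    }

    /** Stand-in for the local pool, starting with one buffer and marked available. */
    static final class LocalPool {
        private final GlobalPool global;
        private final boolean skipWhileAvailable;
        private int buffers = 1;
        private boolean available = true;
        private boolean callbackRegistered;
        Runnable raceHook = () -> {}; // simulates another thread's action

        LocalPool(GlobalPool global, boolean skipWhileAvailable) {
            this.global = global;
            this.skipWhileAvailable = skipWhileAvailable;
        }

        void requestBuffer() {
            buffers--; // poll the last local buffer
            if (buffers == 0) {
                refillOrRegister();
                available = buffers > 0; // the flag is updated only after the callback ran
            }
        }

        private void refillOrRegister() {
            if (global.tryTake()) {
                buffers++;
                return;
            }
            raceHook.run(); // a buffer may be recycled right here
            callbackRegistered = true;
            global.onAvailable(this::onGlobalPoolAvailable);
        }

        private void onGlobalPoolAvailable() {
            callbackRegistered = false;
            if (skipWhileAvailable && available) {
                return; // stale "available" flag: the notification is lost
            }
            refillOrRegister();
            available = buffers > 0;
        }

        boolean isStuck() {
            return !available && !callbackRegistered && global.size() > 0;
        }
    }

    /** Replays the interleaving and reports whether the local pool ends up stuck. */
    static boolean hangs(boolean skipWhileAvailable) {
        GlobalPool global = new GlobalPool();
        LocalPool local = new LocalPool(global, skipWhileAvailable);
        local.raceHook = global::recycle; // recycle between tryTake and registration
        local.requestBuffer();
        return local.isStuck();
    }

    public static void main(String[] args) {
        System.out.println("skip while available -> stuck: " + hangs(true));
        System.out.println("always re-check      -> stuck: " + hangs(false));
    }
}
```

In this model, hangs(true) ends with the local pool unavailable, no callback 
registered, and one buffer sitting in the global pool, which matches the 
symptom in the heap dump. hangs(false) picks the buffer up in the inline 
callback instead.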



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
