[jira] [Commented] (FLINK-29298) LocalBufferPool request buffer from NetworkBufferPool hanging

Weijie Guo (Jira) Mon, 14 Nov 2022 19:33:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634138#comment-17634138
 ]


Weijie Guo commented on FLINK-29298:
------------------------------------

[~AlexXXX] This PR has not been reviewed yet, so it may not be the final 
solution. In addition, I suspect that this may not be the only problem in the 
LocalBufferPool; I need to confirm that once this one is fixed. As for the 
change of the first for() statement to 1024, this is expected: the 
networkBufferPool in the test class has only 1024 buffers, and one buffer is 
taken in the @Before method, so the maximum number of buffer requests at 
line 259 is 1023.
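
The buffer arithmetic above can be illustrated with a minimal, self-contained 
sketch. It does not use Flink's classes: the queue, the class name and the 
method name below are hypothetical stand-ins, assuming only that the pool holds 
1024 buffers and that setup takes one of them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferCountSketch {
    // Pool size mentioned in the comment above.
    static final int TOTAL_BUFFERS = 1024;

    /**
     * Counts how many buffers can still be requested from a pool of
     * {@code total} buffers after {@code takenInSetup} were already taken.
     * The queue is a hypothetical stand-in for the global pool.
     */
    static int countRequestableBuffers(int total, int takenInSetup) {
        Deque<byte[]> globalPool = new ArrayDeque<>();
        for (int i = 0; i < total; i++) {
            globalPool.add(new byte[0]);
        }
        // Stand-in for the buffer taken in the @Before method.
        for (int i = 0; i < takenInSetup; i++) {
            globalPool.poll();
        }
        // Keep requesting until the pool is exhausted.
        int granted = 0;
        while (globalPool.poll() != null) {
            granted++;
        }
        return granted;
    }

    public static void main(String[] args) {
        System.out.println(countRequestableBuffers(TOTAL_BUFFERS, 1)); // prints 1023
    }
}
```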

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>         Attachments: image-2022-09-14-10-52-15-259.png, 
> image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> When buffer contention is fierce, the task can sometimes be observed to 
> hang. The thread dump shows that the task thread is blocked in 
> requestMemorySegmentBlocking forever. After investigating the heap dump, I 
> found that the NetworkBufferPool actually has many buffers, but the 
> LocalBufferPool is still unavailable and has not obtained any buffer.
> Looking at the code, I am sure that this is a thread race bug: when the 
> task thread polls the last buffer out of the LocalBufferPool and triggers the 
> onGlobalPoolAvailable callback itself, the callback skips this notification 
> (as the LocalBufferPool is still available at that moment). As a result, the 
> BufferPool eventually becomes unavailable and never registers a callback 
> with the NetworkBufferPool.
> The conditions for triggering the problem are relatively strict, but I have 
> found a stable way to reproduce it. I will try to fix and verify this problem.
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
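
The skipped notification described above can be modelled with a small 
single-threaded sketch. This is a simplification under stated assumptions, not 
Flink's code: the class, its fields and endsUpStuck are hypothetical, and it 
assumes the callback is attached with CompletableFuture.thenRun to a future 
that is already complete, in which case thenRun executes the callback 
immediately in the calling thread. The second ordering is shown only for 
contrast and is not claimed to be the change made in the pull request.

```java
import java.util.concurrent.CompletableFuture;

public class LostNotificationSketch {
    // Hypothetical stand-in for the local pool's availability flag.
    private boolean localAvailable = true;
    // Counts how often the callback actually fetched from the global pool.
    private int segmentsFetched = 0;

    // Simplified callback: does nothing while the local pool looks available.
    private void onGlobalPoolAvailable() {
        if (localAvailable) {
            return; // notification skipped
        }
        segmentsFetched++;
        localAvailable = true;
    }

    /**
     * Models the moment right after the task thread polled the last local
     * buffer. Returns true if the pool ends up unavailable although the
     * callback never fetched anything, i.e. the notification was lost.
     */
    static boolean endsUpStuck(boolean markUnavailableFirst) {
        LostNotificationSketch pool = new LostNotificationSketch();
        // The global pool already has buffers, so its future is complete.
        CompletableFuture<Void> globalAvailable = CompletableFuture.completedFuture(null);
        if (markUnavailableFirst) {
            pool.localAvailable = false;
            globalAvailable.thenRun(pool::onGlobalPoolAvailable);
        } else {
            // Callback runs right here, still sees "available", and is skipped.
            globalAvailable.thenRun(pool::onGlobalPoolAvailable);
            pool.localAvailable = false;
        }
        return !pool.localAvailable && pool.segmentsFetched == 0;
    }

    public static void main(String[] args) {
        System.out.println(endsUpStuck(false)); // prints true
        System.out.println(endsUpStuck(true));  // prints false
    }
}
```

In the first ordering nothing remains registered with the global pool once the 
flag is reset, which matches the symptom in the description: the global pool 
has buffers, yet the local pool stays unavailable.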



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
