Yingjie Cao created FLINK-31386:
-----------------------------------
Summary: Fix the potential deadlock issue of blocking shuffle
Key: FLINK-31386
URL: https://issues.apache.org/jira/browse/FLINK-31386
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Reporter: Yingjie Cao
Fix For: 1.17.0
Currently, the SortMergeResultPartition may allocate more network buffers than
the guaranteed size of the LocalBufferPool. As a result, some result partitions
may have to wait for other result partitions to release the over-allocated
network buffers before they can continue. However, the result partitions that
have allocated more than their guaranteed buffers rely on the processing of
input data to trigger data spilling and buffer recycling. Processing the input
data in turn relies on the batch reading buffers used by the
SortMergeResultPartitionReadScheduler, which may already have been taken by the
blocked result partitions that are waiting for buffers. A deadlock then occurs.
We can easily fix this deadlock by reserving the guaranteed buffers during
initialization.
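
The effect of reserving buffers up front can be illustrated with a small toy
model. This is only a sketch under stated assumptions: GlobalPool,
PartitionPool and secondPartitionCanProgress are hypothetical names invented
for this example and are not Flink's actual classes or the actual patch. The
model is single-threaded and only shows starvation of the second partition,
which is the precondition for the deadlock described above.

```java
/** Toy model only; these types are hypothetical and are not Flink's buffer pool classes. */
final class ReservedBufferPoolSketch {

    /** Fixed-size pool shared by every partition (stand-in for a global network buffer pool). */
    static final class GlobalPool {
        private int available;

        GlobalPool(int total) {
            this.available = total;
        }

        /** Hands out one buffer, or returns false if none is left. */
        boolean tryTake() {
            if (available == 0) {
                return false;
            }
            available--;
            return true;
        }
    }

    /** Per-partition pool with a guaranteed number of buffers. */
    static final class PartitionPool {
        private final GlobalPool global;
        private int reserved; // buffers already set aside for this partition

        PartitionPool(GlobalPool global, int guaranteed, boolean reserveEagerly) {
            this.global = global;
            if (reserveEagerly) {
                // Claim the guaranteed buffers up front so no other partition can take them.
                for (int i = 0; i < guaranteed; i++) {
                    if (!global.tryTake()) {
                        throw new IllegalStateException("Not enough buffers to reserve");
                    }
                    reserved++;
                }
            }
        }

        /** Uses a reserved buffer first, then falls back to the shared pool. */
        boolean request() {
            if (reserved > 0) {
                reserved--;
                return true;
            }
            return global.tryTake();
        }
    }

    /** Returns whether the second partition still gets a buffer after the first one over-allocates. */
    static boolean secondPartitionCanProgress(boolean reserveEagerly) {
        GlobalPool global = new GlobalPool(4);
        PartitionPool first = new PartitionPool(global, 2, reserveEagerly);
        PartitionPool second = new PartitionPool(global, 2, reserveEagerly);

        // The first partition greedily asks for every buffer it can get.
        while (first.request()) {
            // keep requesting until the pool refuses
        }
        return second.request();
    }

    public static void main(String[] args) {
        System.out.println("lazy:  " + secondPartitionCanProgress(false));
        System.out.println("eager: " + secondPartitionCanProgress(true));
    }
}
```

In this model, lazy allocation lets the first partition drain all four
buffers, so the second partition cannot obtain even its guaranteed share.
With eager reservation each partition holds its two guaranteed buffers from
construction time, so the second partition can always make progress.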