[ https://issues.apache.org/jira/browse/FLINK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weijie Guo updated FLINK-28925: ------------------------------- Description: Through tpc-ds testing and code analysis, I found some thread unsafe problems in hybrid shuffle: # HsSubpartitionMemeoryDataManager#consumeBuffer should return a readOnlySlice buffer to downstream instead of original buffer: If the spilling thread is processing while downstream task is consuming the same buffer, the amount of data written to the disk will be smaller than the actual value. To solve this, we should let the consuming thread and the spilling thread share the same data but not index. # HsSubpartitionMemoryDataManager#releaseSubpartitionBuffers should ignore the release decision if the buffer already removed from bufferIndexToContexts instead of throw an exception. It should be pointed out that although the actual release operation is synchronous, a double release can still happen. The reason is that non-global decisions do not need to be synchronized. In other words, the main task thread and the consumer thread may decide to release a buffer at the same time. was: Through tpc-ds testing and code analysis, I found some thread unsafe problems in hybrid shuffle: # HsSubpartitionMemeoryDataManager#consumeBuffer should return a readOnlySlice buffer to downstream instead of original buffer: If the spilling thread is processing while downstream task is consuming the same buffer, the amount of data written to the disk will be smaller than the actual value. To solve this, we should let the consuming thread and the spilling thread share the same data but not index. # HsSubpartitionMemoryDataManager#releaseSubpartitionBuffers should ignore the release decision if the buffer already removed from bufferIndexToContexts instead of throw an exception. > Fix the concurrency problem in hybrid shuffle > --------------------------------------------- > > Key: FLINK-28925 > URL: https://issues.apache.org/jira/browse/FLINK-28925 > Project: Flink > Issue Type: Bug > Components: Runtime / Network > Affects Versions: 1.16.0 > Reporter: Weijie Guo > Priority: Blocker > Fix For: 1.16.0 > > > Through tpc-ds testing and code analysis, I found some thread unsafe problems > in hybrid shuffle: > # HsSubpartitionMemeoryDataManager#consumeBuffer should return a > readOnlySlice buffer to downstream instead of original buffer: If the > spilling thread is processing while downstream task is consuming the same > buffer, the amount of data written to the disk will be smaller than the > actual value. To solve this, we should let the consuming thread and the > spilling thread share the same data but not index. > # HsSubpartitionMemoryDataManager#releaseSubpartitionBuffers should ignore > the release decision if the buffer already removed from bufferIndexToContexts > instead of throw an exception. It should be pointed out that although the > actual release operation is synchronous, a double release can still happen. > The reason is that non-global decisions do not need to be synchronized. In > other words, the main task thread and the consumer thread may decide to > release a buffer at the same time. -- This message was sent by Atlassian Jira (v8.20.10#820010)