Weijie Guo created FLINK-28925: ---------------------------------- Summary: Fix the concurrency problem in hybrid shuffle Key: FLINK-28925 URL: https://issues.apache.org/jira/browse/FLINK-28925 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.16.0 Reporter: Weijie Guo Fix For: 1.16.0
Through tpc-ds testing and code analysis, I found some thread unsafe problems in hybrid shuffle: # HsSubpartitionMemeoryDataManager#consumeBuffer should return a readOnlySlice buffer to downstream instead of original buffer: If the spilling thread is processing while downstream task is consuming the same buffer, the amount of data written to the disk will be smaller than the actual value. To solve this, we should let the consuming thread and the spilling thread share the same data but not index. # HsSubpartitionMemoryDataManager#releaseSubpartitionBuffers should ignore the release decision if the buffer already removed from bufferIndexToContexts instead of throw an exception. -- This message was sent by Atlassian Jira (v8.20.10#820010)