Rui Fan created FLINK-32298: ------------------------------- Summary: The outputQueueSize is negative Key: FLINK-32298 URL: https://issues.apache.org/jira/browse/FLINK-32298 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.18.0 Reporter: Rui Fan Assignee: Rui Fan Attachments: image-2023-06-09-17-27-46-429.png
h1. Backgraound The outputQueueSize indicates `The real size of queued output buffers in bytes.`, so it shouldn't be negative. However, it may be negative in some cases. h2. How outputQueueSize is generated? TotalWrittenBytes: *_BufferWritingResultPartition#totalWrittenBytes_* records how many data is written to ResultPartition. TotalSentNumberOfBytes: *_PipelinedSubpartition#totalNumberOfBytes_* records how many data is sent to downstream. The outputQueueSize = TotalWrittenBytes - TotalSentNumberOfBytes. h1. Bug The TotalSentNumberOfBytes may be larger than TotalWrittenBytes due to some data are written to the PipelinedSubpartition without the BufferWritingResultPartition, such as : # PipelinedSubpartition#finishReadRecoveredState writes the `EndOfChannelStateEvent` even if the unaligned checkpoint is disable # PipelinedSubpartition#addRecovered writes channel state(if the job recovered from unaligned checkpoint, the outputQueueSize is totally wrong) # PipelinedSubpartition#finish writes the `EndOfPartitionEvent` !image-2023-06-09-17-27-46-429.png|width=1033,height=296! h1. Solution PipelinedSubpartition should is written through BufferWritingResultPartition, and all writes should be counted. By the way, outputQueueSize doesn't matter because it's just a metric, it doesn't affect data processing. I found this bug because some of our flink scenarios need to use adaptive rebalance (FLINK-31655), I'm developing it in our internal version, which relies on the correct outputQueueSize to select the low pressure channel. -- This message was sent by Atlassian Jira (v8.20.10#820010)