Bharath Vissapragada created KAFKA-17862:
--------------------------------------------

             Summary: [buffer pool] corruption during buffer reuse from the pool
                 Key: KAFKA-17862
                 URL: https://issues.apache.org/jira/browse/KAFKA-17862
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.7.1
            Reporter: Bharath Vissapragada
         Attachments: client-config.txt


We noticed malformed batches when using the Kafka Java client with Redpanda. Under certain conditions they caused excessive client retries, and we narrowed the cause down to a client bug: buffers reused from the buffer pool are being corrupted. We were able to reproduce it with Kafka brokers too, so we are fairly certain the bug is on the client. (The full client config is attached, fwiw.)

We narrowed it down to a race condition between produce requests and the expiration of failed batches. If the network flush of a produce request races with the expiration, the produce batch that the request uses is corrupted, so a malformed batch is sent to the broker.

The expiration is triggered by a timeout
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
that eventually deallocates the batch
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]
adding it back to the buffer pool
[https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]

At that point the buffer has probably either been zeroed out, or a competing producer has requested a new append that reuses the freed buffer and starts writing to it, corrupting its contents. If the network flush of a produce batch backed by this buffer is racing with that, a corrupt batch is sent to the broker, resulting in a CRC mismatch.

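To make the suspected interleaving concrete, here is a minimal single-threaded sketch that replays it deterministically. It is not the Kafka client code: `ToyPool`, `replay`, `crcOf` and the payload strings are hypothetical names invented for illustration, and the real pool, batch and sender logic are considerably more involved. CRC-32C is used because that is the checksum of the v2 record batch format; everything else is an assumption about how the race plays out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.CRC32C;

class BufferReuseSketch {

    /** Toy stand-in for a buffer pool (hypothetical): hands freed buffers straight back out. */
    static final class ToyPool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        ToyPool(int size) { this.size = size; }

        ByteBuffer allocate() {
            ByteBuffer b = free.pollFirst();
            return b != null ? b : ByteBuffer.allocate(size);
        }

        void deallocate(ByteBuffer b) {
            b.clear();          // resets position/limit only; the bytes stay in place
            free.addFirst(b);
        }
    }

    /** CRC-32C over the first {@code length} bytes of a heap buffer. */
    static long crcOf(ByteBuffer buf, int length) {
        CRC32C crc = new CRC32C();
        crc.update(buf.array(), 0, length);
        return crc.getValue();
    }

    /**
     * Replays the suspected interleaving in a fixed order.
     * Returns true if the bytes seen at "flush" time still match the
     * checksum recorded when the batch was closed.
     */
    static boolean replay(boolean expireBeforeFlush) {
        ToyPool pool = new ToyPool(64);

        // 1. A batch is filled and closed; its checksum is recorded.
        ByteBuffer batch = pool.allocate();
        byte[] payload = "batch-1-records".getBytes(StandardCharsets.UTF_8);
        batch.put(payload);
        long crcAtClose = crcOf(batch, payload.length);

        if (expireBeforeFlush) {
            // 2. Expiration returns the buffer to the pool while the
            //    in-flight request still holds a reference to it.
            pool.deallocate(batch);

            // 3. A competing append is handed the same buffer and writes to it.
            ByteBuffer reused = pool.allocate();
            reused.put("batch-2-records".getBytes(StandardCharsets.UTF_8));
        }

        // 4. The network flush reads whatever the buffer contains now.
        long crcAtFlush = crcOf(batch, payload.length);
        return crcAtClose == crcAtFlush;
    }

    public static void main(String[] args) {
        System.out.println("no expiry, checksum intact:     " + replay(false));
        System.out.println("expiry + reuse, checksum intact: " + replay(true));
    }
}
```

In this sketch the second call reports a mismatch because the in-flight request and the new append alias the same backing array. It only illustrates why a buffer must not go back to the pool while a request that references it may still be flushed; it does not claim to show where the fix belongs in the client.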
This issue can be easily reproduced in a simulated environment that triggers frequent timeouts (e.g. by lowering the timeouts) while running a producer with high-ish throughput.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)