Bharath Vissapragada created KAFKA-17862:
--------------------------------------------

             Summary: [buffer pool] corruption during buffer reuse from the pool
                 Key: KAFKA-17862
                 URL: https://issues.apache.org/jira/browse/KAFKA-17862
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.7.1
            Reporter: Bharath Vissapragada
         Attachments: client-config.txt

We noticed malformed batches sent by the Kafka Java client to Redpanda under 
certain conditions, which caused excessive client retries. We narrowed it down 
to a client bug: buffers reused from the buffer pool get corrupted. We were 
able to reproduce it against Kafka brokers too, so we are fairly certain the 
bug is on the client.

(Attached the full client config, fwiw)

We narrowed it down to a race condition between produce requests and the 
expiration of failed batches. If the network flush of a produce request races 
with the expiration, the produce batch that the request uses is corrupted, so a 
malformed batch is sent to the broker.

The expiration is triggered by a timeout 
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]

that eventually deallocates the batch
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]

adding it back to the buffer pool
[https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]

At this point the buffer has probably been zeroed out, or a competing producer 
requests a new append that reuses the freed buffer and starts writing to it, 
corrupting its contents.

If the network flush of a produce batch backed by this buffer happens 
concurrently, a corrupt batch is sent to the broker, resulting in a CRC 
mismatch.
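
The sequence above can be sketched deterministically, without threads or the 
real client. This is only an illustration of the suspected ordering, not the 
producer's actual code: ToyPool, simulateRace and the payloads are hypothetical 
names made up for the sketch, and the real BufferPool and Sender logic is more 
involved.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.CRC32C;

final class BufferReuseSketch {

    /** Hypothetical stand-in for a buffer pool: freed buffers are handed out again as-is. */
    static final class ToyPool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        ToyPool(int size) { this.size = size; }

        ByteBuffer allocate() {
            ByteBuffer b = free.pollFirst();
            return b != null ? b : ByteBuffer.allocate(size);
        }

        void deallocate(ByteBuffer b) {
            b.clear();          // resets position/limit, does not copy or detach the bytes
            free.addLast(b);
        }
    }

    /** CRC32C over the first {@code length} bytes of a heap buffer. */
    static long crcOf(ByteBuffer buf, int length) {
        CRC32C crc = new CRC32C();
        crc.update(buf.array(), 0, length);
        return crc.getValue();
    }

    /** Returns true if the bytes read at "flush" time no longer match the checksum taken earlier. */
    static boolean simulateRace() {
        ToyPool pool = new ToyPool(64);

        // 1. Batch A is filled and closed; its checksum is fixed at this point.
        ByteBuffer batchA = pool.allocate();
        byte[] payloadA = "AAAA-records".getBytes(StandardCharsets.UTF_8);
        batchA.put(payloadA);
        long crcAtClose = crcOf(batchA, payloadA.length);

        // 2. The in-flight request keeps a view of the same memory, not a copy.
        ByteBuffer inFlight = batchA.duplicate();

        // 3. Batch A expires and its buffer goes back to the pool.
        pool.deallocate(batchA);

        // 4. A competing append gets the same buffer and starts writing into it.
        ByteBuffer batchB = pool.allocate();
        batchB.put("BBBB".getBytes(StandardCharsets.UTF_8));

        // 5. The network flush for batch A now reads the overwritten bytes.
        long crcOnWire = crcOf(inFlight, payloadA.length);
        return crcAtClose != crcOnWire;
    }

    public static void main(String[] args) {
        System.out.println("checksum mismatch on the wire: " + simulateRace());
    }
}
```

The key point of the sketch is step 2: as long as the request holds a view 
onto pooled memory rather than a copy, returning the buffer to the pool before 
the flush completes lets another writer change the bytes under it.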

This issue can be easily reproduced in a simulated environment that triggers 
frequent timeouts (e.g. by lowering the timeouts) while running a producer 
with high-ish throughput.
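
As a rough starting point for such a setup, the producer timeouts could be 
lowered along these lines. The config keys are standard producer settings, but 
the values are purely illustrative assumptions and are not taken from the 
attached client-config.txt; ReproConfigSketch and its method are hypothetical 
names. Note that the producer requires delivery.timeout.ms to be at least 
linger.ms + request.timeout.ms.

```java
import java.util.Properties;

final class ReproConfigSketch {

    /** Builds producer properties with aggressively low timeouts (illustrative values only). */
    static Properties lowTimeoutProducerProps(String bootstrapServers) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrapServers);
        p.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Short timeouts so that batches expire often while requests are still in flight.
        p.put("linger.ms", "5");
        p.put("request.timeout.ms", "200");
        p.put("delivery.timeout.ms", "300"); // must be >= linger.ms + request.timeout.ms
        return p;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder broker address.
        lowTimeoutProducerProps("localhost:9092")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would then be passed to a KafkaProducer that sends records in 
a tight loop, ideally from several threads, so that freed buffers are reused 
quickly.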



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
