Artem Livshits created KAFKA-19367:
--------------------------------------

             Summary: InitProducerId with TV2 double-increments epoch if ongoing transaction is aborted
                 Key: KAFKA-19367
                 URL: https://issues.apache.org/jira/browse/KAFKA-19367
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 4.0.0
            Reporter: Artem Livshits

When the transaction coordinator handles InitProducerId, it increments the producer epoch (so that stale requests are fenced) and then aborts any transaction that was ongoing at the time. With transaction version 2 (a.k.a. KIP-890 part 2), the abort increments the producer epoch again (this is part of the new abort / commit protocol), so the epoch ends up incremented twice.

In most cases this is benign. However, when the epoch of the ongoing transaction is 32766, the first increment takes it to 32767, the maximum value of a short. The second increment then overflows to a negative value and causes an illegal argument exception.

# The first increment happens [here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L100]
# The second increment happens [here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L195]
# The illegal argument exception is thrown [here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L289]

Most likely the solution is simply not to bump the epoch when we reset the fenced state [here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L572].
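
The arithmetic behind the failure can be reproduced in isolation. The following is a minimal sketch only, not the coordinator code: the class and method names are hypothetical, and the negative-epoch check is an assumption modelled on the description in this issue. It shows that two successive increments of a 16-bit epoch starting at 32766 wrap to a negative value, which the assumed validation then rejects.

```java
// Hypothetical sketch; names and structure are illustrative and do not
// mirror TransactionMetadata.scala or TransactionCoordinator.scala.
final class EpochOverflowSketch {

    // Plain 16-bit increment: the int sum is narrowed back to short,
    // so 32767 + 1 wraps around to -32768.
    static short bump(short epoch) {
        return (short) (epoch + 1);
    }

    // Assumed validation step: a negative epoch is treated as illegal.
    static short validate(short epoch) {
        if (epoch < 0) {
            throw new IllegalArgumentException("Illegal new producer epoch " + epoch);
        }
        return epoch;
    }

    // Models InitProducerId followed by a TV2 abort of the ongoing transaction.
    static short initProducerIdThenAbort(short epoch) {
        short fenced = validate(bump(epoch)); // first increment: fence stale requests
        return validate(bump(fenced));        // second increment: TV2 abort
    }

    public static void main(String[] args) {
        // Benign case: the epoch simply advances by two.
        System.out.println(initProducerIdThenAbort((short) 5));
        try {
            // Edge case from this issue: 32766 -> 32767 -> overflow.
            initProducerIdThenAbort((short) 32766);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model, removing either of the two bumps keeps an epoch of 32766 within the valid range, which is consistent with the proposal to skip the bump when the fenced state is reset.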