Artem Livshits created KAFKA-19367:
--------------------------------------
Summary: InitProducerId with TV2 double-increments epoch if
ongoing transaction is aborted
Key: KAFKA-19367
URL: https://issues.apache.org/jira/browse/KAFKA-19367
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 4.0.0
Reporter: Artem Livshits
When the transaction coordinator handles InitProducerId, it increments the
producer epoch (so that stale requests are fenced) and then aborts any
transaction that was ongoing at that time. With transaction version 2 (a.k.a.
KIP-890 part 2), the abort increments the producer epoch again (this is part
of the new abort / commit protocol), so the epoch ends up incremented twice.
In most cases this is benign, but when the epoch of the ongoing transaction is
32766, the first increment takes it to 32767, which is the maximum value of a
short; the second increment then makes it negative and causes an illegal
argument exception.
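The overflow can be reproduced in isolation with a minimal sketch. This is
not the coordinator's code: the class EpochDoubleBump, the helpers bumpEpoch
and initProducerIdWithOngoingTxn, and the exception message are all
hypothetical and only model the two increments described above.

```java
// Hypothetical model of the double increment; not Kafka's actual implementation.
class EpochDoubleBump {

    // Assumed behaviour: bump the epoch by one and reject a negative result.
    static short bumpEpoch(short epoch) {
        short next = (short) (epoch + 1); // wraps to -32768 when epoch is 32767
        if (next < 0) {
            throw new IllegalArgumentException("Illegal new producer epoch " + next);
        }
        return next;
    }

    // Models InitProducerId arriving while a TV2 transaction is ongoing.
    static short initProducerIdWithOngoingTxn(short epoch) {
        short fenced = bumpEpoch(epoch); // first increment: fence stale requests
        return bumpEpoch(fenced);        // second increment: TV2 abort
    }

    public static void main(String[] args) {
        // Benign case: the epoch simply ends up two higher.
        System.out.println(initProducerIdWithOngoingTxn((short) 10));

        // Edge case from this report: 32766 -> 32767 -> overflow.
        try {
            initProducerIdWithOngoingTxn((short) 32766);
        } catch (IllegalArgumentException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With a starting epoch of 32766 the first bump succeeds and the second one
throws, which matches the failure sequence in this report.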
# First increment happens
[here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L100]
# Second increment happens
[here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L195]
# Illegal argument exception happens
[here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L289]
Most likely the solution would be to just not bump the epoch when we reset the
fenced state
[here|https://github.com/apache/kafka/blob/b1ea280ab1012e857d4c8354fc57951f9c88f667/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L572].