[ https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justine Olshan resolved KAFKA-14402.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

> Transactions Server Side Defense
> --------------------------------
>
>                 Key: KAFKA-14402
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14402
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 3.5.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Major
>             Fix For: 4.0.0
>
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO)
> does not update, we can’t clean the log (if the topic is compacted), and
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues
> or a network partition, the transaction aborts, and then the delayed message
> finally comes in. The delayed message case can also violate EOS if the
> delayed message comes in after the next addPartitionsToTxn request comes in.
> Effectively we may see a message from a previous (aborted) transaction become
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may
> somehow try to write to a partition before it adds the partition to the
> transaction. In both of these cases, we want the server to have some control
> to prevent these incorrect records from being written and either causing
> hanging transactions or violating Exactly once semantics (EOS) by including
> records in the wrong transaction.
> The best way to avoid this issue is to:
> 1. *Uniquely identify transactions by bumping the producer epoch after every
> commit/abort marker. That way, each transaction can be identified by
> (producer id, epoch).*
> 2. *Remove the addPartitionsToTxn call and implicitly just add partitions
> to the transaction on the first produce request during a transaction.*
> We avoid the late arrival case because the transaction is uniquely identified
> and fenced AND we avoid the buggy client case because we remove the need for
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we
> write to a transaction. We can do this by querying the transaction
> coordinator and caching the result.*
>
> See KIP-890 for more information:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
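
The three points in the issue description can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kafka's implementation: ToyCoordinator, ToyPartitionLeader, the new_client flag, and both exception names are hypothetical, and the real broker logic is specified in KIP-890. It shows a late record from a finished transaction being rejected either by epoch fencing (newer clients) or by a cached check against the coordinator (older clients).

```python
from dataclasses import dataclass, field


class FencedError(Exception):
    """The record's epoch is older than the latest epoch seen for the producer."""


class NoOngoingTransactionError(Exception):
    """The coordinator reports no open transaction covering this partition."""


@dataclass
class ToyCoordinator:
    # Hypothetical stand-in for the transaction coordinator.
    epochs: dict = field(default_factory=dict)      # producer id -> current epoch
    open_parts: dict = field(default_factory=dict)  # producer id -> partitions in the open txn

    def add_partition(self, pid, partition):
        self.epochs.setdefault(pid, 0)
        self.open_parts.setdefault(pid, set()).add(partition)

    def end_transaction(self, pid, bump_epoch):
        # Commit or abort. Newer clients bump the epoch (point 1); older ones do not.
        self.open_parts[pid] = set()
        if bump_epoch:
            self.epochs[pid] += 1
        return self.epochs[pid]

    def is_ongoing(self, pid, partition):
        return partition in self.open_parts.get(pid, set())


@dataclass
class ToyPartitionLeader:
    # Hypothetical stand-in for the broker that leads one partition.
    coordinator: ToyCoordinator
    partition: str
    log: list = field(default_factory=list)
    latest_epoch: dict = field(default_factory=dict)  # producer id -> epoch after last marker
    verified: set = field(default_factory=set)        # producer ids already checked (cache)

    def append(self, pid, epoch, value, new_client):
        if epoch < self.latest_epoch.get(pid, 0):
            raise FencedError(f"producer {pid}: stale epoch {epoch}")
        if new_client:
            # Point 2: the partition is added implicitly on the first produce.
            self.coordinator.add_partition(pid, self.partition)
        elif pid not in self.verified:
            # Point 3: ask the coordinator once per transaction, then cache the answer.
            if not self.coordinator.is_ongoing(pid, self.partition):
                raise NoOngoingTransactionError(f"producer {pid}: no open transaction")
            self.verified.add(pid)
        self.log.append((pid, epoch, value))

    def on_marker(self, pid, next_epoch):
        # A commit/abort marker ends the transaction and invalidates the cached check.
        self.latest_epoch[pid] = next_epoch
        self.verified.discard(pid)


def demo():
    coord = ToyCoordinator()
    leader = ToyPartitionLeader(coord, "t-0")
    rejected = []

    # Older client: explicit add, no epoch bump, late record after the marker.
    coord.add_partition(1, "t-0")
    leader.append(1, 0, "a", new_client=False)
    leader.on_marker(1, coord.end_transaction(1, bump_epoch=False))
    try:
        leader.append(1, 0, "late", new_client=False)
    except NoOngoingTransactionError:
        rejected.append("old-client late record")

    # Newer client: implicit add, epoch bumped at the marker, late record is fenced.
    leader.append(2, 0, "x", new_client=True)
    leader.on_marker(2, coord.end_transaction(2, bump_epoch=True))
    try:
        leader.append(2, 0, "late", new_client=True)
    except FencedError:
        rejected.append("new-client late record")
    leader.append(2, 1, "y", new_client=True)
    return leader.log, rejected
```

Calling demo() returns the records that reached the log plus the two late records that were turned away. The cache is keyed by producer id only and is cleared at each marker, which is a simplification chosen to keep the sketch short.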