[
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Justine Olshan resolved KAFKA-14402.
------------------------------------
Fix Version/s: 4.1.0
(was: 4.2.0)
Resolution: Fixed
> Transactions Server Side Defense
> --------------------------------
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
> Issue Type: Improvement
> Affects Versions: 3.5.0
> Reporter: Justine Olshan
> Assignee: Justine Olshan
> Priority: Major
> Fix For: 4.1.0
>
>
> We have seen hanging transactions in Kafka, where the last stable offset (LSO)
> does not advance, the log cannot be cleaned (if the topic is compacted), and
> read_committed consumers get stuck.
> This can happen when a message is delayed by networking issues or a network
> partition, the transaction aborts, and the delayed message then arrives. The
> delayed message can also violate exactly-once semantics (EOS) if it arrives
> after the next addPartitionsToTxn request: a message from a previous
> (aborted) transaction may then become part of the next transaction.
> Hanging transactions can also occur when a buggy client writes to a
> partition before adding that partition to the transaction. In both cases, we
> want the server to be able to prevent these incorrect records from being
> written, so that they neither cause hanging transactions nor violate EOS by
> landing in the wrong transaction.
> The best way to avoid this issue is to:
> 1. *Uniquely identify transactions by bumping the producer epoch after every
> commit/abort marker. That way, each transaction can be identified by
> (producer id, epoch).*
> 2. *Remove the addPartitionsToTxn call and implicitly add partitions to the
> transaction on the first produce request during a transaction.*
> This avoids the late-arrival case because each transaction is uniquely
> identified and fenced, and it avoids the buggy-client case because the client
> no longer needs to explicitly add partitions to begin the transaction.
> Items 1 and 2 require client-side changes, so they do not apply to older
> clients.
> 3. *To cover older clients, we will verify that a transaction is ongoing
> before we write transactional records. We can do this by querying the
> transaction coordinator and caching the result.*
>
> See KIP-890 for more information:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
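
The checks described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka's broker code: the class names (`TxnDefenseSketch`, `Coordinator`, `PartitionLeader`), the method names, and the string cache key are all hypothetical, and the real protocol, error codes, and concurrency handling are omitted. It shows a partition leader that asks the coordinator whether a (producer id, epoch) has an ongoing transaction on the partition, caches a positive answer until the marker is written, and rejects a delayed record whose epoch was bumped away.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every class and method name here is hypothetical, not Kafka's broker code.
class TxnDefenseSketch {

    /** Stand-in for the transaction coordinator: tracks epoch and partitions per producer id. */
    static final class Coordinator {
        private final Map<Long, Integer> epochs = new HashMap<>();
        private final Map<Long, Set<String>> partitions = new HashMap<>();

        /** Explicit add, as older clients do. Returns false for a stale epoch. */
        boolean addPartition(long producerId, int epoch, String partition) {
            int current = epochs.computeIfAbsent(producerId, id -> 0);
            if (epoch != current) {
                return false;
            }
            partitions.computeIfAbsent(producerId, id -> new HashSet<>()).add(partition);
            return true;
        }

        /** Commit or abort: forget the partitions and optionally bump the epoch (item 1). */
        int endTransaction(long producerId, boolean bumpEpoch) {
            partitions.remove(producerId);
            int current = epochs.getOrDefault(producerId, 0);
            int next = bumpEpoch ? current + 1 : current;
            epochs.put(producerId, next);
            return next;
        }

        /** True only if this epoch is current and the partition is part of its transaction. */
        boolean isOngoing(long producerId, int epoch, String partition) {
            Set<String> added = partitions.get(producerId);
            return added != null
                && added.contains(partition)
                && epochs.getOrDefault(producerId, 0) == epoch;
        }
    }

    /** Stand-in for a partition leader that verifies before appending (item 3). */
    static final class PartitionLeader {
        private final String partition;
        private final Coordinator coordinator;
        private final Set<String> verified = new HashSet<>(); // cached "producerId:epoch" keys
        int coordinatorLookups = 0;

        PartitionLeader(String partition, Coordinator coordinator) {
            this.partition = partition;
            this.coordinator = coordinator;
        }

        /** Returns true if the record would be appended, false if it is rejected. */
        boolean tryAppend(long producerId, int epoch) {
            String key = producerId + ":" + epoch;
            if (!verified.contains(key)) {
                coordinatorLookups++;
                if (!coordinator.isOngoing(producerId, epoch, partition)) {
                    return false;
                }
                verified.add(key);
            }
            return true;
        }

        /** The commit/abort marker ends the transaction here, so cached results are dropped. */
        void writeMarker(long producerId) {
            verified.removeIf(key -> key.startsWith(producerId + ":"));
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        PartitionLeader leader = new PartitionLeader("topic-0", coordinator);
        coordinator.addPartition(7L, 0, "topic-0");
        System.out.println("in-transaction write accepted: " + leader.tryAppend(7L, 0));
        coordinator.endTransaction(7L, true); // abort, bumping the epoch
        leader.writeMarker(7L);
        System.out.println("delayed write accepted: " + leader.tryAppend(7L, 0));
    }
}
```

In this model, the cache is safe only because it is cleared when the marker is written. Without the epoch bump, a delayed record that arrives after the next explicit add would still pass verification, which is the case item 1 is meant to close.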