[
https://issues.apache.org/jira/browse/KAFKA-19999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045715#comment-18045715
]
Chia-Ping Tsai commented on KAFKA-19999:
----------------------------------------
{quote}
This is not true for 4.0 and 4.1 right? This change was introduced in 4.2?
https://issues.apache.org/jira/browse/KAFKA-19446
{quote}
You are right! I have updated this ticket.
> Transaction coordinator livelock caused by invalid producer epoch
> -----------------------------------------------------------------
>
> Key: KAFKA-19999
> URL: https://issues.apache.org/jira/browse/KAFKA-19999
> Project: Kafka
> Issue Type: Bug
> Reporter: Chia-Ping Tsai
> Assignee: Chia-Ping Tsai
> Priority: Blocker
> Fix For: 4.2.0
>
>
> *case 1: during recovery*
> When a Transaction Coordinator fails over and reloads transactions in
> PREPARE_COMMIT or PREPARE_ABORT state from the transaction log, it currently
> reuses the logged producer epoch to send the transaction markers.
> In Transaction V2, brokers enforce strict epoch monotonicity for control
> batches (markers), requiring marker_epoch > current_epoch. Consequently, the
> broker rejects the recovery marker with InvalidProducerEpochException.
> {code:java}
> private void checkProducerEpoch(short producerEpoch, long offset, short transactionVersion) {
>     short current = updatedEntry.producerEpoch();
>     boolean invalidEpoch = (transactionVersion >= 2) ? (producerEpoch <= current) : (producerEpoch < current);
>     if (invalidEpoch) {
>         String comparison = (transactionVersion >= 2) ? "<=" : "<";
>         String message = "Epoch of producer " + producerId + " at offset " + offset + " in " + topicPartition +
>                 " is " + producerEpoch + ", which is " + comparison + " the last seen epoch " + current +
>                 " (TV" + transactionVersion + ")";
>         if (origin == AppendOrigin.REPLICATION) {
>             log.warn(message);
>         } else {
>             // Starting from 2.7, we replaced ProducerFenced error with InvalidProducerEpoch in the
>             // producer send response callback to differentiate from the former fatal exception,
>             // letting client abort the ongoing transaction and retry.
>             throw new InvalidProducerEpochException(message);
>         }
>     }
> }
> {code}
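> To make the TV2 rule concrete, here is a minimal, self-contained sketch (illustrative only, not Kafka code; the class and variable names are hypothetical) that reproduces the comparison above and shows why a recovery marker reusing the logged producer epoch is accepted under TV1 but rejected under TV2:
> {code:java}
> public class Tv2EpochCheckSketch {
>     // Same comparison as checkProducerEpoch above: under TV2 a control batch must carry
>     // an epoch strictly greater than the last seen epoch; under TV1 equality is allowed.
>     static boolean isInvalidEpoch(short producerEpoch, short current, short transactionVersion) {
>         return (transactionVersion >= 2) ? (producerEpoch <= current) : (producerEpoch < current);
>     }
>
>     public static void main(String[] args) {
>         short loggedEpoch = 5;   // producer epoch stored with the PREPARE_COMMIT record
>         short lastSeenEpoch = 5; // partition leader already saw this epoch from data batches
>
>         // TV1: reusing the logged epoch for the recovery marker is accepted.
>         System.out.println("TV1 rejects marker: " + isInvalidEpoch(loggedEpoch, lastSeenEpoch, (short) 1)); // false
>
>         // TV2: the same marker is rejected, which surfaces as InvalidProducerEpochException.
>         System.out.println("TV2 rejects marker: " + isInvalidEpoch(loggedEpoch, lastSeenEpoch, (short) 2)); // true
>     }
> }
> {code}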
> The coordinator handles this error by removing the pending transaction from
> memory without writing a COMPLETE state to the transaction log. This leaves
> the transaction permanently hanging in the PREPARE state. Clients attempting
> to continue or commit offsets subsequently fail with CONCURRENT_TRANSACTIONS
> in an infinite retry loop.
> {code:scala}
> case Errors.INVALID_PRODUCER_EPOCH | Errors.TRANSACTION_COORDINATOR_FENCED => // producer or coordinator epoch has changed, this txn can now be ignored
>   info(s"Sending $transactionalId's transaction marker for partition $topicPartition has permanently failed with error ${error.exceptionName} " +
>     s"with the current coordinator epoch ${epochAndMetadata.coordinatorEpoch}; cancel sending any more transaction markers $txnMarker to the brokers")
>
>   txnMarkerChannelManager.removeMarkersForTxn(pendingCompleteTxn)
>   abortSending = true
> {code}
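> The net effect is a livelock. As an illustration only (a hypothetical sketch, not Kafka code), the loop below mimics a client that keeps receiving CONCURRENT_TRANSACTIONS because nothing ever completes the pending transaction:
> {code:java}
> public class LivelockSketch {
>     enum TxnState { PREPARE_COMMIT, COMPLETE_COMMIT }
>     enum Error { NONE, CONCURRENT_TRANSACTIONS }
>
>     // Stand-in for the coordinator's end-transaction handling: a transaction that is
>     // still completing (PREPARE_*) cannot accept a new request from the client.
>     static Error handleEndTxn(TxnState state) {
>         return state == TxnState.PREPARE_COMMIT ? Error.CONCURRENT_TRANSACTIONS : Error.NONE;
>     }
>
>     public static void main(String[] args) {
>         TxnState state = TxnState.PREPARE_COMMIT; // markers were cancelled, so the state never advances
>
>         // Client-side retry loop, bounded here only so the sketch terminates.
>         for (int attempt = 1; attempt <= 5; attempt++) {
>             Error error = handleEndTxn(state);
>             System.out.println("attempt " + attempt + ": " + error);
>             if (error == Error.NONE) {
>                 break; // never reached: no COMPLETE record is ever written
>             }
>         }
>     }
> }
> {code}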
> *case 2: due to disconnection*
> If a WriteTxnMarkersRequest is persisted by a partition leader but the
> response is lost (e.g., due to disconnection), the Transaction Coordinator
> (TC) retries the request using the same producer epoch.
> If the target broker restarts and restores its state before receiving the
> retry, it rejects the request with InvalidProducerEpochException.
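> As an illustration of case 2 (again a hypothetical sketch, not Kafka code), the sequence below shows the first persisted marker advancing the partition's last seen epoch, after which the retried marker with the same epoch fails the TV2 check:
> {code:java}
> public class MarkerRetrySketch {
>     // TV2 rule for control batches: the marker epoch must be strictly greater
>     // than the last seen epoch on the partition.
>     static boolean acceptMarkerTv2(short markerEpoch, short lastSeenEpoch) {
>         return markerEpoch > lastSeenEpoch;
>     }
>
>     public static void main(String[] args) {
>         short lastSeen = 4;     // epoch of the producer's data batches on the partition
>         short markerEpoch = 5;  // epoch used by the coordinator for the marker
>
>         // First attempt: persisted by the leader, so the last seen epoch advances to 5,
>         // but the response is lost due to a disconnection.
>         System.out.println("first attempt accepted: " + acceptMarkerTv2(markerEpoch, lastSeen)); // true
>         lastSeen = markerEpoch;
>
>         // The broker restarts, restores last seen = 5 from the log, and then receives the
>         // retry with the same epoch, which is rejected with InvalidProducerEpochException.
>         System.out.println("retry accepted: " + acceptMarkerTv2(markerEpoch, lastSeen)); // false
>     }
> }
> {code}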
--
This message was sent by Atlassian Jira
(v8.20.10#820010)