Jason Gustafson created KAFKA-9803:
--------------------------------------

             Summary: Allow producers to recover gracefully from transaction timeouts
                 Key: KAFKA-9803
                 URL: https://issues.apache.org/jira/browse/KAFKA-9803
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jason Gustafson


Transaction timeouts are detected by the transaction coordinator. When the 
coordinator detects a timeout, it bumps the producer epoch and aborts the 
transaction. The epoch bump is necessary to prevent the current producer from 
writing to a new transaction that was not started through the coordinator.

Transactions may also be aborted if a new producer with the same 
`transactional.id` starts up. Similarly, this results in an epoch bump. 
Currently, the coordinator does not distinguish between these two cases. Both 
surface to the producer as a `ProducerFencedException`, which means the 
producer needs to shut itself down. 

We can improve this with the new APIs from KIP-360. When the coordinator times 
out a transaction, it can remember that fact and allow the existing producer to 
claim the bumped epoch and continue. Roughly, the logic would work as follows:

1. When a transaction times out, set lastProducerEpoch to the current epoch and 
do the normal bump.
2. Any transactional requests from the old epoch result in a new 
TRANSACTION_TIMED_OUT error code, which is propagated to the application.
3. The producer recovers by sending InitProducerId with the current epoch. The 
coordinator returns the bumped epoch.
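
The three steps above can be sketched as a toy in-memory model. This is only 
an illustration under stated assumptions, not the coordinator's actual code: 
the class, the `timedOut` flag, the `FENCED` result, and all method names are 
hypothetical. Only `lastProducerEpoch`, `InitProducerId`, and 
`TRANSACTION_TIMED_OUT` come from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed coordinator logic. Every name here is
// hypothetical; this is not Kafka's actual implementation.
final class TimeoutRecoverySketch {
    enum Error { NONE, TRANSACTION_TIMED_OUT, FENCED }

    static final short NO_EPOCH = -1;

    // Assumed per-transactional.id state kept by the coordinator.
    static final class TxnState {
        short producerEpoch;
        short lastProducerEpoch = NO_EPOCH;
        boolean timedOut; // assumed flag: "remember that a timeout happened"
        TxnState(short producerEpoch) { this.producerEpoch = producerEpoch; }
    }

    static final class InitResult {
        final Error error;
        final short epoch;
        InitResult(Error error, short epoch) { this.error = error; this.epoch = epoch; }
    }

    private final Map<String, TxnState> states = new HashMap<>();

    void register(String transactionalId, short epoch) {
        states.put(transactionalId, new TxnState(epoch));
    }

    // Step 1: remember the current epoch, then do the normal bump.
    void onTransactionTimeout(String transactionalId) {
        TxnState s = states.get(transactionalId);
        s.lastProducerEpoch = s.producerEpoch;
        s.producerEpoch = (short) (s.producerEpoch + 1);
        s.timedOut = true;
    }

    // A new producer with the same transactional.id: bump without the timeout flag.
    void onNewProducerInstance(String transactionalId) {
        TxnState s = states.get(transactionalId);
        s.lastProducerEpoch = NO_EPOCH;
        s.producerEpoch = (short) (s.producerEpoch + 1);
        s.timedOut = false;
    }

    // Step 2: requests from the timed-out epoch get TRANSACTION_TIMED_OUT.
    Error validate(String transactionalId, short requestEpoch) {
        TxnState s = states.get(transactionalId);
        if (requestEpoch == s.producerEpoch) return Error.NONE;
        if (s.timedOut && requestEpoch == s.lastProducerEpoch) return Error.TRANSACTION_TIMED_OUT;
        return Error.FENCED;
    }

    // Step 3: the timed-out producer sends its current epoch and claims the bumped one.
    InitResult initProducerId(String transactionalId, short currentEpoch) {
        TxnState s = states.get(transactionalId);
        if (s.timedOut && currentEpoch == s.lastProducerEpoch) {
            s.timedOut = false;
            return new InitResult(Error.NONE, s.producerEpoch);
        }
        return new InitResult(Error.FENCED, NO_EPOCH);
    }
}
```

In this model, a producer at epoch 5 whose transaction times out sees 
TRANSACTION_TIMED_OUT on its next transactional request and can then claim 
epoch 6. A producer displaced by a new instance is still fenced.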

One issue that needs to be addressed is how to handle INVALID_PRODUCER_EPOCH 
from Produce requests. Partition leaders generally will not know whether a 
bumped epoch was the result of a timed-out transaction or a fenced producer. 
Possibly the producer can treat these errors as abortable when they come from 
Produce responses. In that case, the user would try to abort the transaction, 
and the coordinator's response would show whether the bump was due to a 
timeout or to fencing.
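
One way to picture the producer-side handling suggested above is a small 
classification function. This is a hedged sketch only: the `Source`, `Action`, 
and `classify` names are hypothetical, as is the `FENCED` code standing in for 
whatever the coordinator returns to a fenced producer. It assumes that only 
the coordinator can tell a timeout from fencing.

```java
// Hypothetical producer-side error classification; not the Kafka client's code.
final class ProduceErrorSketch {
    enum Source { PRODUCE_RESPONSE, COORDINATOR_RESPONSE }
    enum Error { INVALID_PRODUCER_EPOCH, TRANSACTION_TIMED_OUT, FENCED }
    enum Action { ABORT_TRANSACTION, REINITIALIZE_EPOCH, CLOSE_PRODUCER }

    static Action classify(Source source, Error error) {
        switch (error) {
            case INVALID_PRODUCER_EPOCH:
                // A partition leader cannot tell why the epoch was bumped, so the
                // error is treated as abortable; the abort attempt then reaches
                // the coordinator, which knows the cause.
                return source == Source.PRODUCE_RESPONSE
                        ? Action.ABORT_TRANSACTION
                        : Action.CLOSE_PRODUCER;
            case TRANSACTION_TIMED_OUT:
                // The coordinator confirmed a timeout: claim the bumped epoch.
                return Action.REINITIALIZE_EPOCH;
            default:
                // Genuinely fenced by a newer producer instance.
                return Action.CLOSE_PRODUCER;
        }
    }
}
```

Under this sketch, the abort attempt that follows a Produce-side 
INVALID_PRODUCER_EPOCH would itself return either TRANSACTION_TIMED_OUT 
(recoverable) or a fencing error (fatal).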



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
