Jason Gustafson created KAFKA-7866:
--------------------------------------

             Summary: Duplicate offsets after transaction index append failure
                 Key: KAFKA-7866
                 URL: https://issues.apache.org/jira/browse/KAFKA-7866
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson


We have encountered a situation in which an ABORT marker was written 
successfully to the log, but the corresponding append to the transaction index 
failed. Because of the failure, the log end offset was not incremented, so the 
next append reused the same offset and produced a duplicate. The broker was 
using JBOD, and we would normally expect an IOException to cause the log 
directory to be marked as failed. That did not seem to happen here, and the 
duplicates continued for several hours.
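
To make the sequence concrete, here is a toy model of the failure described 
above. It is only a sketch: ToyLog, failIndexAppend and the three-step ordering 
inside append() are hypothetical names and simplifications for illustration, 
not Kafka's actual Log implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the failure sequence; not Kafka's actual Log class.
class ToyLog {
    final List<Long> offsets = new ArrayList<>(); // offsets of records written to the log
    long logEndOffset = 0L;                        // next offset to assign
    boolean failIndexAppend = false;               // simulates the transaction index failure

    void append(boolean isAbortMarker) {
        long offset = logEndOffset;
        offsets.add(offset);                       // step 1: the record reaches the log
        if (isAbortMarker && failIndexAppend) {
            // Step 2 (index append) fails with a non-IO exception,
            // so step 3 below never runs for this record.
            throw new IllegalStateException("transaction index append failed");
        }
        logEndOffset = offset + 1;                 // step 3: advance the log end offset
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append(false);                         // offset 0
        log.failIndexAppend = true;
        try {
            log.append(true);                      // ABORT marker written at offset 1, then failure
        } catch (IllegalStateException e) {
            // The log directory is not failed, so appends continue.
        }
        log.failIndexAppend = false;
        log.append(false);                         // reuses offset 1
        System.out.println(log.offsets);           // prints [0, 1, 1]
    }
}
```

In this model the duplicate appears because the record is already in the log 
when the exception is thrown, while the log end offset still points at it.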

Unfortunately, we are not sure what the cause of the failure was. 
Significantly, the first duplicate was also the first ABORT marker in the log. 
Unlike the offset and timestamp indexes, the transaction index is created on 
demand after the first aborted transaction. It is likely that the attempt to 
create and open the transaction index failed. There is some suggestion that the 
process may have bumped into the open file limit. Whatever the problem was, it 
also prevented log collection, so we cannot 

Without knowing the underlying cause, we can still consider some potential 
improvements:

1. We probably should be catching non-IO exceptions in the append process. If 
the append to one of the indexes fails, we could either truncate the log or 
rethrow the exception as an IOException to ensure that the log directory is no 
longer used.
2. Even without the unexpected exception, there is a small window during which 
even an IOException could lead to duplicate offsets. Marking a log directory 
offline is an asynchronous operation and there is no guarantee that another 
append cannot happen first. Given this, we probably need to detect and truncate 
duplicates during the log recovery process.
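
The two ideas above could be sketched roughly as follows. This is an 
illustration under stated assumptions, not a proposed patch: AppendSafety, 
IndexAppend, appendOrFailAsIo and firstNonIncreasing are hypothetical names, 
and the sketch assumes that the caller already treats an IOException as a 
reason to fail the log directory.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helpers illustrating the two improvements; not Kafka code.
class AppendSafety {
    /** Stand-in for an index append that may throw any exception. */
    interface IndexAppend {
        void run() throws Exception;
    }

    // Improvement 1: surface every index append failure as an IOException
    // so that the caller's existing disk-failure handling applies.
    static void appendOrFailAsIo(IndexAppend append) throws IOException {
        try {
            append.run();
        } catch (IOException e) {
            throw e;                               // already the expected type
        } catch (Exception e) {
            throw new IOException("index append failed", e);
        }
    }

    // Improvement 2: during recovery, find the position of the first record
    // whose offset is not strictly greater than its predecessor's.
    // Returns -1 if the offsets are strictly increasing.
    static int firstNonIncreasing(List<Long> offsets) {
        for (int i = 1; i < offsets.size(); i++) {
            if (offsets.get(i) <= offsets.get(i - 1)) {
                return i;                          // candidate truncation point
            }
        }
        return -1;
    }
}
```

For example, firstNonIncreasing applied to the offsets 0, 1, 1, 2 returns 2, 
the position from which recovery would truncate in this sketch.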






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
