[ https://issues.apache.org/jira/browse/KAFKA-9840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-9840.
------------------------------------
    Fix Version/s: 2.6.0
       Resolution: Fixed

> Consumer should not use OffsetForLeaderEpoch without current epoch validation
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-9840
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9840
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.4.1
>            Reporter: Jason Gustafson
>            Assignee: Boyang Chen
>            Priority: Major
>             Fix For: 2.6.0
>
>
> We have observed a case where the consumer attempted to detect truncation
> with the OffsetsForLeaderEpoch API against a broker which had become a
> zombie. In this case, the last epoch known to the consumer was higher than
> the last epoch known to the zombie broker, so the broker returned -1 as both
> the end offset and epoch in the response. The consumer did not check for this
> in the response, which resulted in the following message:
> {code}
> Truncation detected for partition topic-1 at offset
> FetchPosition{offset=11859, offsetEpoch=Optional[46],
> currentLeader=LeaderAndEpoch{leader=broker-host (id: 3 rack: null),
> epoch=-1}}, resetting offset to the first offset known to diverge
> FetchPosition{offset=-1, offsetEpoch=Optional[-1],
> currentLeader=LeaderAndEpoch{broker-host (id: 3 rack: null), epoch=-1}}
> (org.apache.kafka.clients.consumer.internals.SubscriptionState:414)
> {code}
> There are a couple of ways the consumer can handle this situation better.
> First, the reason we did not detect the zombie broker is that we did not
> include the current leader epoch in the OffsetsForLeaderEpoch request. This
> was likely because of KAFKA-9212. Following this patch, we would not
> initialize the current leader epoch from metadata responses because there are
> cases in which we cannot rely on it. But if the client cannot rely on being
> able to detect zombies, then the epoch validation is less useful anyway. So the
> simple solution is to not bother with the validation unless we have a
> reliable current leader epoch.
> Second, the consumer needs to check for the case when the returned offset and
> epoch are not defined. In this case, we have to treat this as a normal
> OffsetOutOfRange case and invoke the reset policy.
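
For illustration only, the two changes proposed in the description could be sketched roughly as below. This is not the patch that resolved the ticket: the class, enum, method, and constant names are all hypothetical, the -1 sentinels are assumed from the values the zombie broker returned in the log above, and the epoch 47 in the usage is made up.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names are taken from the Kafka client code.
class EpochValidationSketch {
    // Assumed sentinels, mirroring the -1 end offset and epoch in the log message.
    static final int UNDEFINED_EPOCH = -1;
    static final long UNDEFINED_EPOCH_OFFSET = -1L;

    enum Action { SKIP_VALIDATION, CONTINUE_FETCHING, TRUNCATE_TO_END_OFFSET, RESET_BY_POLICY }

    /**
     * Decides what to do with a fetch position.
     * currentLeaderEpoch is empty when the client has no reliable leader epoch.
     * endEpoch and endOffset stand in for the values from the broker response;
     * a real client would not send the request at all in the SKIP_VALIDATION case.
     */
    static Action decide(Optional<Integer> currentLeaderEpoch,
                         long fetchOffset, int endEpoch, long endOffset) {
        // First change: no reliable current leader epoch, so do not validate.
        if (!currentLeaderEpoch.isPresent()) {
            return Action.SKIP_VALIDATION;
        }
        // Second change: an undefined offset or epoch is not a truncation point;
        // treat it like OffsetOutOfRange and fall back to the reset policy.
        if (endEpoch == UNDEFINED_EPOCH || endOffset == UNDEFINED_EPOCH_OFFSET) {
            return Action.RESET_BY_POLICY;
        }
        // A defined end offset below the fetch position indicates truncation.
        if (endOffset < fetchOffset) {
            return Action.TRUNCATE_TO_END_OFFSET;
        }
        return Action.CONTINUE_FETCHING;
    }

    public static void main(String[] args) {
        // Position from the log above, with no reliable current leader epoch.
        System.out.println(decide(Optional.empty(), 11859L, -1, -1L));
        // Same position with a (made-up) known epoch and an undefined response.
        System.out.println(decide(Optional.of(47), 11859L, -1, -1L));
    }
}
```

With checks along these lines, the position in the log above would never be rewritten to offset=-1: either validation is skipped, or the undefined response is routed to the reset policy.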