We identified a bug/new behaviour that can leave consumers lagging for a long time, with ListOffsets requests failing during that time frame.
While the ListOffsets request failures are expected and were introduced by KIP-207 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-207%3A+Offsets+returned+by+ListOffsetsResponse+should+be+monotonically+increasing+even+during+a+partition+leader+change>, the problematic behaviour is really the inability to increment the high watermark and the resulting lagging consumers.

Here is the situation:
- We have a topic with min.isr=2
- We have a partition with replicas on brokers 16, 17 and 18
- The leader for this partition is broker 17

1. Broker 18 failed. The partition has 2 ISRs.
2. Broker 16 failed. The partition has 1 ISR (17).
3. Broker 17 has a LEO higher than the HWM:

    [Broker id=17] Leader topic-86 with topic id Some(yFhPOnPsRDiYHgfF2bR2aQ) starts at leader epoch 7 from offset 3067193660 with partition epoch 11, high watermark 3067191497, ISR [10017], adding replicas [] and removing replicas [] (under-min-isr). Previous leader Some(10017) and previous leader epoch was 6.

At this point producers cannot produce to the topic-86 partition because there is only one ISR, which is expected behaviour. But it seems that KIP-207 prevents answering ListOffsets requests here <https://github.com/apache/kafka/blob/91284d8d7b38d350a63b4086d2f12918e5bc31dc/core/src/main/scala/kafka/cluster/Partition.scala#L1595-L1604>:

    // Only consider throwing an error if we get a client request (isolationLevel is defined) and the high watermark
    // is lagging behind the start offset
    val maybeOffsetsError: Option[ApiException] = leaderEpochStartOffsetOpt
      .filter(epochStart => isolationLevel.isDefined && epochStart > localLog.highWatermark)
      .map(epochStart => Errors.OFFSET_NOT_AVAILABLE.exception(s"Failed to fetch offsets for " +
        s"partition $topicPartition with leader $epochLogString as this partition's " +
        s"high watermark (${localLog.highWatermark}) is lagging behind the " +
        s"start offset from the beginning of this epoch ($epochStart)."))

It seems that the path that leads to the HWM being stuck for so long was introduced in preparation for KIP-966 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas>; see this ticket <https://issues.apache.org/jira/browse/KAFKA-15583> and PR <https://github.com/apache/kafka/pull/14594>. A simplified sketch of how the two mechanisms interact is in the P.S. below.

As a result:
- The stuck HWM in the above scenario also means that a small range of messages is not readable by consumers, even though it was readable in the past.
- In case of truncation, the HWM might still go backwards. This is still possible even with min.isr, although it should be rare.

Regards,
F.
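
P.S. To make the interaction between the two mechanisms easier to follow, here is a deliberately simplified, self-contained Scala sketch. This is not the actual broker code: the case class, the field names and the exact advancement condition are illustrative only, based on our reading of the ticket and KIP linked above.

    import scala.math.max

    // Deliberately simplified model of the partition state; none of these
    // names come from the Kafka code base.
    final case class PartitionModel(
        isr: Set[Int],               // current in-sync replica ids
        leo: Map[Int, Long],         // log end offset per in-sync replica
        highWatermark: Long,
        leaderEpochStartOffset: Long,
        minInsyncReplicas: Int)

    object StuckHwmModel {

      // Rough model of the KAFKA-15583 change (preparation for KIP-966): the
      // leader does not advance the high watermark while the ISR is smaller
      // than min.insync.replicas; otherwise it advances it up to the minimum
      // LEO of the in-sync replicas.
      def maybeAdvanceHighWatermark(state: PartitionModel): PartitionModel =
        if (state.isr.size < state.minInsyncReplicas) state
        else state.copy(highWatermark = max(state.highWatermark, state.isr.map(state.leo).min))

      // Rough model of the KIP-207 check quoted above: ListOffsets is rejected
      // with OFFSET_NOT_AVAILABLE while the high watermark lags behind the
      // start offset of the current leader epoch.
      def canServeListOffsets(state: PartitionModel): Boolean =
        state.highWatermark >= state.leaderEpochStartOffset

      def main(args: Array[String]): Unit = {
        // Numbers taken from the log line above: broker 17 becomes leader
        // again with epoch start offset 3067193660, HWM 3067191497 and the
        // ISR reduced to a single replica.
        val state = PartitionModel(
          isr = Set(17),
          leo = Map(17 -> 3067193660L),
          highWatermark = 3067191497L,
          leaderEpochStartOffset = 3067193660L,
          minInsyncReplicas = 2)

        val after = maybeAdvanceHighWatermark(state)
        println(s"HWM after attempted advance: ${after.highWatermark}") // still 3067191497
        println(s"ListOffsets served: ${canServeListOffsets(after)}")   // false
      }
    }

With the numbers from the log line above, the model shows the HWM cannot advance while the ISR stays below min.isr, and ListOffsets therefore keeps being rejected until another replica rejoins the ISR.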