We identified a bug / new behavior that can leave consumers lagging for a
long time and cause ListOffsets requests to fail during that time frame.

While the ListOffsets request failures are expected and were introduced by
KIP-207
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-207%3A+Offsets+returned+by+ListOffsetsResponse+should+be+monotonically+increasing+even+during+a+partition+leader+change>,
the problematic behavior is really the inability to increment the high
watermark and the resulting lagging consumers.


Here is the situation:

   - We have a topic with min.isr=2 (see the sketch after this list)
   - We have a partition on brokers 16, 17 and 18
   - Leader for this partition is broker 17
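For context, here is a minimal sketch of how such a topic can be created;
the broker address and topic name are illustrative, not the real ones from
our cluster:

import java.util.Properties

import org.apache.kafka.clients.admin.{Admin, NewTopic}

object CreateTopicSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumption: local test cluster

  val admin = Admin.create(props)
  try {
    // Replication factor 3 (e.g. brokers 16, 17 and 18) with min.insync.replicas=2,
    // matching the setup described above.
    val topic = new NewTopic("my-topic", 1, 3.toShort)
      .configs(java.util.Collections.singletonMap("min.insync.replicas", "2"))
    admin.createTopics(java.util.Collections.singletonList(topic)).all().get()
  } finally {
    admin.close()
  }
}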




Then the following sequence of events happened:

   1. Broker 18 failed. Partition has 2 ISRs.
   2. Broker 16 failed. Partition has 1 ISR (17).
   3. Broker 17 has a LEO higher than the HWM:

[Broker id=17] Leader topic-86 with topic id Some(yFhPOnPsRDiYHgfF2bR2aQ)
starts at leader epoch 7 from offset 3067193660 with partition epoch 11,
high watermark 3067191497, ISR [10017], adding replicas [] and removing
replicas [] (under-min-isr). Previous leader Some(10017) and previous
leader epoch was 6.

At this point producers cannot produce to topic-86 partition because there
is only one ISR, which is expected behavior.
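As a sanity check, here is a minimal producer sketch of what that looks like
from the client side (broker address and topic name are illustrative): with
acks=all and fewer in-sync replicas than min.insync.replicas the send is
rejected with NotEnoughReplicasException.

import java.util.Properties
import java.util.concurrent.ExecutionException

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.errors.NotEnoughReplicasException
import org.apache.kafka.common.serialization.StringSerializer

object ProduceCheck extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumption: local test cluster
  props.put("acks", "all")                         // min.insync.replicas only applies with acks=all
  props.put("enable.idempotence", "false")         // allow retries=0 so the example fails fast
  props.put("retries", "0")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  try {
    // With only one in-sync replica and min.insync.replicas=2, the broker rejects the write.
    producer.send(new ProducerRecord[String, String]("my-topic", "key", "value")).get()
  } catch {
    case e: ExecutionException if e.getCause.isInstanceOf[NotEnoughReplicasException] =>
      println(s"Rejected as expected: ${e.getCause.getMessage}")
  } finally {
    producer.close()
  }
}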

But it seems that KIP-207 prevents answering ListOffsets requests here:

<https://github.com/apache/kafka/blob/91284d8d7b38d350a63b4086d2f12918e5bc31dc/core/src/main/scala/kafka/cluster/Partition.scala#L1595-L1604>

// Only consider throwing an error if we get a client request (isolationLevel is defined) and the high watermark
// is lagging behind the start offset
val maybeOffsetsError: Option[ApiException] = leaderEpochStartOffsetOpt
  .filter(epochStart => isolationLevel.isDefined && epochStart > localLog.highWatermark)
  .map(epochStart => Errors.OFFSET_NOT_AVAILABLE.exception(s"Failed to fetch offsets for " +
    s"partition $topicPartition with leader $epochLogString as this partition's " +
    s"high watermark (${localLog.highWatermark}) is lagging behind the " +
    s"start offset from the beginning of this epoch ($epochStart)."))


It seems that the code path that leaves the HWM stuck for so long was
introduced in preparation for KIP-966
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas>,
see this ticket <https://issues.apache.org/jira/browse/KAFKA-15583> and PR
<https://github.com/apache/kafka/pull/14594>.
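If I read that change correctly, the leader now additionally requires the
ISR to satisfy min.insync.replicas before advancing the HWM. A simplified
sketch of that condition as I understand it (identifiers are illustrative,
this is not the actual Partition.scala code):

// Simplified sketch of the HWM advancement rule as I understand it after
// KAFKA-15583 / PR 14594; names are illustrative, not the real code.
def canAdvanceHighWatermark(newHighWatermark: Long,
                            currentHighWatermark: Long,
                            isrSize: Int,
                            minInSyncReplicas: Int): Boolean = {
  // Previously it was enough for the candidate HWM (derived from the ISR's
  // log end offsets) to move forward.
  val offsetMovesForward = newHighWatermark > currentHighWatermark
  // The additional requirement: the ISR must also satisfy min.insync.replicas.
  // With min.isr=2 and a single ISR member, this keeps the HWM stuck even
  // though the leader's LEO is ahead of it.
  val isrSatisfiesMinIsr = isrSize >= minInSyncReplicas
  offsetMovesForward && isrSatisfiesMinIsr
}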

As a result:

   - The stuck HWM in the above scenario can also mean that a small part of
   the messages isn't readable by consumers even though it was in the past.
   - In case of truncation, the HWM might still go backwards. This is still
   possible even with min.ISR, although it should be rare.



Regards, F.
