Hi, I have an unusual situation: a cluster running Kafka 3.5.1 in Strimzi where 4 of the __consumer_offsets partitions have dropped under min ISR.
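They show up with kafka-topics.sh's under-min-ISR filter, i.e. something along these lines (the bootstrap address here is just a placeholder for our internal listener):

    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
        --describe --under-min-isr-partitions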
Everything else appears to be working fine. Upon investigating, I've found that the partition followers appear to be out of sync with the leader in terms of leader epoch. For example, the leader-epoch-checkpoint file on the leader is:

    0
    4
    0 0
    1 4
    4 6
    27 10

while on the followers it is:

    0
    5
    0 0
    1 4
    4 6
    5 7
    6 9

which looks to me like the followers are two elections ahead of the leader, and I'm not sure how they got into this situation.

I've attempted to force a new leader election via kafka-leader-election.sh, but it refused for both PREFERRED and UNCLEAN. I've also tried a manual partition reassignment to move the leader to another broker, but it won't do it. What is even stranger is that if I watch the leader-epoch-checkpoint file on one of the followers, I can see it constantly changing as it tries to sort itself out:

    [kafka@internal-001-kafka-0 __consumer_offsets-18]$ cat leader-epoch-checkpoint
    0
    3
    0 0
    1 4
    4 6
    [kafka@internal-001-kafka-0 __consumer_offsets-18]$ cat leader-epoch-checkpoint
    0
    5
    0 0
    1 4
    4 6
    5 7
    6 9

I have tried manually removing the follower's partition files on disk in an attempt to get it to sync from the leader, but it keeps returning to the inconsistent state. Restarting the broker that holds the partition leader doesn't seem to move leadership either.

The follower keeps logging the following constantly:

    2024-03-19 09:23:11,169 INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Truncating partition __consumer_offsets-18 with TruncationState(offset=7, completed=true) due to leader epoch and offset EpochEndOffset(errorCode=0, partition=18, leaderEpoch=4, endOffset=10) (kafka.server.ReplicaFetcherThread) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,169 INFO [UnifiedLog partition=__consumer_offsets-18, dir=/var/lib/kafka/data-0/kafka-log2] Truncating to offset 7 (kafka.log.UnifiedLog) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,174 INFO [UnifiedLog partition=__consumer_offsets-18, dir=/var/lib/kafka/data-0/kafka-log2] Loading producer state till offset 7 with message format version 2 (kafka.log.UnifiedLog$) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,174 INFO [UnifiedLog partition=__consumer_offsets-18, dir=/var/lib/kafka/data-0/kafka-log2] Reloading from producer snapshot and rebuilding producer state from offset 7 (kafka.log.UnifiedLog$) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,174 INFO [ProducerStateManager partition=__consumer_offsets-18] Loading producer state from snapshot file 'SnapshotFile(offset=7, file=/var/lib/kafka/data-0/kafka-log2/__consumer_offsets-18/00000000000000000007.snapshot)' (org.apache.kafka.storage.internals.log.ProducerStateManager) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,175 INFO [UnifiedLog partition=__consumer_offsets-18, dir=/var/lib/kafka/data-0/kafka-log2] Producer state recovery took 1ms for snapshot load and 0ms for segment recovery from offset 7 (kafka.log.UnifiedLog$) [ReplicaFetcherThread-0-1]
    2024-03-19 09:23:11,175 WARN [UnifiedLog partition=__consumer_offsets-18, dir=/var/lib/kafka/data-0/kafka-log2] Non-monotonic update of high watermark from (offset=10 segment=[0:4083]) to (offset=7 segment=[0:3607]) (kafka.log.UnifiedLog) [ReplicaFetcherThread-0-1]

Any ideas of how to look at this further?

Thanks,
Karl
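P.S. For reference, the election and reassignment attempts were along these lines (the broker IDs, bootstrap address, and file path below are illustrative, not our exact values):

    # Preferred leader election for the stuck partition (also tried UNCLEAN)
    bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
        --election-type PREFERRED --topic __consumer_offsets --partition 18

    # Reassignment putting a different broker first in the replica list,
    # so it becomes the preferred leader for the partition
    cat > /tmp/reassign-18.json <<'EOF'
    {"version":1,"partitions":[{"topic":"__consumer_offsets","partition":18,"replicas":[2,0,1]}]}
    EOF
    bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
        --reassignment-json-file /tmp/reassign-18.json --execute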