Today we did a rolling restart of ZK. We also restarted the Kafka controller, and the ISRs are still not being updated in ZK. Again, the cluster seems fine and the replicas in question do appear to be getting updated. I am guessing there must be some bad state persisted in ZK.
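In case it helps, this is roughly how we have been inspecting the partition state persisted in ZK (the znode path assumes the standard 0.8 layout with no chroot; zkhost:2181 is a placeholder for our ensemble):

bin/zookeeper-shell.sh zkhost:2181
get /brokers/topics/t1/partitions/33/state

The data at that znode is JSON with leader, leader_epoch and isr fields, and for [t1,33] we would expect it to still show isr as [1], matching the kafka-list-topic output quoted below. If I am reading the log message correctly, the dataVersion reported in the znode's stat is what the leader's "Cached zkVersion [17]" check is comparing against.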
On 5/17/14 7:50 PM, "Shone Sadler" <shone.sad...@gmail.com> wrote:

>Hi Jun,
>
>I work with Paul and am monitoring the cluster as well. The status has
>not changed.
>
>When we execute kafka-list-topic we are seeing the following (showing one
>of the two partitions having the problem):
>
>topic: t1 partition: 33 leader: 1 replicas: 1,2,3 isr: 1
>
>When inspecting the logs of the leader, I do see a spurt of ISR
>shrinkage/expansion around the time that the brokers were partitioned from
>ZK, but nothing past the last message "Cached zkVersion [17] not equal to
>that in zookeeper" from yesterday. There are no constant changes to the
>ISR list.
>
>Is there a way to force the leader to update ZK with the latest ISR list?
>
>Thanks,
>Shone
>
>Logs:
>
>cat server.log | grep "\[t1,33\]"
>
>[2014-04-18 10:16:32,814] INFO [ReplicaFetcherManager on broker 1] Removing
>fetcher for partition [t1,33] (kafka.server.ReplicaFetcherManager)
>[2014-05-13 19:42:10,784] ERROR [KafkaApi-1] Error when processing fetch
>request for partition [t1,33] offset 330118156 from consumer with
>correlation id 0 (kafka.server.KafkaApis)
>[2014-05-14 11:02:25,255] ERROR [KafkaApi-1] Error when processing fetch
>request for partition [t1,33] offset 332896470 from consumer with
>correlation id 0 (kafka.server.KafkaApis)
>[2014-05-16 12:00:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
>for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
>[2014-05-16 12:00:18,009] INFO Partition [t1,33] on broker 1: Cached
>zkVersion [17] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>[2014-05-16 13:33:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
>for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
>[2014-05-16 13:33:12,403] INFO Partition [t1,33] on broker 1: Cached
>zkVersion [17] not equal to that in zookeeper, skip updating ISR
>(kafka.cluster.Partition)
>
>
>On Sat, May 17, 2014 at 11:44 AM, Jun Rao <jun...@gmail.com> wrote:
>
>> Do you see constant ISR shrinking/expansion of those two partitions in the
>> leader broker's log?
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Fri, May 16, 2014 at 4:25 PM, Paul Mackles <pmack...@adobe.com> wrote:
>>
>> > Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little behind in
>> > upgrading).
>> >
>> > From what I can tell, connectivity to ZK was lost for a brief period. The
>> > cluster seemed to recover OK except that we now have 2 (out of 125)
>> > partitions where the ISR appears to be out of date. In other words,
>> > kafka-list-topic is showing only one replica in the ISR for the 2
>> > partitions in question (there should be 3).
>> >
>> > What's odd is that, looking at the log segments for those partitions on
>> > the file system, I can see that they are in fact getting updated and by all
>> > measures look to be in sync. I can also see that the brokers where the
>> > out-of-sync replicas reside are doing fine and leading other partitions
>> > like nothing ever happened. Based on that, it seems like the ISR in ZK is
>> > just out of date due to a botched recovery from the brief ZK outage.
>> >
>> > Has anyone seen anything like this before? I saw this ticket, which sounded
>> > similar:
>> >
>> > https://issues.apache.org/jira/browse/KAFKA-948
>> >
>> > Anyone have any suggestions for recovering from this state? I was thinking
>> > of running the preferred-replica-election tool next to see if that gets the
>> > ISRs in ZK back in sync.
>> >
>> > After that, I guess the next step would be to bounce the kafka servers in
>> > question.
>> >
>> > Thanks,
>> > Paul
>> >
>> >
>>
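For completeness, if we do try the preferred-replica-election route mentioned above, my understanding of the 0.8 tooling is that it would look roughly like this, limited to the affected partitions rather than run cluster-wide (zkhost:2181 and the file name are placeholders, and the JSON would also list the second affected partition):

cat affected-partitions.json
{"partitions": [{"topic": "t1", "partition": 33}]}

bin/kafka-preferred-replica-election.sh --zookeeper zkhost:2181 --path-to-json-file affected-partitions.json

Afterwards we would re-run kafka-list-topic against the same ZK connection to see whether the ISR for those partitions comes back as 1,2,3 before resorting to bouncing the brokers.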