Ok. That does indicate the ISR should include all replicas. Which version of the ZK server are you using? Could you check the ZK server log to see if the ISR is being updated?
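You could also dump what ZooKeeper currently has stored for the partition. As a rough sketch, assuming the standard 0.8 znode layout and a zkCli.sh from your ZooKeeper install (hostname/port below are placeholders for your ensemble):

  # connect to the same ensemble the brokers use
  bin/zkCli.sh -server zkhost:2181

  # dump the partition state znode; the JSON should include the "isr" list,
  # and the stat output shows the znode version the leader is comparing against
  get /brokers/topics/t1/partitions/33/state

That should show the ISR and zkVersion that ZooKeeper currently holds for [t1,33].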
Thanks,

Jun


On Mon, May 19, 2014 at 1:30 AM, Shone Sadler <shone.sad...@gmail.com> wrote:

> The value of under replicated partitions is 0 across the cluster.
>
> Thanks,
> Shone
>
>
> On Mon, May 19, 2014 at 12:23 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > What's the value of the under replicated partitions JMX in each broker?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Sat, May 17, 2014 at 6:16 PM, Paul Mackles <pmack...@adobe.com> wrote:
> >
> > > Today we did a rolling restart of ZK. We also restarted the kafka
> > > controller and ISRs are still not being updated in ZK. Again, the cluster
> > > seems fine and the replicas in question do appear to be getting updated.
> > > I am guessing there must be some bad state persisted in ZK.
> > >
> > > On 5/17/14 7:50 PM, "Shone Sadler" <shone.sad...@gmail.com> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I work with Paul and am monitoring the cluster as well. The status has
> > > > not changed.
> > > >
> > > > When we execute kafka-list-topic we are seeing the following (showing
> > > > one of the two partitions having the problem):
> > > >
> > > > topic: t1 partition: 33 leader: 1 replicas: 1,2,3 isr: 1
> > > >
> > > > When inspecting the logs of the leader, I do see a spurt of ISR
> > > > shrinkage/expansion around the time that the brokers were partitioned
> > > > from ZK, but nothing past the last message "Cached zkVersion [17] not
> > > > equal to that in zookeeper." from yesterday. There are not constant
> > > > changes to the ISR list.
> > > >
> > > > Is there a way to force the leader to update ZK with the latest ISR list?
> > > >
> > > > Thanks,
> > > > Shone
> > > >
> > > > Logs:
> > > >
> > > > cat server.log | grep "\[t1,33\]"
> > > >
> > > > [2014-04-18 10:16:32,814] INFO [ReplicaFetcherManager on broker 1] Removing
> > > > fetcher for partition [t1,33] (kafka.server.ReplicaFetcherManager)
> > > > [2014-05-13 19:42:10,784] ERROR [KafkaApi-1] Error when processing fetch
> > > > request for partition [t1,33] offset 330118156 from consumer with
> > > > correlation id 0 (kafka.server.KafkaApis)
> > > > [2014-05-14 11:02:25,255] ERROR [KafkaApi-1] Error when processing fetch
> > > > request for partition [t1,33] offset 332896470 from consumer with
> > > > correlation id 0 (kafka.server.KafkaApis)
> > > > [2014-05-16 12:00:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
> > > > for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
> > > > [2014-05-16 12:00:18,009] INFO Partition [t1,33] on broker 1: Cached
> > > > zkVersion [17] not equal to that in zookeeper, skip updating ISR
> > > > (kafka.cluster.Partition)
> > > > [2014-05-16 13:33:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
> > > > for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
> > > > [2014-05-16 13:33:12,403] INFO Partition [t1,33] on broker 1: Cached
> > > > zkVersion [17] not equal to that in zookeeper, skip updating ISR
> > > > (kafka.cluster.Partition)
> > > >
> > > >
> > > > On Sat, May 17, 2014 at 11:44 AM, Jun Rao <jun...@gmail.com> wrote:
> > > >
> > > > > Do you see constant ISR shrinking/expansion of those two partitions in
> > > > > the leader broker's log?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Fri, May 16, 2014 at 4:25 PM, Paul Mackles <pmack...@adobe.com> wrote:
> > > > >
> > > > > > Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little behind
> > > > > > in upgrading).
> > > > > >
> > > > > > From what I can tell, connectivity to ZK was lost for a brief period.
> > > > > > The cluster seemed to recover OK except that we now have 2 (out of 125)
> > > > > > partitions where the ISR appears to be out of date. In other words,
> > > > > > kafka-list-topic is showing only one replica in the ISR for the 2
> > > > > > partitions in question (there should be 3).
> > > > > >
> > > > > > What's odd is that in looking at the log segments for those partitions
> > > > > > on the file system, I can see that they are in fact getting updated and
> > > > > > by all measures look to be in sync. I can also see that the brokers where
> > > > > > the out-of-sync replicas reside are doing fine and leading other
> > > > > > partitions like nothing ever happened. Based on that, it seems like the
> > > > > > ISR in ZK is just out-of-date due to a botched recovery from the brief
> > > > > > ZK outage.
> > > > > >
> > > > > > Has anyone seen anything like this before? I saw this ticket, which
> > > > > > sounded similar:
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/KAFKA-948
> > > > > >
> > > > > > Anyone have any suggestions for recovering from this state? I was
> > > > > > thinking of running the preferred-replica-election tool next to see if
> > > > > > that gets the ISRs in ZK back in sync.
> > > > > >
> > > > > > After that, I guess the next step would be to bounce the kafka servers
> > > > > > in question.
> > > > > >
> > > > > > Thanks,
> > > > > > Paul
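For reference, a minimal sketch of running the preferred-replica-election tool against just the affected partitions (assuming the 0.8 script name; the hostname and JSON file below are illustrative placeholders) would look something like:

  # list only the partitions whose ISR looks stale
  cat > /tmp/election.json <<'EOF'
  {"partitions": [{"topic": "t1", "partition": 33}]}
  EOF

  # run from the Kafka installation directory, pointing at the same ZK ensemble the brokers use
  bin/kafka-preferred-replica-election.sh --zookeeper zkhost:2181 --path-to-json-file /tmp/election.json

Omitting --path-to-json-file makes the tool act on all partitions.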