The value of under-replicated partitions is 0 across the cluster.

Thanks,
Shone
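(For reference, one way to read that metric on each broker is Kafka's
bundled JmxTool. The broker host and JMX port below are placeholders,
and on 0.8.0-beta1 the MBean may still use the older quoted naming
style, so treat the object name as an assumption to verify against your
brokers:)

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
  --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions'

JmxTool polls on an interval, so Ctrl-C once you have a reading. A value
of 0 on every broker, as reported above, suggests the replicas really
are in sync and the stale ISR lives only in ZooKeeper.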
On Mon, May 19, 2014 at 12:23 AM, Jun Rao <jun...@gmail.com> wrote:

> What's the value of under-replicated partitions JMX in each broker?
>
> Thanks,
>
> Jun
>
> On Sat, May 17, 2014 at 6:16 PM, Paul Mackles <pmack...@adobe.com> wrote:
>
> > Today we did a rolling restart of ZK. We also restarted the Kafka
> > controller and ISRs are still not being updated in ZK. Again, the
> > cluster seems fine and the replicas in question do appear to be
> > getting updated. I am guessing there must be some bad state
> > persisted in ZK.
> >
> > On 5/17/14 7:50 PM, "Shone Sadler" <shone.sad...@gmail.com> wrote:
> >
> > > Hi Jun,
> > >
> > > I work with Paul and am monitoring the cluster as well. The status
> > > has not changed.
> > >
> > > When we execute kafka-list-topic we see the following (showing one
> > > of the two partitions having the problem):
> > >
> > > topic: t1  partition: 33  leader: 1  replicas: 1,2,3  isr: 1
> > >
> > > When inspecting the logs of the leader, I do see a spurt of ISR
> > > shrinkage/expansion around the time that the brokers were
> > > partitioned from ZK, but nothing past the last message "Cached
> > > zkVersion [17] not equal to that in zookeeper" from yesterday.
> > > There are no constant changes to the ISR list.
> > >
> > > Is there a way to force the leader to update ZK with the latest
> > > ISR list?
> > >
> > > Thanks,
> > > Shone
> > >
> > > Logs:
> > >
> > > cat server.log | grep "\[t1,33\]"
> > >
> > > [2014-04-18 10:16:32,814] INFO [ReplicaFetcherManager on broker 1]
> > > Removing fetcher for partition [t1,33]
> > > (kafka.server.ReplicaFetcherManager)
> > > [2014-05-13 19:42:10,784] ERROR [KafkaApi-1] Error when processing
> > > fetch request for partition [t1,33] offset 330118156 from consumer
> > > with correlation id 0 (kafka.server.KafkaApis)
> > > [2014-05-14 11:02:25,255] ERROR [KafkaApi-1] Error when processing
> > > fetch request for partition [t1,33] offset 332896470 from consumer
> > > with correlation id 0 (kafka.server.KafkaApis)
> > > [2014-05-16 12:00:11,344] INFO Partition [t1,33] on broker 1:
> > > Shrinking ISR for partition [t1,33] from 3,1,2 to 1
> > > (kafka.cluster.Partition)
> > > [2014-05-16 12:00:18,009] INFO Partition [t1,33] on broker 1:
> > > Cached zkVersion [17] not equal to that in zookeeper, skip
> > > updating ISR (kafka.cluster.Partition)
> > > [2014-05-16 13:33:11,344] INFO Partition [t1,33] on broker 1:
> > > Shrinking ISR for partition [t1,33] from 3,1,2 to 1
> > > (kafka.cluster.Partition)
> > > [2014-05-16 13:33:12,403] INFO Partition [t1,33] on broker 1:
> > > Cached zkVersion [17] not equal to that in zookeeper, skip
> > > updating ISR (kafka.cluster.Partition)
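(One way to see exactly what ZooKeeper holds for the affected partition,
and the version the broker's cache is compared against, is to read the
partition state znode directly with ZooKeeper's own CLI. The path below
assumes the standard 0.8 znode layout and no chroot prefix:)

bin/zkCli.sh -server zk1:2181 get /brokers/topics/t1/partitions/33/state

The payload should be JSON along the lines of
{"controller_epoch":N,"leader":1,"leader_epoch":N,"isr":[1],"version":1},
and the dataVersion in the accompanying stat output is the zkVersion
that the "Cached zkVersion [17] not equal" log message refers to.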
> > > On Sat, May 17, 2014 at 11:44 AM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > >> Do you see constant ISR shrinking/expansion of those two
> > >> partitions in the leader broker's log?
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >> On Fri, May 16, 2014 at 4:25 PM, Paul Mackles
> > >> <pmack...@adobe.com> wrote:
> > >>
> > >> > Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little
> > >> > behind in upgrading).
> > >> >
> > >> > From what I can tell, connectivity to ZK was lost for a brief
> > >> > period. The cluster seemed to recover OK except that we now
> > >> > have 2 (out of 125) partitions where the ISR appears to be out
> > >> > of date. In other words, kafka-list-topic is showing only one
> > >> > replica in the ISR for the 2 partitions in question (there
> > >> > should be 3).
> > >> >
> > >> > What's odd is that in looking at the log segments for those
> > >> > partitions on the file system, I can see that they are in fact
> > >> > getting updated and by all measures look to be in sync. I can
> > >> > also see that the brokers where the out-of-sync replicas reside
> > >> > are doing fine and leading other partitions like nothing ever
> > >> > happened. Based on that, it seems like the ISR in ZK is just
> > >> > out of date due to a botched recovery from the brief ZK outage.
> > >> >
> > >> > Has anyone seen anything like this before? I saw this ticket,
> > >> > which sounded similar:
> > >> >
> > >> > https://issues.apache.org/jira/browse/KAFKA-948
> > >> >
> > >> > Anyone have any suggestions for recovering from this state? I
> > >> > was thinking of running the preferred-replica-election tool
> > >> > next to see if that gets the ISRs in ZK back in sync.
> > >> >
> > >> > After that, I guess the next step would be to bounce the Kafka
> > >> > servers in question.
> > >> >
> > >> > Thanks,
> > >> > Paul
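(If you do try the preferred-replica-election tool, it can be scoped to
just the affected partitions rather than the whole cluster; the JSON
file name below is arbitrary and the flags assume the 0.8 version of
the tool:)

cat > partitions.json <<'EOF'
{"partitions": [{"topic": "t1", "partition": 33}]}
EOF

bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
  --path-to-json-file partitions.json

Note this only asks the controller to move leadership back to the
preferred replica; whether the resulting LeaderAndIsr update also
replaces the stale ISR in ZK is exactly what a test like this would
confirm.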