Guozhang, thanks for these links.

Hi Alexis, as Guozhang said, your case seems different from ours. In our case, deleting one topic caused constant ISR shrinking/expanding for other topics.
Yifan

On Tue, Apr 5, 2016 at 10:02 PM, Alexis Midon <alexis.mi...@airbnb.com> wrote:

> I ran into the same issue today. In a production cluster, I noticed the
> "Shrinking ISR for partition" log messages for a topic deleted 2 months
> ago.
> Our staging cluster shows the same messages for all the topics deleted in
> that cluster.
> Both 0.8.2
>
> Yifan, Guozhang, did you find a way to get rid of them?
>
> thanks in advance,
> alexis
>
> On Tue, Apr 5, 2016 at 4:16 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
>> It is possible, there are some discussions about a similar issue in KIP:
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-53+-+Add+custom+policies+for+reconnect+attempts+to+NetworkdClient
>>
>> mailing thread:
>>
>> https://www.mail-archive.com/dev@kafka.apache.org/msg46868.html
>>
>> Guozhang
>>
>> On Tue, Apr 5, 2016 at 2:34 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>
>> > Some updates:
>> >
>> > Yesterday, right after release (producers and consumers reconnected to
>> > Kafka/Zookeeper, but no code change in our producers and consumers), all
>> > under replication issues were resolved automatically and no more high
>> > latency in both Kafka and Zookeeper. But right after today's
>> > release(producers and consumers re-connected again), the under replication
>> > and high latency issue happened again. So the all-at-once reconnecting from
>> > producers and consumers would cause the problem? And all these only
>> > happened since I deleted a deprecated topic in production.
>> >
>> > Yifan
>> >
>> > On Tue, Apr 5, 2016 at 9:04 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>> >
>> >> These configs are mainly dependent on your publish throughput, since the
>> >> replication throughput is higher bounded by the publish throughput. If the
>> >> publish throughput is not high, then setting a lower threshold values in
>> >> these two configs will cause churns in shrinking / expanding ISRs.
>> >>
>> >> Guozhang
>> >>
>> >> On Mon, Apr 4, 2016 at 11:55 PM, Yifan Ying <nafan...@gmail.com> wrote:
>> >>
>> >>> Thanks for replying, Guozhang. We did increase both settings:
>> >>>
>> >>> replica.lag.max.messages=20000
>> >>>
>> >>> replica.lag.time.max.ms=20000
>> >>>
>> >>> But no sure if these are good enough. And yes, that's a good suggestion
>> >>> to monitor ZK performance.
>> >>>
>> >>> Thanks.
>> >>>
>> >>> On Mon, Apr 4, 2016 at 8:58 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>> >>>
>> >>>> Hmm, it seems like your broker config "replica.lag.max.messages" and
>> >>>> "replica.lag.time.max.ms" is mis-configed regarding your replication
>> >>>> traffic, and the deletion of the topic actually makes it below the
>> >>>> threshold. What are the config values for these two? And could you try to
>> >>>> increase these configs and see if that helps?
>> >>>>
>> >>>> In 0.8.2.1 Kafka-consumer-offset-checker.sh access ZK to query the
>> >>>> consumer offsets one-by-one, and hence if your ZK read latency is high it
>> >>>> could take long time. You may want to monitor your ZK cluster performance
>> >>>> to check its read / write latencies.
>> >>>>
>> >>>> Guozhang
>> >>>>
>> >>>> On Mon, Apr 4, 2016 at 10:59 AM, Yifan Ying <nafan...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi Guozhang,
>> >>>>>
>> >>>>> It's 0.8.2.1. So it should be fixed? We also tried to start from
>> >>>>> scratch by wiping out the data directory on both Kafka and Zookeeper. And
>> >>>>> it's odd that the constant shrinking and expanding happened after fresh
>> >>>>> restart, and high request latency as well. The brokers are using the same
>> >>>>> config before topic deletion.
>> >>>>>
>> >>>>> Another observation is that, using the
>> >>>>> Kafka-consumer-offset-checker.sh is extremely slow. Any suggestion would be
>> >>>>> appreciated! Thanks.
>> >>>>>
>> >>>>> On Sun, Apr 3, 2016 at 2:29 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Yifan,
>> >>>>>>
>> >>>>>> Are you on 0.8.0 or 0.8.1/2? There are some issues with zkVersion
>> >>>>>> checking in 0.8.0 that are fixed in later minor releases of 0.8.
>> >>>>>>
>> >>>>>> Guozhang
>> >>>>>>
>> >>>>>> On Fri, Apr 1, 2016 at 7:46 PM, Yifan Ying <nafan...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> > Hi All,
>> >>>>>> >
>> >>>>>> > We deleted a deprecated topic on Kafka cluster(0.8) and started observing
>> >>>>>> > constant 'Expanding ISR for partition' and 'Shrinking ISR for partition'
>> >>>>>> > for other topics. As a result we saw a huge number of under replicated
>> >>>>>> > partitions and very high request latency from Kafka. And it doesn't seem
>> >>>>>> > able to recover itself.
>> >>>>>> >
>> >>>>>> > Anyone knows what caused this issue and how to resolve it?
>> >>>>>> >
>> >>>>>>
>> >>>>>> --
>> >>>>>> -- Guozhang
>> >>>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Yifan
>> >>>>>
>> >>>>
>> >>>> --
>> >>>> -- Guozhang
>> >>>>
>> >>>
>> >>> --
>> >>> Yifan
>> >>>
>> >>
>> >> --
>> >> -- Guozhang
>> >>
>> >
>> > --
>> > Yifan
>> >
>>
>> --
>> -- Guozhang
>>
>

--
Yifan
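
For readers following the thread, here is a rough sketch of how the two broker settings discussed above (replica.lag.max.messages and replica.lag.time.max.ms) are generally understood to drive ISR shrinking in 0.8.x: the leader drops a follower that has not fetched within the time limit, or that trails the leader's log end offset by more than the message limit. This is a simplified model written in Python, not Kafka's actual code. The Follower class and the out_of_sync function are hypothetical names made up for illustration, and the thresholds are simply the values Yifan reported.

```python
from dataclasses import dataclass

# Threshold values reported in this thread (not recommendations).
REPLICA_LAG_MAX_MESSAGES = 20000
REPLICA_LAG_TIME_MAX_MS = 20000


@dataclass
class Follower:
    """Hypothetical view of a follower replica as seen by the partition leader."""
    broker_id: int
    log_end_offset: int  # last offset the follower has replicated
    last_fetch_ms: int   # time of the follower's most recent fetch request


def out_of_sync(leader_log_end_offset, followers, now_ms,
                max_messages=REPLICA_LAG_MAX_MESSAGES,
                max_time_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids this simplified model would drop from the ISR."""
    dropped = []
    for f in followers:
        # "Stuck": no fetch request within the time limit.
        stuck = now_ms - f.last_fetch_ms > max_time_ms
        # "Slow": trailing the leader by more than the message limit.
        slow = leader_log_end_offset - f.log_end_offset > max_messages
        if stuck or slow:
            dropped.append(f.broker_id)
    return dropped
```

Under this model, a follower hovering near either threshold will alternate between being dropped and re-added, which would match the repeated "Shrinking ISR" / "Expanding ISR" messages. With a leader at offset 100000 and now_ms=1000000, a follower at offset 70000 is flagged as slow, and one whose last fetch was at 975000 ms is flagged as stuck.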