Alexis,

Hmm, this looks like a bug in the Kafka brokers, since your log message refers to a topic that was deleted months ago, which indicates the topic was not deleted cleanly. Could you file a JIRA with the server logs for further investigation?
Guozhang

On Tue, Apr 5, 2016 at 10:02 PM, Alexis Midon <alexis.mi...@airbnb.com.invalid> wrote:
> I ran into the same issue today. In a production cluster, I noticed the
> "Shrinking ISR for partition" log messages for a topic deleted 2 months
> ago. Our staging cluster shows the same messages for all the topics
> deleted in that cluster. Both clusters are on 0.8.2.
>
> Yifan, Guozhang, did you find a way to get rid of them?
>
> thanks in advance,
> alexis
>
> On Tue, Apr 5, 2016 at 4:16 PM Guozhang Wang <wangg...@gmail.com> wrote:
>> It is possible; there is some discussion of a similar issue in this KIP:
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-53+-+Add+custom+policies+for+reconnect+attempts+to+NetworkdClient
>>
>> and in this mailing thread:
>>
>> https://www.mail-archive.com/dev@kafka.apache.org/msg46868.html
>>
>> Guozhang
>>
>> On Tue, Apr 5, 2016 at 2:34 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>> Some updates:
>>>
>>> Yesterday, right after a release (producers and consumers reconnected
>>> to Kafka/Zookeeper, but with no code change in our producers and
>>> consumers), all under-replication issues resolved automatically and
>>> there was no more high latency in either Kafka or Zookeeper. But right
>>> after today's release (producers and consumers reconnected again), the
>>> under-replication and high-latency issues happened again. So could the
>>> all-at-once reconnecting of producers and consumers be causing the
>>> problem? And all of this has only happened since I deleted a deprecated
>>> topic in production.
>>>
>>> Yifan
>>>
>>> On Tue, Apr 5, 2016 at 9:04 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>> These configs mainly depend on your publish throughput, since the
>>>> replication throughput is upper-bounded by the publish throughput. If
>>>> the publish throughput is not high, then setting lower threshold
>>>> values in these two configs will cause churn from shrinking /
>>>> expanding ISRs.
>>>>
>>>> Guozhang
>>>>
>>>> On Mon, Apr 4, 2016 at 11:55 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>>>> Thanks for replying, Guozhang. We did increase both settings:
>>>>>
>>>>> replica.lag.max.messages=20000
>>>>> replica.lag.time.max.ms=20000
>>>>>
>>>>> But I'm not sure if these are good enough. And yes, that's a good
>>>>> suggestion to monitor ZK performance.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Mon, Apr 4, 2016 at 8:58 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>> Hmm, it seems like your broker configs "replica.lag.max.messages"
>>>>>> and "replica.lag.time.max.ms" are misconfigured for your replication
>>>>>> traffic, and the deletion of the topic actually pushed it below the
>>>>>> threshold. What are the values of these two configs? And could you
>>>>>> try increasing them and see if that helps?
>>>>>>
>>>>>> In 0.8.2.1, kafka-consumer-offset-checker.sh accesses ZK to query
>>>>>> the consumer offsets one by one, so if your ZK read latency is high
>>>>>> it can take a long time. You may want to monitor your ZK cluster
>>>>>> performance to check its read / write latencies.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Mon, Apr 4, 2016 at 10:59 AM, Yifan Ying <nafan...@gmail.com> wrote:
>>>>>>> Hi Guozhang,
>>>>>>>
>>>>>>> It's 0.8.2.1. So it should be fixed? We also tried to start from
>>>>>>> scratch by wiping out the data directory on both Kafka and
>>>>>>> Zookeeper. And it's odd that the constant shrinking and expanding
>>>>>>> happened after the fresh restart, with high request latency as
>>>>>>> well. The brokers are using the same config as before the topic
>>>>>>> deletion.
>>>>>>>
>>>>>>> Another observation is that running
>>>>>>> kafka-consumer-offset-checker.sh is extremely slow. Any suggestion
>>>>>>> would be appreciated! Thanks.
>>>>>>>
>>>>>>> On Sun, Apr 3, 2016 at 2:29 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>> Yifan,
>>>>>>>>
>>>>>>>> Are you on 0.8.0 or 0.8.1/2? There are some issues with zkVersion
>>>>>>>> checking in 0.8.0 that are fixed in later minor releases of 0.8.
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 7:46 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> We deleted a deprecated topic on a Kafka cluster (0.8) and
>>>>>>>>> started observing constant 'Expanding ISR for partition' and
>>>>>>>>> 'Shrinking ISR for partition' messages for other topics. As a
>>>>>>>>> result we saw a huge number of under-replicated partitions and
>>>>>>>>> very high request latency from Kafka. And it doesn't seem able
>>>>>>>>> to recover by itself.
>>>>>>>>>
>>>>>>>>> Does anyone know what caused this issue and how to resolve it?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yifan
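[Editor's note] For reference, the two broker settings discussed in this thread live in the broker's server.properties. A minimal sketch using the values Yifan quoted (these are the 0.8.x config names; replica.lag.max.messages was later removed in 0.9.0, where replica.lag.time.max.ms alone governs ISR membership):

```properties
# A follower is dropped from the ISR if it falls more than this many
# messages behind the leader (0.8.x only; removed in 0.9.0).
replica.lag.max.messages=20000

# A follower is dropped from the ISR if it has not fetched from the
# leader within this many milliseconds.
replica.lag.time.max.ms=20000
```

To watch whether the ISR churn settles after changing these values, the partitions currently out of sync can be listed with `kafka-topics.sh --zookeeper <zk-host:port> --describe --under-replicated-partitions`.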