Hi Christiane,

Thanks for the email. That looks like https://issues.apache.org/jira/browse/KAFKA-5600
Ismael

On Mon, Aug 7, 2017 at 7:04 PM, Christiane Lemke <christiane.le...@gmail.com> wrote:

> Hi all,
>
> we are fighting with offset rewinds of seemingly random size, hitting
> seemingly random partitions, whenever any node in our Kafka cluster is
> restarted. We are running out of ideas - any help or pointers to things
> worth investigating would be highly appreciated.
>
> Our Kafka setup is dual data center, with two local broker clusters
> (3 nodes each) and two aggregate broker clusters (5 nodes each), the
> latter running MirrorMaker to consume messages from the local cluster.
>
> The issues seem to have appeared since we upgraded from 0.10.1.0 to
> 0.11, but we are not entirely sure it is related.
>
> Our first theory was that too big a consumer offsets topic (we use
> compaction for it) was causing the issues on restart, and indeed we
> found that the cleaner threads had died after the upgrade. But
> restarting the brokers and cleaning this topic did not help.
>
> The logs are pretty quiet when it happens. Before we cleaned the
> consumer offsets topic we got a few of these every time it happened,
> but no longer now:
>
> [2017-08-04 11:19:25,970] ERROR [Group Metadata Manager on Broker 472]: Error loading offsets from __consumer_offsets-14 (kafka.coordinator.group.GroupMetadataManager)
> java.lang.IllegalStateException: Unexpected unload of active group tns-ticket-store-b144c9d1-425a-4b90-8310-f6e886741494 while loading partition __consumer_offsets-14
>         at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:600)
>         at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:595)
>         at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>         at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:595)
>         at kafka.coordinator.group.GroupMetadataManager.kafka$coordinator$group$GroupMetadataManager$$doLoadGroupsAndOffsets$1(GroupMetadataManager.scala:455)
>         at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsForPartition$1.apply$mcV$sp(GroupMetadataManager.scala:441)
>         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
>
> Does this seem familiar to anyone? Any suggestions as to what to look
> into to investigate this issue further? Happy to give more details
> about anything that might be helpful.
>
> Thanks a lot in advance,
>
> Christiane