Hi all, we are fighting with offset rewinds of seemingly random size, hitting seemingly random partitions, whenever we restart any node in our Kafka cluster. We are running out of ideas - any help or pointers to things to investigate would be highly appreciated.
Our Kafka setup spans two data centers, with two local broker clusters (3 nodes each) and two aggregate broker clusters (5 nodes each); the aggregate clusters run MirrorMaker to consume messages from the local clusters. The issues seem to have appeared since we upgraded from 0.10.1.0 to 0.11, but we are not entirely sure that is related.

Our first theory was that the consumer offsets topic had grown too large (we use compaction for it) and was causing the issues on restart, and indeed we found that the log cleaner threads had died after the upgrade. But restarting and cleaning this topic did not help. The logs are pretty quiet when it happens; before we cleaned the consumer offsets topic we got a few of these every time it happened, but no longer:

[2017-08-04 11:19:25,970] ERROR [Group Metadata Manager on Broker 472]: Error loading offsets from __consumer_offsets-14 (kafka.coordinator.group.GroupMetadataManager)
java.lang.IllegalStateException: Unexpected unload of active group tns-ticket-store-b144c9d1-425a-4b90-8310-f6e886741494 while loading partition __consumer_offsets-14
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:600)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:595)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
        at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:595)
        at kafka.coordinator.group.GroupMetadataManager.kafka$coordinator$group$GroupMetadataManager$$doLoadGroupsAndOffsets$1(GroupMetadataManager.scala:455)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsForPartition$1.apply$mcV$sp(GroupMetadataManager.scala:441)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Does this seem familiar to anyone? Does anyone have suggestions on what to look into more closely to investigate this? Happy to give more details about anything that might be helpful.

Thanks a lot in advance,
Christiane
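
P.S. In case concrete commands help anyone trying to reproduce or diagnose this: something along these lines should make the rewinds and the cleaner state visible with the standard Kafka tooling. The group name, host/ports, and log path below are just placeholders, not from our setup.

    # Watch committed offsets for a group; a rewind shows up as CURRENT-OFFSET jumping backwards on some partitions
    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group example-group

    # Check whether the log cleaner threads are alive and making progress on a broker
    # (path depends on the log4j setup; by default it sits in the Kafka logs directory)
    tail -f /path/to/kafka/logs/log-cleaner.log

    # Show any per-topic config overrides on the offsets topic (cleanup.policy, segment sizes, etc.)
    bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name __consumer_offsets --describe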